Opened 14 years ago

Closed 14 years ago

Last modified 13 years ago

#636 closed defect (fixed)

irregex behaves differently with overlapping character sets on some machines

Reported by: sjamaan
Owned by: felix winkelmann
Priority: critical
Milestone: 4.9.0
Component: core libraries
Version: 4.7.x
Keywords: irregex
Cc:
Estimated difficulty:

Description

We haven't figured out the cause or anything else yet, but on my NetBSD box, I get the following (expected and correct output):

#;1> (use irregex)
; loading library irregex ...
#;2> (irregex-match 
        '(seq (* white) (submatch (+ print)) 
              (* white) ":" (+ white)
              (submatch (+ print)) (* white))
     "io:event-dispatch")
#f

On call-cc.org (and on another of Mario's machines which is x86-64 Linux) we get this:

#;1> (use irregex)
; loading library irregex ...
#;2> (irregex-match 
        '(seq (* white) (submatch (+ print)) 
              (* white) ":" (+ white)
              (submatch (+ print)) (* white))
     "io:event-dispatch")
#<regexp-match (2 submatches)>
#;3> (irregex-match-substring #2 1)
"i"
#;4> (irregex-match-substring #2 2)
#f

Change History (9)

comment:1 by Jim Ursetto, 14 years ago

I get the correct output with 4.7.0 on macosx-unix-gnu-x86-64.

comment:2 by sjamaan, 14 years ago

It's also "wrong" with irregex upstream but that's because the regex itself is wrong; it should be "graphic", not "print". "print" includes whitespace!

So this is a bug in qwiki (and I'll fix it asap), but the inconsistent output on the same input remains unexplained.
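To illustrate the point about "graphic" versus "print", here is a minimal sketch of the same SRE with the character class swapped. The name key-value-sre and the second test string are made up for illustration; (use irregex) is the CHICKEN 4 form used in the transcripts above.

```scheme
(use irregex)  ; CHICKEN 4 syntax, as in the transcripts above

;; Same SRE as in the report, with `graphic` in place of `print`,
;; so the submatches can no longer absorb whitespace.
;; `key-value-sre` is a hypothetical name, not taken from qwiki.
(define key-value-sre
  '(seq (* white) (submatch (+ graphic))
        (* white) ":" (+ white)
        (submatch (+ graphic)) (* white)))

;; No whitespace follows the colon here, so (+ white) has nothing to match.
(print (irregex-match key-value-sre "io:event-dispatch"))

;; With a space after the colon the pattern has a chance to match;
;; print both submatches if it does.
(let ((m (irregex-match key-value-sre "io: event-dispatch")))
  (when m
    (print (irregex-match-substring m 1))
    (print (irregex-match-substring m 2))))
```

With "graphic" the pattern is no longer ambiguous about where whitespace belongs, which is why the qwiki fix sidesteps the inconsistency without explaining it.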

comment:3 by sjamaan, 14 years ago

Summary: irregex behaves different on some machines → irregex behaves differently with overlapping character sets on some machines

comment:4 by sjamaan, 14 years ago

So the current status is: irregex is okay but in Chicken it sometimes produces different output with the same input if the input is ambiguous.

comment:5 by Alex Shinn, 14 years ago

A recent commit seems to have introduced a bug in the char-set handling.

Going back to version 59, it works, provided you give it a regexp that actually matches: in this case, you need to replace the (+ white) with (* white) or add some whitespace after the colon.

comment:6 by sjamaan, 14 years ago

The bug appears to be in the nfa->dfa conversion, since the NFA resulting from this SRE is exactly the same before and after the breaking change.

I've managed to reduce it further to an even simpler match, which gets compiled to this DFA:

#;2> (pp (nfa->dfa (sre->nfa (quote (seq (+ print) white)) ~none)))
#((#f (#((#\tab . #\newline) (#\space . #\~)) . 1))
  (#f (#((#\tab . #\tab) (#\! . #\~)) . 2) (#\space . 2) (#\newline . 2))
  (#t (#((#\tab . #\tab) (#\! . #\~)) . 2) (#\space . 2) (#\newline . 2)))

It should look something like this (the accepting state should be left when encountering a non-whitespace character):

#((#f (#((#\tab . #\newline) (#\space . #\~)) . 1))
  (#f (#((#\space . #\~)) . 2) (#\tab . 2) (#\newline . 2))
  (#t (#((#\! . #\~)) . 1) (#\space . 2) (#\tab . 2) (#\newline . 2)))

It's most likely in the hairy nfa-join-transitions! procedure. I don't understand it anymore so I'll probably just end up rewriting it ;)
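The intended behaviour can be sketched with a hand-written DFA for (seq (+ print) white). This is an illustration only: it does not use irregex's internal table format, the procedure names are hypothetical, and the character classes are ASCII approximations read off the ranges in the dump above.

```scheme
;; Hand-written DFA for (seq (+ print) white), whole-string match.
;; States: 0 = start, 1 = seen one or more printable chars,
;;         2 = accepting (the last char read was whitespace).
;; All names below are hypothetical; this is not irregex code.

(define (white? c)
  (and (memv c '(#\space #\tab #\newline)) #t))

(define (print? c)
  (or (white? c)
      (and (char<=? #\! c) (char<=? c #\~))))

;; Return the next state, or #f when there is no transition.
(define (step state c)
  (cond ((not (print? c)) #f)   ; outside `print`: dead
        ((= state 0) 1)         ; first char is consumed by (+ print)
        ((white? c) 2)          ; whitespace may be the final `white`
        (else 1)))              ; non-whitespace leaves the accepting state

(define (dfa-match? str)
  (let loop ((state 0) (chars (string->list str)))
    (cond ((not state) #f)
          ((null? chars) (= state 2))
          (else (loop (step state (car chars)) (cdr chars))))))
```

In this sketch (dfa-match? "ab ") is true and (dfa-match? "a b") is false. The broken table above keeps state 2 on the range ! to ~, so it would stay in the accepting state after the trailing "b".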

comment:7 by sjamaan, 14 years ago

Owner: set to felix winkelmann
Status: new → assigned

This bug was rather evil but easy to fix (after hammering on it for three days to pinpoint the cause).

Felix: Could you please review and merge 7d0520e145bc251abe2499af8149ff271ede6b29 from irregex-bugfixes to master?

comment:8 by felix winkelmann, 14 years ago

Resolution: fixed
Status: assigned → closed

Has been merged. Note that master is currently broken (expander/transformer issue).

comment:9 by felix winkelmann, 13 years ago

Milestone: 4.8.0 → 4.9.0

Milestone 4.8.0 deleted
