#636 closed defect (fixed)
irregex behaves differently with overlapping character sets on some machines
Reported by: | sjamaan | Owned by: | felix winkelmann |
---|---|---|---|
Priority: | critical | Milestone: | 4.9.0 |
Component: | core libraries | Version: | 4.7.x |
Keywords: | irregex | Cc: | |
Estimated difficulty: |
Description
We haven't figured out the cause or anything else yet, but on my NetBSD box, I get the following (expected and correct output):
#;1> (use irregex)
; loading library irregex ...
#;2> (irregex-match '(seq (* white) (submatch (+ print)) (* white) ":" (+ white) (submatch (+ print)) (* white)) "io:event-dispatch")
#f
On call-cc.org (and on another of Mario's machines which is x86-64 Linux) we get this:
#;1> (use irregex)
; loading library irregex ...
#;2>
#;2> (irregex-match '(seq (* white) (submatch (+ print)) (* white) ":" (+ white) (submatch (+ print)) (* white)) "io:event-dispatch")
#<regexp-match (2 submatches)>
#;3> (irregex-match-substring #2 1)
"i"
#;4> (irregex-match-substring #2 2)
#f
Change History (9)
comment:1 Changed 13 years ago by
comment:2 Changed 13 years ago by
It's also "wrong" with irregex upstream but that's because the regex itself is wrong; it should be "graphic", not "print". "print" includes whitespace!
So this is a bug in qwiki (and I'll fix it asap), but the inconsistent output on the same input remains unexplained.
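To illustrate the overlap, here is a minimal sketch, assuming a CHICKEN 4 csi where `(use irregex)` works as in the transcripts above. Since "print" covers whitespace while "graphic" does not, a string with an embedded space should match `(+ print)` in full but not `(+ graphic)`. The test string and the variable names are made up for illustration.

```scheme
(use irregex) ; CHICKEN 4 syntax, as in the transcripts above

;; irregex-match only succeeds if the whole string matches.
;; "a b" is a made-up test string containing one space.
(define print-match   (irregex-match '(+ print)   "a b"))
(define graphic-match (irregex-match '(+ graphic) "a b"))

;; Expected on a working irregex: print-match is a match object
;; (the space counts as "print"), graphic-match is #f.
```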
comment:3 Changed 13 years ago by
Summary: | irregex behaves different on some machines → irregex behaves differently with overlapping character sets on some machines |
---|
comment:4 Changed 13 years ago by
So the current status is: irregex is okay but in Chicken it sometimes produces different output with the same input if the input is ambiguous.
comment:5 Changed 13 years ago by
In a recent commit a bug seems to have been introduced with the char-set handling.
Going back to version 59 makes it work, provided you give it a regexp that actually matches: in this case, you need to replace the (+ white) with (* white) or add some whitespace after the colon.
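A rough sketch of those two options, assuming a build without the char-set bug and the CHICKEN 4 `(use irregex)` form from the transcripts. The variable names and the second input string are made up for illustration.

```scheme
(use irregex) ; CHICKEN 4 syntax

;; Option 1: relax (+ white) after the colon to (* white).
(define relaxed-sre
  '(seq (* white) (submatch (+ print)) (* white) ":"
        (* white) (submatch (+ print)) (* white)))
(define m1 (irregex-match relaxed-sre "io:event-dispatch"))

;; Option 2: keep the original SRE and add whitespace after the colon.
(define original-sre
  '(seq (* white) (submatch (+ print)) (* white) ":"
        (+ white) (submatch (+ print)) (* white)))
(define m2 (irregex-match original-sre "io: event-dispatch"))

;; On a correct irregex both should give "io" and "event-dispatch",
;; since each input contains exactly one colon.
(list (irregex-match-substring m1 1) (irregex-match-substring m1 2))
(list (irregex-match-substring m2 1) (irregex-match-substring m2 2))
```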
comment:6 Changed 13 years ago by
The bug appears to be in the nfa->dfa conversion, since the NFA resulting from this SRE is exactly the same before and after the breaking change.
I've managed to reduce it further to an even simpler match, which gets compiled to this DFA:
#;2> (pp (nfa->dfa (sre->nfa (quote (seq (+ print) white)) ~none)))
#((#f (#((#\tab . #\newline) (#\space . #\~)) . 1))
  (#f (#((#\tab . #\tab) (#\! . #\~)) . 2) (#\space . 2) (#\newline . 2))
  (#t (#((#\tab . #\tab) (#\! . #\~)) . 2) (#\space . 2) (#\newline . 2)))
It should look something like this (the accepting state should be left when encountering a non-whitespace character):
#((#f (#((#\tab . #\newline) (#\space . #\~)) . 1))
  (#f (#((#\space . #\~)) . 2) (#\tab . 2) (#\newline . 2))
  (#t (#((#\! . #\~)) . 1) (#\space . 2) #((#\tab . 2) (#\newline . 2))))
It's most likely in the hairy nfa-join-transitions! procedure. I don't understand it anymore so I'll probably just end up rewriting it ;)
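To see what the broken DFA does with concrete input, it can be walked by hand. The sketch below is not irregex code: `char-matches?` and `dfa-accepts?` are hypothetical helpers, and the state layout (state 0 is the start state, each state is `(accepting? transition ...)`, each transition is `(char-or-ranges . next-state)`) is an assumption read off the pp output above.

```scheme
;; Hypothetical helpers, not part of irregex.

;; cs is either a single character or a vector of (low . high) ranges.
(define (char-matches? c cs)
  (if (char? cs)
      (char=? c cs)
      (let loop ((i 0))
        (and (< i (vector-length cs))
             (let ((range (vector-ref cs i)))
               (or (and (char<=? (car range) c) (char<=? c (cdr range)))
                   (loop (+ i 1))))))))

;; Walk the DFA over the whole string, taking the first transition that
;; matches each character; #t if the walk ends in an accepting state.
(define (dfa-accepts? dfa str)
  (let loop ((state 0) (chars (string->list str)))
    (let ((entry (vector-ref dfa state)))
      (if (null? chars)
          (car entry)
          (let try ((trans (cdr entry)))
            (cond ((null? trans) #f)
                  ((char-matches? (car chars) (caar trans))
                   (loop (cdar trans) (cdr chars)))
                  (else (try (cdr trans)))))))))

;; The broken DFA, copied from the pp output above.
(define buggy-dfa
  '#((#f (#((#\tab . #\newline) (#\space . #\~)) . 1))
     (#f (#((#\tab . #\tab) (#\! . #\~)) . 2) (#\space . 2) (#\newline . 2))
     (#t (#((#\tab . #\tab) (#\! . #\~)) . 2) (#\space . 2) (#\newline . 2))))

(dfa-accepts? buggy-dfa "a ")  ; #t, as it should be
(dfa-accepts? buggy-dfa "ab")  ; #t, although "ab" does not end in whitespace
```

The second call shows the problem: a non-whitespace character leads into the accepting state and keeps the walk there, so (seq (+ print) white) appears to match strings that do not end in whitespace.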
comment:7 Changed 13 years ago by
Owner: | set to felix winkelmann |
---|---|
Status: | new → assigned |
This bug was rather evil but easy to fix (after hammering on it for three days to pinpoint the cause).
Felix: Could you please review and merge 7d0520e145bc251abe2499af8149ff271ede6b29 from irregex-bugfixes to master?
comment:8 Changed 13 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Has been merged. Note that master is currently broken (expander/transformer issue).
I get the correct output with 4.7.0 on macosx-unix-gnu-x86-64.