Opened 10 years ago

Closed 10 years ago

Last modified 9 years ago

#636 closed defect (fixed)

irregex behaves differently with overlapping character sets on some machines

Reported by: sjamaan
Owned by: felix winkelmann
Priority: critical
Milestone: 4.9.0
Component: core libraries
Version: 4.7.x
Keywords: irregex
Cc:
Estimated difficulty:

Description

We haven't figured out the cause yet, but on my NetBSD box I get the following (expected and correct) output:

#;1> (use irregex)
; loading library irregex ...
#;2> (irregex-match 
        '(seq (* white) (submatch (+ print)) 
              (* white) ":" (+ white)
              (submatch (+ print)) (* white))
     "io:event-dispatch")
#f

On call-cc.org (and on another of Mario's machines, which is x86-64 Linux) we get this:

#;1> (use irregex)
; loading library irregex ...
#;2> (irregex-match 
        '(seq (* white) (submatch (+ print)) 
              (* white) ":" (+ white)
              (submatch (+ print)) (* white))
     "io:event-dispatch")
#<regexp-match (2 submatches)>
#;3> (irregex-match-substring #2 1)
"i"
#;4> (irregex-match-substring #2 2)
#f

Change History (9)

comment:1 Changed 10 years ago by Jim Ursetto

I get the correct output with 4.7.0 on macosx-unix-gnu-x86-64.

comment:2 Changed 10 years ago by sjamaan

It's also "wrong" with irregex upstream but that's because the regex itself is wrong; it should be "graphic", not "print". "print" includes whitespace!

So this is a bug in qwiki (and I'll fix it asap), but the inconsistent output on the same input remains unexplained.
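For reference, a minimal sketch of the corrected pattern, with "graphic" in place of "print". The name header-sre is only illustrative and is not taken from qwiki; the second input string is likewise made up for the example. This assumes CHICKEN 4, where irregex is loaded with (use irregex); on CHICKEN 5 that line would be (import (chicken irregex)).

```scheme
(use irregex)

;; Same SRE as in the description, but with the non-whitespace class
;; "graphic" for the two submatches, so they cannot overlap with "white".
(define header-sre
  '(seq (* white) (submatch (+ graphic))
        (* white) ":" (+ white)
        (submatch (+ graphic)) (* white)))

;; Should be #f: nothing after the colon satisfies (+ white).
(irregex-match header-sre "io:event-dispatch")

;; Should match, with "io" and "event-dispatch" as the two submatches.
(let ((m (irregex-match header-sre "io: event-dispatch")))
  (and m
       (list (irregex-match-substring m 1)
             (irregex-match-substring m 2))))
```

Since "graphic" and "white" do not overlap, there is only one way to split the input, so this pattern should not be sensitive to the ambiguity discussed in this ticket.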

comment:3 Changed 10 years ago by sjamaan

Summary: changed from "irregex behaves different on some machines" to "irregex behaves differently with overlapping character sets on some machines"

comment:4 Changed 10 years ago by sjamaan

So the current status is: irregex is okay but in Chicken it sometimes produces different output with the same input if the input is ambiguous.

comment:5 Changed 10 years ago by Alex Shinn

A recent commit seems to have introduced a bug in the char-set handling.

Going back to version 59, it works, provided you give it a regexp that actually matches: in this case, you need to replace the (+ white) with (* white) or add some whitespace after the colon.

comment:6 Changed 10 years ago by sjamaan

The bug appears to be in the nfa->dfa conversion, since the NFA resulting from this SRE is exactly the same before and after the breaking change.

I've managed to reduce it further to an even simpler match, which gets compiled to this DFA:

#;2> (pp (nfa->dfa (sre->nfa (quote (seq (+ print) white)) ~none)))
#((#f (#((#\tab . #\newline) (#\space . #\~)) . 1))
  (#f (#((#\tab . #\tab) (#\! . #\~)) . 2) (#\space . 2) (#\newline . 2))
  (#t (#((#\tab . #\tab) (#\! . #\~)) . 2) (#\space . 2) (#\newline . 2)))

It should look something like this (the accepting state should be left when encountering a non-whitespace character):

#((#f (#((#\tab . #\newline) (#\space . #\~)) . 1))
  (#f (#((#\space . #\~)) . 2) (#\tab . 2) (#\newline . 2))
  (#t (#((#\! . #\~)) . 1) (#\space . 2) (#\tab . 2) (#\newline . 2)))

It's most likely in the hairy nfa-join-transitions! procedure. I don't understand it anymore so I'll probably just end up rewriting it ;)
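To make the intended behaviour concrete, here is a small hand-written state machine for (seq (+ print) white). It is only a sketch and not irregex's actual representation or code. The procedures white?, print?, step and matches? are made up for this illustration, and the character classes are assumed to be the ones visible in the dump above (white is tab, newline and space; print is white plus "!" through "~").

```scheme
;; Character classes as they appear in the DFA dump (an assumption).
(define (white? c)
  (and (memv c '(#\space #\tab #\newline)) #t))

(define (print? c)
  (or (white? c)
      (and (char<=? #\! c) (char<=? c #\~))))

;; States: 0 = start, 1 = seen (+ print) but not ending in white,
;; 2 = accepting (last character was white).
;; Returns the next state, or #f if there is no transition.
(define (step state c)
  (cond ((not (print? c)) #f)   ; outside "print": no transition at all
        ((= state 0) 1)         ; first printable character
        ((white? c) 2)          ; whitespace enters or stays in state 2
        (else 1)))              ; non-whitespace leaves the accepting state

;; Full-string match: accept only if we end in state 2.
(define (matches? str)
  (let loop ((state 0) (chars (string->list str)))
    (cond ((not state) #f)
          ((null? chars) (= state 2))
          (else (loop (step state (car chars)) (cdr chars))))))

(matches? "ab ")   ; => #t
(matches? "a b")   ; => #f
```

In the broken DFA above, state 2 loops back to itself on "!" through "~", so an input like "a b" would end in the accepting state and be wrongly accepted as a full match.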

comment:7 Changed 10 years ago by sjamaan

Owner: set to felix winkelmann
Status: changed from new to assigned

This bug was rather evil but easy to fix (after hammering on it for three days to pinpoint the cause).

Felix: Could you please review and merge 7d0520e145bc251abe2499af8149ff271ede6b29 from irregex-bugfixes to master?

comment:8 Changed 10 years ago by felix winkelmann

Resolution: fixed
Status: changed from assigned to closed

Has been merged. Note that master is currently broken (expander/transformer issue).

comment:9 Changed 9 years ago by felix winkelmann

Milestone: changed from 4.8.0 to 4.9.0

Milestone 4.8.0 deleted
