#281 closed enhancement (fixed)
promote irregex to full unit
Reported by: | Jim Ursetto | Owned by: | felix winkelmann |
---|---|---|---|
Priority: | major | Milestone: | 4.9.0 |
Component: | core libraries | Version: | 4.5.x |
Keywords: | regex irregex movin-on-up to-the-east-side deluxe-apt sky | Cc: | Alex Shinn |
Estimated difficulty: |
Description
On IRC, we were discussing the odd status of irregex, which requires that you (use regex) and (import irregex) to use it. As regex is now just a wrapper around irregex, and irregex has no dependencies on regex, it seemed to us it might be good to grant irregex full unit status.
The patch attached does so without breaking any existing code. To use irregex, you can continue to
(use regex) (import irregex)
or now simply
(use irregex)
for only the irregex API, or even
(use regex irregex)
for both, which is (IMO) less confusing.
Feel free to chime in, anyone.
Attachments (6)
Change History (49)
Changed 14 years ago by
Attachment: | irregex-promotion.diff.txt added |
---|
comment:1 follow-up: 3 Changed 14 years ago by
comment:2 Changed 14 years ago by
Actually, I much prefer the irregex API over the original one. Mostly because it offers SRE syntax which is vastly superior to string-based regexes (though I understand the regex unit now also accepts those?), and the irregex API offers complete access to all those features (like for example named submatches and working with "match" objects; I'm not sure if the "regex" API has a way to allow access to that).
Finally, I think using irregex directly could help to make code more portable (as irregex seems to work on a shitload of Schemes)
For these reasons I'd really like to see Jim's patch accepted. It would not make a difference for those using the regex API, but it would make life easier for those of us who prefer to use irregex directly.
comment:3 follow-up: 5 Changed 14 years ago by
Replying to felix:
I understand the motivation and it really makes things easier to use, but I'm not entirely happy with this change: I regard irregex as a pure "engine" (a replacement or previous regex engines used in chicken). That the irregex API is exposed just for convenience or performance (and it would be better not to expose it at all, but since its there, its there). By doing so, chicken users depend on it, which would make a replacement (not planned at all, though) impossible. Ieally, the irregex API should be an extension.
Let's talk about this more before making a decision.
Here is my rationale. Primarily, you already have Chicken users relying directly on the irregex API, because the irregex API is superior. Although I still use some regex calls when it is easier, I also rely on the irregex API in almost all new code. regex provides access to basic functions and includes SRE support, but it does not include important functionality such as regex-fold and especially substitution using a procedure as the target instead of a string (this feature is critical). Yes, the irregex API is still changing, and we are even running an old version, but it's still worth using it.
Let's do a Gedankenexperiment and imagine the regex backend is not tied to irregex. Then
- If irregex becomes an extension, how can a core unit (regex) depend on it?
- If we extend the regex API to provide everything that irregex does, does it become just a version of irregex with an incompatible API?
- Must any replacement engine guarantee SRE support?
I actually appreciate your argument because I have occasionally pined for the days of PCRE (!) -- as it is much, much faster than irregex. However, despite that, I have come to rely on SREs and some features of the irregex API. My patch just tried to codify existing user practice, but I'm not wedded to it.
So, let's discuss our options.
comment:4 Changed 14 years ago by
zbigniew, is there a particular reason why you commented out the inclusion of common-declarations.scm
from regex.scm
?
comment:5 Changed 14 years ago by
Cc: | Alex Shinn added; felix winkelmann removed |
---|---|
Milestone: | 4.6.0 |
Owner: | set to felix winkelmann |
Priority: | minor → major |
Status: | new → assigned |
Replying to zbigniew:
Here is my rationale. Primarily, you already have Chicken users relying directly on the irregex API, because the irregex API is superior. Although I still use some regex calls when it is easier, I also rely on the irregex API in almost all new code. regex provides access to basic functions and includes SRE support, but it does not include important functionality such as regex-fold and especially substitution using a procedure as the target instead of a string (this feature is critical). Yes, the irregex API is still changing, and we are even running an old version, but it's still worth using it.
Ok. I accept that the API is much better. But then we should go the whole way, and deprecate (and much later remove) the old regex API.
- If irregex becomes an extension, how can a core unit (regex) depend on it?
I meant the irregex module. It would still be there, but the import library could be available seperately. Just to suggest that it is optional. Probably silly, well.
- If we extend the regex API to provide everything that irregex does, does it become just a version of irregex with an incompatible API?
Yes. I was not suggesting that. The current regexp API is nothing to be proud of and there is no need to put more work into it.
- Must any replacement engine guarantee SRE support?
Yes.
I actually appreciate your argument because I have occasionally pined for the days of PCRE (!) -- as it is much, much faster than irregex. However, despite that, I have come to rely on SREs and some features of the irregex API. My patch just tried to codify existing user practice, but I'm not wedded to it.
I think the best option is to add irregex as a unit, as you suggested, deprecate (but keep) the regex API, move extra-functionality into other units, add the irregex documentation to the manual and upgrade irregex to a newer version. This will make irregex a crucial component of chicken.
I also will start working on optimizing it.
I could use help with some of these things.
comment:6 follow-up: 8 Changed 14 years ago by
I asked Alex about his plans for Irregex, since that info might come in useful for Chicken's planning too:
|11:22| ( sjamaan) foof: What's the roadmap for irregex? Felix is planning to import it soon |11:26| ( foof) 1) refactor and port to chibi module system using include files (allows for easy conditional includes) |11:27| ( foof) 2) write a translator from chibi module system to other module systems that's more robust than the current Makefile hacks I'm using |11:28| ( foof) 3) make chunked strings an optional loaded module |11:29| ( foof) 4) rewrite DFA match extraction to avoid closures - should be insignificantly slower but allow for pre-compiled regexps |11:30| ( foof) 5) when DFA isn't possible, rewrite backtracking algorithm to use non-backtracking NFA (long-term goal) |11:30| ( foof) I can't give accurate times though, my life is insane right now :/ |11:31| ( foof) ... plus additional utilities and optimizations sprinkled in there
There are a couple of patches for two issues I created that have been integrated in irregex's head which we probably want, so I'd recommend going with hg 29:bc015cea652d instead of the actual release.
comment:7 follow-up: 9 Changed 14 years ago by
Since I have no idea on what integrating irregex entails besides replacing irregex.scm, I decided to make myself useful and work on the docs instead.
Here's an initial conversion of irregex's documentation to wiki syntax. I'm not happy with the TOC structuring yet. It doesn't really feel like a Chicken manual page yet. I'd prefer to remove the "specification" section and lift the other sections up one level.
I'm not sure whether to keep the license and references though. I suppose putting the license and/or copyright statement wherever Chicken's is should be fine (it's BSD after all), but I couldn't find such a page in the manual. How did we do that with PCRE?
comment:8 Changed 14 years ago by
Replying to sjamaan:
There are a couple of patches for two issues I created that have been integrated in irregex's head which we probably want, so I'd recommend going with hg 29:bc015cea652d instead of the actual release.
I use version 0.8.0 right now ("total-irregex" branch). If you want, you can send me the necessary patches. Many thanks for the documentation, I've added it to the manual.
Regarding the future direction: Alex is too busy to be of any help here, so let's consider this a fork, particularly because the code has to be optimized and those changes will not make the code more portable or cleaner.
comment:9 Changed 14 years ago by
Replying to sjamaan:
I'm not sure whether to keep the license and references though. I suppose putting the license and/or copyright statement wherever Chicken's is should be fine (it's BSD after all), but I couldn't find such a page in the manual. How did we do that with PCRE?
If it's BSD, we don't have to add a license separately. I don't think we had a separate PCRE license, IIRC.
Changed 14 years ago by
Attachment: | irregex-chicken-and-named-submatch-stacking.patch added |
---|
Adds chicken 4 module declaration, documentation of previously missing functions, allows named submatches to "stack" (see issue 4 of irregex)
comment:10 follow-up: 12 Changed 14 years ago by
I attached a diff of irregex changeset 3becabcea5 (which fixes the issue I mentioned but also adds a module declaration for Chicken instead of the export list of Chicken 3)
The license can indeed be omitted (it's indeed a 3-clause BSD license just like Chicken's).
However, to make it easier for people making binary-only distributions we should repeat the copyright statement somewhere in the docs (or perhaps aggregate all the copyright statements of 3rd party software in one place. Perhaps Chicken's LICENSE file)
comment:11 follow-up: 13 Changed 14 years ago by
Hi, sorry I have been ignoring this ticket. I don't think that the regex API itself has to be deprecated necessarily. It offers some useful convenience functions and I still use it all the time, I just use irregex to fill in the gaps. For example there is no way that I know of in irregex to just return a list of all submatches, which you could then pass to a match-let macro. Instead you must call irregex-match-substring for every single item.
I suppose then that a regex egg could be created, although it would force all core calls to use the irregex API. It sounds like you are okay with this though.
Let me know what you need me to do for this ticket.
comment:12 follow-up: 16 Changed 14 years ago by
Replying to sjamaan:
I attached a diff of irregex changeset 3becabcea5 (which fixes the issue I mentioned but also adds a module declaration for Chicken instead of the export list of Chicken 3)
Thanks - I've added it.
However, to make it easier for people making binary-only distributions we should repeat the copyright statement somewhere in the docs (or perhaps aggregate all the copyright statements of 3rd party software in one place. Perhaps Chicken's LICENSE file)
Ok. Can you create a ticket for that?
comment:13 Changed 14 years ago by
Replying to zbigniew:
Hi, sorry I have been ignoring this ticket. I don't think that the regex API itself has to be deprecated necessarily. It offers some useful convenience functions and I still use it all the time, I just use irregex to fill in the gaps. For example there is no way that I know of in irregex to just return a list of all submatches, which you could then pass to a match-let macro. Instead you must call irregex-match-substring for every single item.
I suppose then that a regex egg could be created, although it would force all core calls to use the irregex API. It sounds like you are okay with this though.
Thanks for the suggestion - I way quite clueless about how to get rid of regex without breaking everything. Making regex an extension should be possible.
Let me know what you need me to do for this ticket.
It's all quite a piece of work, but I get along well.
comment:14 Changed 14 years ago by
Status: | assigned → accepted |
---|
comment:15 Changed 14 years ago by
Keeping regex as an egg also ensures that the existing idiom:
(use regex) (import irregex)
continues to work, I think.
comment:16 Changed 14 years ago by
Replying to felix:
However, to make it easier for people making binary-only distributions we should repeat the copyright statement somewhere in the docs (or perhaps aggregate all the copyright statements of 3rd party software in one place. Perhaps Chicken's LICENSE file)
Ok. Can you create a ticket for that?
Done (as ticket #291)
Don't bother yourself with it, until I post a new LICENSE file, which you can then review.
comment:17 Changed 14 years ago by
The total-irregex
branch in the git repository is more or less complete now and represents irregex 0.8.2. The tests seem to run ok. I have tried to tune the code with chicken-specific forms (guarded via cond-expand
). There still is a lot that can be optimized. I would also like to run mini-salmonella
once before declaring the branch as done.
The old regex API has been moved into an egg, you can see it here:
http://github.com/bunny351/protoeggs/tree/master/regex/
I'm not sure how to handle the egg: should it be shipped with the core libs or should we forcefully modify the .meta files of all eggs that use it in a big update spree? (that should be backwards-compatible, since chicken-install
should ignore any dependencies on regex
.
Sorry, I can't get the benchmarks in uri-generic/trunk/alternatives
to run properly: uri-generic.abnf.scm
accesses undefined identifiers (has abnf been updated recently?) and uri-generic.irregex.scm
gives errors about named submatches it can't find.
comment:18 follow-up: 22 Changed 14 years ago by
I know Ivan has changed abnf a lot (I saw lots of changes when svn up'ing), but I don't know what exactly he changed. I'd have to dig in to fix that. Not sure if it's worth it..
The irregex errors happen only with the new chicken, and that's because irregex now returns an error for nonmatched named submatches. It seems to me that that is a bug:
(define m (irregex-match `(or (submatch-named foo "foo") (submatch-named bar "bar")) "foo")) (irregex-match-substring m 'foo) => "foo" (irregex-match-substring m 'bar) => #f ;; In Chicken 4.5.0 (irregex-match-substring m 'bar) => Error: unknown match name: bar ;; In total-irregex
This completely breaks backwards compatibility, since I used the fact that it returned #f to simplify the code to check for which alternative was used:
(or (irregex-match-substring m 'foo) (irregex-match-substring m 'bar))
This would always give me either "foo", "bar" or #f depending on what matched, if anything. In code this is very convenient. Much more convenient than catching an exception to handle the case where it didn't match or requesting all the names and checking whether the match exists before extracting it.
I hope Alex reads this, and can pitch in.
comment:19 Changed 14 years ago by
This is indeed a regression (Alex already said he thought it probably was a bug). Here's a patch for irregex.
Changed 14 years ago by
Attachment: | irregex-named-submatches.diff added |
---|
Fixes extraction of existing, but nonmatching named submatches
comment:20 follow-up: 23 Changed 14 years ago by
Here's the same patch, but for Chicken. After applying and compiling, uri-generic works again.
Changed 14 years ago by
Attachment: | total-irregex-named-submatches.diff added |
---|
Named submatches fix for Chicken
comment:21 follow-up: 24 Changed 14 years ago by
If regex is shipped as an egg, we should update the internal egg version number. Otherwise people will upgrade their Chicken and their old installed eggs that use regex will start failing.
Alternatively, we could ship it with core for now and take the opportunity of the next internal egg version bump of putting it in an egg. I'd prefer this approach.
In the meanwhile, the eggs could already be updated right now. If requiring a nonexistant "regex" egg doesn't make them break if the regex lib is part of chicken we will have a decent transition period with no breakage at all. nice!
comment:22 Changed 14 years ago by
This bug was caused by the recent "stacking" of named submatches feature
in 0.8.2. Sorry about that - I've fixed it upstream.
comment:23 Changed 14 years ago by
Replying to sjamaan:
Here's the same patch, but for Chicken. After applying and compiling, uri-generic works again.
Thanks, patch is applied.
comment:24 Changed 14 years ago by
Replying to sjamaan:
If regex is shipped as an egg, we should update the internal egg version number. Otherwise people will upgrade their Chicken and their old installed eggs that use regex will start failing.
Indeed - I didn't think of that. With "internal egg version number" you mean the "5" in "$(PREFIX)/lib/chicken/5", right?
Alternatively, we could ship it with core for now and take the opportunity of the next internal egg version bump of putting it in an egg. I'd prefer this approach.
It would be the safest. Alternatively the .setup script of the regex egg could check the installed chicken version and act accordingly.
comment:25 follow-up: 35 Changed 14 years ago by
Ivan told me that the old abnf API is available from the abnf-charlist module now. FWIW, I've updated the uri-generic abnf alternative code so it uses that, now this works again.
comment:26 follow-up: 29 Changed 14 years ago by
With internal egg version number I did indeed mean the "5" in your example. Does it have a proper name?
Shall I start adding "regex" to the meta files?
comment:27 follow-up: 31 Changed 14 years ago by
Here's another patch. Alex said the irregex-match-valid-index? procedure ought to be exported, so that's what this patch does.
I also added named submatch checking to irregex-match-valid-index?. By doing this, things like irregex-match-start-index now correctly accept an index-or-name instead of just a numeric index, so their behaviour now matches what the documentation says.
It's also slightly more consistent because the index is now optional for all six irregex-match-{start,end}-{index,chunk} procedures, returning the full match (index 0) by default.
Changed 14 years ago by
Attachment: | match-index-valid.diff added |
---|
Export match-index-valid? and make it and some other procedures accept named submatches too
comment:28 follow-up: 30 Changed 14 years ago by
BTW, I don't think I understand types.db so I didn't provide a definition for irregex-match-valid-index? yet. Would something like this do?
(irregex-match-valid-index? (procedure irregex-match-valid-index? ((struct regexp-match) *) boolean))
comment:29 follow-up: 34 Changed 14 years ago by
Replying to sjamaan:
With internal egg version number I did indeed mean the "5" in your example. Does it have a proper name?
Well, I call it "binary version number", but it's not important, I just wanted to avoid misinterpreting your comment.
Shall I start adding "regex" to the meta files?
Not yet, the egg has to be added first. chicken-install
will ignore regex only in relatively recent chicken installations.
comment:30 follow-up: 32 Changed 14 years ago by
Replying to sjamaan:
BTW, I don't think I understand types.db so I didn't provide a definition for irregex-match-valid-index? yet. Would something like this do?
(irregex-match-valid-index? (procedure irregex-match-valid-index? ((struct regexp-match) *) boolean))
Yep, that's right. The type format is documented in manual/Declarations
, IIRC.
comment:31 follow-up: 33 Changed 14 years ago by
Replying to sjamaan:
Here's another patch. Alex said the irregex-match-valid-index? procedure ought to be exported, so that's what this patch does.
I also added named submatch checking to irregex-match-valid-index?. By doing this, things like irregex-match-start-index now correctly accept an index-or-name instead of just a numeric index, so their behaviour now matches what the documentation says.
It's also slightly more consistent because the index is now optional for all six irregex-match-{start,end}-{index,chunk} procedures, returning the full match (index 0) by default.
Is this applied to the irregex HEAD? I checked the mercurial tip yesterday and spotted a few differences.
comment:32 follow-up: 36 Changed 14 years ago by
Replying to felix:
Replying to sjamaan:
BTW, I don't think I understand types.db so I didn't provide a definition for irregex-match-valid-index? yet. Would something like this do?
(irregex-match-valid-index? (procedure irregex-match-valid-index? ((struct regexp-match) *) boolean))
Yep, that's right. The type format is documented in
manual/Declarations
, IIRC.
What I don't understand is the need for repeating the procedure name. This is also not in the Declarations document: the type declaration (procedure irregex-match-valid-index? ((struct regexp-match) *) boolean)
does not match the production (procedure (VAL1 ... [#!optional VALOPT1 ...] [#!rest [VAL]]) . RESULTS)
comment:33 follow-up: 37 Changed 14 years ago by
Replying to felix:
Is this applied to the irregex HEAD? I checked the mercurial tip yesterday and spotted a few differences.
Yes, this is the same patch I made here. I had to move the irregex-match-valid-numeric-index?
in the file because of the restriction on define-inline
having to appear before any usages (took me a while to figure that out! A compiler error would be nicer than a runtime segfault ;) ) but otherwise it's the same patch I posted here.
comment:34 follow-up: 38 Changed 14 years ago by
Replying to felix:
Replying to sjamaan:
Shall I start adding "regex" to the meta files?
Not yet, the egg has to be added first.
chicken-install
will ignore regex only in relatively recent chicken installations.
Apologies if this thread is becoming unmanageable :(
I was wondering about this: what happens to eggs using irregex? They currently (use regex) or (require-library regex) and (import irregex). This would stop working when regex is its own egg, wouldn't it? So eggs using irregex would need to depend on the regex egg during some transition period. It wouldn't be too great if they all started depending on a bleeding edge version of Chicken since several OSes might still have 4.5.0 or older in their package systems.
What would be an acceptable transition period?
comment:35 Changed 14 years ago by
Replying to sjamaan:
Ivan told me that the old abnf API is available from the abnf-charlist module now. FWIW, I've updated the uri-generic abnf alternative code so it uses that, now this works again.
Thanks. Here some benchmark results (chicken 4.5.7, "total-irregex" branch):
% ~/chicken-core-fast/bin/csi -nq uri-generic.matchable.so run.scm 0.984s CPU time, 0.12s GC time (major), 1320149 mutations, 87/15669 GCs (major/minor) #;1> ,q % ~/chicken-core-fast/bin/csi -bnq uri-generic.abnf.so run.scm 4.628s CPU time, 1.032s GC time (major), 570074 mutations, 395/56499 GCs (major/minor) % ~/chicken-core-fast/bin/csi -bnq uri-generic.irregex.so run.scm 5.208s CPU time, 0.62s GC time (major), 640081 mutations, 206/122146 GCs (major/minor) % ~/chicken-core-fast/bin/csi -bnq uri-generic.packrat.so run.scm 8.576s CPU time, 2.852s GC time (major), 3100327 mutations, 1382/83770 GCs (major/minor)
comment:36 Changed 14 years ago by
Replying to sjamaan:
What I don't understand is the need for repeating the procedure name. This is also not in the Declarations document: the type declaration
(procedure irregex-match-valid-index? ((struct regexp-match) *) boolean)
does not match the production(procedure (VAL1 ... [#!optional VALOPT1 ...] [#!rest [VAL]]) . RESULTS)
The name is optional (forgot to document this, will now) and is required to get at the procedure name even if the type propagated through various expressions in the flow-analysis done by "-scrutiny".
comment:37 Changed 14 years ago by
Replying to sjamaan:
Yes, this is the same patch I made here. I had to move the
irregex-match-valid-numeric-index?
in the file because of the restriction ondefine-inline
having to appear before any usages (took me a while to figure that out! A compiler error would be nicer than a runtime segfault ;) ) but otherwise it's the same patch I posted here.
Noted.
comment:38 follow-up: 40 Changed 14 years ago by
Replying to sjamaan:
Apologies if this thread is becoming unmanageable :(
I think it's great fun.
I was wondering about this: what happens to eggs using irregex? They currently (use regex) or (require-library regex) and (import irregex). This would stop working when regex is its own egg, wouldn't it? So eggs using irregex would need to depend on the regex egg during some transition period. It wouldn't be too great if they all started depending on a bleeding edge version of Chicken since several OSes might still have 4.5.0 or older in their package systems.
That's right, but both should work. (require-library regex)
or (require-extension regex)
should be a noop on older chicken systems. chicken-install regex
(or indirect installations due to dependencies) will install the new egg, but its .setup script will do nothing if the current chicken does include the regex core unit (version < X.X.X) or install the regex extension otherwise. (import irregex)
should work for all chicken versions. What will not work on newer chickens is (begin (require-library irregex) (import regex))
(but that may be somewhat contrived).
My head is spinning. This just has to be tested extensively.
comment:39 Changed 14 years ago by
Milestone: | → 4.7.0 |
---|
comment:40 Changed 14 years ago by
Replying to felix:
Replying to sjamaan:
Apologies if this thread is becoming unmanageable :(
I think it's great fun.
:P
That's right, but both should work.
(require-library regex)
or(require-extension regex)
should be a noop on older chicken systems.
What I mean is this:
When can eggs start using (use irregex), and does the old syntax mean eggs using irregex will now require an extra "regex" egg to be installed even if they just (require-library regex) (import irregex)?
My head is spinning. This just has to be tested extensively.
Yeah, it's tricky, but it can't be tested easily since there are different situations:
1 - old egg, old chicken (current situation)
2 - old egg, new chicken
3 - new egg, old chicken
4 - new egg, new chicken
And that in two different situations for each (eggs using regex and eggs using irregex)
comment:41 Changed 14 years ago by
Resolution: | → fixed |
---|---|
Status: | accepted → closed |
Has been merged into "experimental". As a test I have added a dependency on regex
(the egg) to the regex-case
egg to test the handling of egg-deps on regex
. Old (pre 4.5.0) chickens should try to install it, but the .setup script will just run to completion and not install anything. 4.5.0 will ignore any dependencies on regex
. "experimental" will install it as usual. From the little testing I did, this seems to work.
After the next minor release, this should be announced on chicken-users and add dependencies to regex
to all eggs that use the old regex API.
To keep backwards-compatibility, eggs should still do the old idiom:
(use regex) (import irregex)
I understand the motivation and it really makes things easier to use, but I'm not entirely happy with this change: I regard irregex as a pure "engine" (a replacement or previous regex engines used in chicken). That the irregex API is exposed just for convenience or performance (and it would be better not to expose it at all, but since its there, its there). By doing so, chicken users depend on it, which would make a replacement (not planned at all, though) impossible. Ieally, the irregex API should be an extension.
Let's talk about this more before making a decision.