#254 closed defect (fixed)
wiki-parse deficiencies
Reported by: | Moritz Heidkamp | Owned by: | sjamaan |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | extensions | Version: | 4.5.0 |
Keywords: | wiki-parse | Cc: | Ivan Raikov |
Estimated difficulty: |
Description
This ticket is supposed to act as a collection of parsing errors in wiki-parse.
- definition lists result in multiple <dl>s with one <dt>/<dd> pair each instead of one list with multiple pairs. See http://209.172.49.65/man/4/The%20User%27s%20Manual for an example (oddly, the lists on http://209.172.49.65/ are rendered correctly).
- enscript tags should be converted to something like <pre><code>
- handle nowiki in some reasonable way
Change History (32)
comment:1 Changed 14 years ago by
comment:2 follow-up: 4 Changed 14 years ago by
wiki-parse has quite a few other deficiencies, such as inability to handle nested lists. Assuming you want to keep svnwiki syntax exactly the same, an option is to just scrap it and use svnwiki-sxml from chicken-doc. You can also look separately at chicken-doc-html.scm in chickadee for ideas of how to render the SXML to HTML.
comment:3 Changed 14 years ago by
... and if you want to use svnwiki-sxml, I will pull it out into a separate egg.
comment:4 follow-ups: 7 9 10 Changed 14 years ago by
Replying to zbigniew:
wiki-parse has quite a few other deficiencies, such as inability to handle nested lists. Assuming you want to keep svnwiki syntax exactly the same, an option is to just scrap it and use svnwiki-sxml from chicken-doc. You can also look separately at chicken-doc-html.scm in chickadee for ideas of how to render the SXML to HTML.
I can't find the code for chickadee. It's not under eggs, and I don't see a repo link on http://3e8.org/zb/ or http://3e8.org/chickadee
If the svnwiki parser of chickadee is as complete as wiki-parse, an egg for it would be much appreciated. wiki-parse is buggy and kind of slow IMO.
comment:5 follow-up: 6 Changed 14 years ago by
Replying to syn:
This ticket is supposed to act as a collection of parsing errors in wiki-parse.
- definition lists result in multiple <dl>s with one <dt>/<dd> pair each instead of one list with multiple pairs. See http://209.172.49.65/man/4/The%20User%27s%20Manual for an example (oddly, the lists on http://209.172.49.65/ are rendered correctly).
- enscript tags should be converted to something like <pre><code>
Agreed
- handle nowiki in some reasonable way
What's unreasonable about the way it's handled now? The chicken logo and ohloh factoids sheet on the index page are both nowiki and look fine to me.
comment:6 Changed 14 years ago by
Replying to sjamaan:
- handle nowiki in some reasonable way
What's unreasonable about the way it's handled now? The chicken logo and ohloh factoids sheet on the index page are both nowiki and look fine to me.
I was just being dumb, of course. I was referring to tables like, for example, the ones on http://209.172.49.65/portability and assumed that it would involve nowiki tags. Apparently it doesn't, for whatever reason :) So let me rephrase: handle tables in some reasonable way.
comment:7 Changed 14 years ago by
Replying to sjamaan:
I can't find the code for chickadee. It's not under eggs, and I don't see a repo link on http://3e8.org/zb/ or http://3e8.org/chickadee
Try git clone http://3e8.org/chickadee.git
comment:8 Changed 14 years ago by
Definition lists: the problem you are describing is caused by the fact that there is a newline between every element of the list. The current wiki-parse is line- and regexp-based, and that means this issue can only be fixed by some heuristic matching, which would complicate the code. I should also note here that MediaWiki does the same thing if the elements of a definition list are separated by newlines.
Enscript tags: enscript is actually for syntax highlighting; preformatted code can be represented by indented text (same as svnwiki and MediaWiki). I suggest adopting a separate tag for syntax highlighting that can be handled by corresponding extensions in the rendering code, and maybe avoiding enscript altogether.
As for the other issues, I would be happy to switch away from the regexp-oriented parsing, which has led to difficulties in code maintenance, but I think we should also adopt more reasonable and well-defined syntax rules, such as RST or MediaWiki. svnwiki tables, in particular, are a mess, which is why I have chosen to use MediaWiki table syntax in wiki-parse.
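An alternative to heuristic matching inside the parser might be a post-pass over its output that merges neighbouring <dl> elements. The following is only a minimal sketch under assumptions: it presumes the parser output is a list of SXML-like nodes in which the single-pair <dl>s end up directly adjacent, and merge-adjacent-dls is a hypothetical name, not something wiki-parse provides.

```scheme
;; Hypothetical post-pass, NOT part of wiki-parse: merge runs of adjacent
;; (dl ...) elements in a list of SXML-like nodes into one (dl ...).

(define (dl? node)
  (and (pair? node) (eq? (car node) 'dl)))

(define (merge-adjacent-dls nodes)
  ;; acc holds the nodes emitted so far, most recent first.
  (let loop ((rest nodes) (acc '()))
    (cond ((null? rest) (reverse acc))
          ;; The current node and the last emitted node are both <dl>s:
          ;; append the current node's items to the previous list.
          ((and (dl? (car rest)) (pair? acc) (dl? (car acc)))
           (loop (cdr rest)
                 (cons (append (car acc) (cdr (car rest))) (cdr acc))))
          (else (loop (cdr rest) (cons (car rest) acc))))))

;; Assumed example input: two single-pair lists between two paragraphs.
(define parsed
  '((p "Options:")
    (dl (dt "-a") (dd "first"))
    (dl (dt "-b") (dd "second"))
    (p "End.")))

(merge-adjacent-dls parsed)
;; => ((p "Options:")
;;     (dl (dt "-a") (dd "first") (dt "-b") (dd "second"))
;;     (p "End."))
```

If the parser emits whitespace or empty paragraph nodes between the items, those would have to be skipped as well, which is where the heuristics come back in.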
comment:9 follow-up: 13 Changed 14 years ago by
Out of curiosity, have you done any concrete benchmarking of wiki-parse? My only test case besides the Chicken manual is quite small: about 2.3MB of wiki content, divided in about 160 files, which seem to be parsed in 2-3 seconds on my system. The Chicken manual is about 1.8MB currently.
Replying to sjamaan:
Replying to zbigniew:
wiki-parse has quite a few other deficiencies, such as inability to handle nested lists. Assuming you want to keep svnwiki syntax exactly the same, an option is to just scrap it and use svnwiki-sxml from chicken-doc. You can also look separately at chicken-doc-html.scm in chickadee for ideas of how to render the SXML to HTML.
I can't find the code for chickadee. It's not under eggs, and I don't see a repo link on http://3e8.org/zb/ or http://3e8.org/chickadee
If the svnwiki parser of chickadee is as complete as wiki-parse, an egg for it would be much appreciated. wiki-parse is buggy and kind of slow IMO.
comment:10 Changed 14 years ago by
Replying to sjamaan:
I can't find the code for chickadee. It's not under eggs, and I don't see a repo link on http://3e8.org/zb/ or http://3e8.org/chickadee
If the svnwiki parser of chickadee is as complete as wiki-parse, an egg for it would be much appreciated. wiki-parse is buggy and kind of slow IMO.
The svnwiki->sxml parser is part of chicken-doc-admin. (Node data is stored as sxml.) svnwiki-sxml->text is included in chicken-doc. svnwiki-sxml->html is part of chickadee. You want the first one, which I will break out into an egg. But you can look at the others if you want an example for html output.
You can see the advantages for yourself in the chickadee output, including the ability to parse lists that have embedded newlines. It's not the best parser, but it seems to do the job.
It's fine to improve the svnwiki syntax, but I would aim for backward compatibility, as Felix has recommended. If you do this, the changes can be integrated into svnwiki-sxml.
comment:11 Changed 14 years ago by
I can't guarantee it's fast, though. It's probably not any faster than wiki-parse.
comment:12 Changed 14 years ago by
Sorry for the reply spam, but I forgot to mention that, if you install chicken-doc-admin, the svnwiki-sxml egg will be installed as well. It's fully eggified already, just "private" to chicken-doc-admin. So you can play around with it without any action on my part. In fact you can simply "chicken-install svnwiki-sxml" from within the chicken-doc-admin directory, if you don't want to install the latter.
comment:13 Changed 14 years ago by
Replying to iraikov:
Out of curiosity, have you done any concrete benchmarking of wiki-parse? My only test case besides the Chicken manual is quite small: about 2.3MB of wiki content, divided in about 160 files, which seem to be parsed in 2-3 seconds on my system. The Chicken manual is about 1.8MB currently.
I've taken a random sample from the wiki contents to test qwiki with, and I noticed that especially the page for 9p (eggref/4/9p) is extremely slow. I just tested this and it turns out that it takes about 13 seconds to generate the full html page:
#;7> (time (qwiki-update-file! '("9p")))
14.809 seconds elapsed
5.268 seconds in (major) GC
2603950 mutations
212 minor GCs
194 major GCs
I was wrong thinking it was all caused by wiki-parse, though!
#;2> (time (call-with-input-file "/tmp/qwiki/9p" wiki-parse))
0.969 seconds elapsed
0.025 seconds in (major) GC
16028 mutations
1522 minor GCs
1 major GCs
I think one second for parsing is a bit on the slow side, but acceptable.
zbigniew: it turns out svnwiki->sxml is faster, not slower (on this particular file) than wiki-parse :)
#;4> (time (call-with-input-file "/tmp/qwiki/9p" svnwiki->sxml))
0.24 seconds elapsed
0.016 seconds in (major) GC
7161 mutations
1085 minor GCs
1 major GCs
comment:14 Changed 14 years ago by
Crap, it's the search index updater that takes so much time:
(time (qwiki-update-file! '("9p")))
2.444 seconds elapsed
0.87 seconds in (major) GC
176608 mutations
163 minor GCs
28 major GCs
#;16> (search-install!)
#;17> (time (qwiki-update-file! '("9p")))
17.597 seconds elapsed
8.018 seconds in (major) GC
2600848 mutations
262 minor GCs
271 major GCs
Next suspects: http-client, estraier-client or the estraier master.
comment:15 Changed 14 years ago by
After disabling the actual request in estraier-client:
(time (qwiki-update-file! '("9p")))
13.429 seconds elapsed
3.484 seconds in (major) GC
2732397 mutations
155 minor GCs
108 major GCs
After disabling the call to put-document in qwiki-search:
(time (qwiki-update-file! '("9p")))
4.555 seconds elapsed
2.262 seconds in (major) GC
342063 mutations
238 minor GCs
69 major GCs
I guess that rules out http-client and qwiki-search and leaves estraier-client.
comment:16 Changed 14 years ago by
Much better now! :)
Turns out that string-substitute is many, many times slower than irregex-replace/all.
(time (qwiki-update-file! '("9p")))
2.269 seconds elapsed
0.196 seconds in (major) GC
327123 mutations
184 minor GCs
6 major GCs
#;24> (search-install!)
#;25> (time (qwiki-update-file! '("9p")))
4.1 seconds elapsed
0.409 seconds in (major) GC
752288 mutations
474 minor GCs
13 major GCs
mario should just update estraier-client to version 0.2 and it'll be acceptably fast.
comment:17 Changed 14 years ago by
This link might be of interest:
http://www.wikicreole.org/
WikiCreole is an attempt to standardize wiki syntax. There is a complete BNF specification for WikiCreole, and it is not very far from that of svnwiki. I think a real grammar is much better than the error-prone heuristic parsing with regexes that is in use with most wiki engines.
Replying to syn:
This ticket is supposed to act as a collection of parsing errors in wiki-parse.
- definition lists result in multiple <dl>s with one <dt>/<dd> pair each instead of one list with multiple pairs. See http://209.172.49.65/man/4/The%20User%27s%20Manual for an example (oddly, the lists on http://209.172.49.65/ are rendered correctly).
- enscript tags should be converted to something like <pre><code>
- handle nowiki in some reasonable way
comment:18 Changed 14 years ago by
There is no separate EBNF for the Creole parser. There's only the ANTLR code which implements that, published at http://wikicreole.sf.net
Unfortunately, this code now falls under the AGPL, which I wouldn't want to touch with a ten-foot pole.
comment:19 follow-up: 20 Changed 14 years ago by
Since this wikicreole thing seems to be going nowhere I decided to hack on multidoc & qwiki to replace wiki-parse with svnwiki-sxml. So far it seems to work better and more reliably. I've also changed the TOC to output a nested list structure instead of a flat list so this works more like svnwiki too.
You can find this code in qwiki/branches/svnwiki-sxml and multidoc/branches/svnwiki-sxml.
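To illustrate the difference between a flat and a nested TOC, here is a small sketch. It is not the code from the branches: nest-headings is a hypothetical helper, and the input format, a list of (level . title) pairs in document order, is an assumption made for the example.

```scheme
;; Hypothetical sketch, NOT taken from qwiki or multidoc.
;; headings: list of (level . title) pairs in document order.
;; Result: a list of entries, each of the form (title child-entry ...).
(define (nest-headings headings)
  ;; Collect all entries at depth >= level from hs.
  ;; Returns a pair: (entries . remaining-headings).
  (define (collect hs level)
    (let loop ((hs hs) (acc '()))
      (if (or (null? hs) (< (caar hs) level))
          (cons (reverse acc) hs)
          (let* ((title (cdar hs))
                 ;; Everything deeper than this heading becomes its children.
                 (sub (collect (cdr hs) (+ (caar hs) 1))))
            (loop (cdr sub)
                  (cons (cons title (car sub)) acc))))))
  (car (collect headings 1)))

(nest-headings '((1 . "Intro") (2 . "Install") (2 . "Usage") (1 . "API")))
;; => (("Intro" ("Install") ("Usage")) ("API"))
```

Each entry can then be rendered as a list item containing its own sub-list, which gives the nested structure svnwiki produces.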
Ivan: I'd appreciate it if you could have a look at the LaTeX and texinfo code. It seems to be broken (even for the wiki-parse version). I've hacked it a little to get it to at least output some stuff again with svnwiki-sxml, but it's not working properly (for example the document headers are not included)
mario: Could you update qwiki and multidoc on call-cc.org? You will have to remove all references to qwiki-nowiki, since svnwiki-sxml takes care of the nowiki parsing all by itself (@ivan & zbigniew: this may cause some trouble with the LaTeX and texinfo output)
comment:20 follow-up: 22 Changed 14 years ago by
Ok, I will look at that. I use the LaTeX backend with wiki-parse generated output on an almost daily basis, so it should be working, but maybe I am using some additional templates that you don't have. I will check and let you know.
Replying to sjamaan:
Ivan: I'd appreciate it if you could have a look at the LaTeX and texinfo code. It seems to be broken (even for the wiki-parse version). I've hacked it a little to get it to at least output some stuff again with svnwiki-sxml, but it's not working properly (for example the document headers are not included)
comment:21 Changed 14 years ago by
Just for clarification: I think it's the qwiki-specific LaTeX and texinfo generation that's broken. The multidoc part is okay.
comment:22 follow-up: 23 Changed 14 years ago by
The LaTeX backend should be working now, I have committed a fix in the qwiki trunk. Let me know if you want me to merge this in your branch also.
Replying to iraikov:
Ok, I will look at that. I use the LaTeX backend with wiki-parse generated output on an almost daily basis, so it should be working, but maybe I am using some additional templates that you don't have. I will check and let you know.
Replying to sjamaan:
Ivan: I'd appreciate it if you could have a look at the LaTeX and texinfo code. It seems to be broken (even for the wiki-parse version). I've hacked it a little to get it to at least output some stuff again with svnwiki-sxml, but it's not working properly (for example the document headers are not included)
comment:23 Changed 14 years ago by
Replying to iraikov:
The LaTeX backend should be working now, I have committed a fix in the qwiki trunk. Let me know if you want me to merge this in your branch also.
I'd appreciate that. I intend to merge the qwiki and multidoc "svnwiki-sxml" branches into their respective trunks sometime soon, unless there are any objections. svnwiki-sxml is a much better parser and I've already made some other important changes in that branch.
I think we should mark the wiki-parse egg as deprecated or even remove it from the egg repo altogether.
comment:24 follow-up: 25 Changed 14 years ago by
I saw you've pulled the changes you made into the branch.
I'd like to reintegrate the multidoc & qwiki branches into their trunks again. Is that alright with you, Ivan?
comment:25 follow-up: 26 Changed 14 years ago by
Yes, I looked at the code and it seems ok. One minor thing is that the `section' rule starts with a lowercase letter in the HTML backend and with an uppercase letter in the LaTeX backend. We should probably choose a consistent capitalization for all rules.
Replying to sjamaan:
I saw you've pulled the changes you made into the branch.
I'd like to reintegrate the multidoc & qwiki branches into their trunks again. Is that alright with you, Ivan?
comment:26 follow-up: 27 Changed 14 years ago by
Replying to iraikov:
Yes, I looked at the code and it seems ok. One minor thing is that the `section' rule starts with a lowercase letter in the HTML backend and with an uppercase letter in the LaTeX backend. We should probably choose a consistent capitalization for all rules.
That's odd, I don't see that. Are you sure you're looking at the branch code? svnwiki-sxml generates all-lowercase section elements, so if there's a capital somewhere it's probably broken.
comment:27 Changed 14 years ago by
Ok, I see now that I was looking at the trunk files instead of the svnwiki-sxml branch. I think that the examples and highlight rule definitely do not belong in the core multidoc, but should instead be part of svnwiki and its extensions. Highlight in particular is something that would almost certainly be handled by external parsers.
Replying to sjamaan:
Replying to iraikov:
Yes, I looked at the code and it seems ok. One minor thing is that the `section' rule starts with a lowercase letter in the HTML backend and with an uppercase letter in the LaTeX backend. We should probably choose a consistent capitalization for all rules.
That's odd, I don't see that. Are you sure you're looking at the branch code? svnwiki-sxml generates all-lowercase section elements, so if there's a capital somewhere it's probably broken.
comment:28 Changed 14 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
I've copied over the files from the svnwiki-sxml branches of multidoc and qwiki to their respective trunks and closed the branches.
We could remove or hide the wiki-parse egg. I'll leave that up to you, Ivan.
comment:29 follow-up: 30 Changed 14 years ago by
PS: You're absolutely right about the highlight and examples. I've moved them to qwiki. The highlighting now uses my new "colorize" egg for highlighting, but that only handles html output. For LaTeX and Texinfo it still simply puts the code in a preformatted block.
I'm not sure how and IF we should add LaTeX and Texinfo highlighting. I'd prefer to keep the colorize egg as unmodified as possible since this eases synchronization between our code and Lisppaste's code.
comment:30 Changed 14 years ago by
Thanks for porting colorize to Chicken. Indeed, there is no easy solution to syntax highlighting in Texinfo, and while there are several packages for colorized highlighting in LaTeX, they don't quite have the flexibility necessary for Scheme code. It would be nice to port the PLT Scheme syntax highlighting module (which is based on combinator parsing), but that is quite low on my list of priorities. I will mark wiki-parse as obsolete in the next few days.
Replying to sjamaan:
PS: You're absolutely right about the highlight and examples. I've moved them to qwiki. The highlighting now uses my new "colorize" egg for highlighting, but that only handles html output. For LaTeX and Texinfo it still simply puts the code in a preformatted block.
I'm not sure how and IF we should add LaTeX and Texinfo highlighting. I'd prefer to keep the colorize egg as unmodified as possible since this eases synchronization between our code and Lisppaste's code.
comment:31 Changed 14 years ago by
Do you have a link to that PLT code? I'd be interested to have a (quick) look.
Also, for LaTeX there's one pretty decent highlighter that I know of and that's Dorai Sitaram's slatex/tex2page highlighter. It produces pretty output (I think it's used in the html docs for r5rs and throughout PLT's online documentation) which works both in LaTeX and HTML. In fact, I started out by looking at that code, but decided against it because it's rather nasty, and Lisppaste supports many more languages which would be good for example code in C and the "chicken for programmers in other languages" pages.
But so that we at least have *some* highlighting (even if it's Scheme-only), we could use the slatex code. I don't know if that also works for texinfo. I don't even know if texinfo has color support at all ;)
- ''foo'' should render <em>foo</em> rather than <i>foo</i>
- '''foo''' should render <strong>foo</strong> rather than <b>foo</b>
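One way this could be handled at rendering time is a tag mapping in the HTML backend. This is only a sketch under assumptions: it presumes the parser emits i and b elements for the two markups, and html-tag and render are hypothetical names, not the multidoc or qwiki API.

```scheme
;; Hypothetical sketch, NOT the multidoc/qwiki API.
;; Map presentational element names to their semantic HTML tags.
(define (html-tag sym)
  (case sym
    ((i) "em")
    ((b) "strong")
    (else (symbol->string sym))))

;; Render a string or an element of the form (tag child ...) to HTML.
;; No escaping of <, > or & and no attribute handling: illustration only.
(define (render node)
  (if (string? node)
      node
      (let ((tag (html-tag (car node))))
        (string-append "<" tag ">"
                       (apply string-append (map render (cdr node)))
                       "</" tag ">"))))

(render '(p "plain " (i "foo") " and " (b "foo")))
;; => "<p>plain <em>foo</em> and <strong>foo</strong></p>"
```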