Opened 11 years ago

Closed 11 years ago

Last modified 10 years ago

#254 closed defect (fixed)

wiki-parse deficiencies

Reported by: Moritz Heidkamp
Owned by: sjamaan
Priority: major
Milestone:
Component: extensions
Version: 4.5.0
Keywords: wiki-parse
Cc: Ivan Raikov
Estimated difficulty:

Description

This ticket is supposed to act as a collection of parsing errors in wiki-parse.

Change History (32)

comment:1 Changed 11 years ago by Moritz Heidkamp

  • ''foo'' should render <em>foo</em> rather than <i>foo</i>
  • '''foo''' should render <strong>foo</strong> rather than <b>foo</b>
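
As a rough illustration of the requested change (not code from wiki-parse itself), the presentational tags could be mapped to semantic ones in a post-processing pass over an SXML-like tree. The names `semantic-tag` and `semanticize` are hypothetical helpers made up for this sketch:

```scheme
;; Hypothetical helper: map presentational tag names to semantic ones.
(define (semantic-tag tag)
  (case tag
    ((i) 'em)
    ((b) 'strong)
    (else tag)))

;; Walk an SXML-like tree and rename element tags; strings pass through.
(define (semanticize node)
  (cond ((and (pair? node) (symbol? (car node)))
         (cons (semantic-tag (car node))
               (map semanticize (cdr node))))
        (else node)))

(semanticize '(p "a " (i "foo") " and " (b "bar")))
;; => (p "a " (em "foo") " and " (strong "bar"))
```

Fixing the rules in the parser would be the cleaner approach; the sketch only shows the intended tag mapping.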

comment:2 Changed 11 years ago by Jim Ursetto

wiki-parse has quite a few other deficiencies, such as inability to handle nested lists. Assuming you want to keep svnwiki syntax exactly the same, an option is to just scrap it and use svnwiki-sxml from chicken-doc. You can also look separately at chicken-doc-html.scm in chickadee for ideas of how to render the SXML to HTML.

comment:3 Changed 11 years ago by Jim Ursetto

... and if you want to use svnwiki-sxml, I will pull it out into a separate egg.

comment:4 in reply to:  2 ; Changed 11 years ago by sjamaan

Replying to zbigniew:

wiki-parse has quite a few other deficiencies, such as inability to handle nested lists. Assuming you want to keep svnwiki syntax exactly the same, an option is to just scrap it and use svnwiki-sxml from chicken-doc. You can also look separately at chicken-doc-html.scm in chickadee for ideas of how to render the SXML to HTML.

I can't find the code for chickadee. It's not under eggs, and I don't see a repo link on http://3e8.org/zb/ or http://3e8.org/chickadee

If the svnwiki parser of chickadee is as complete as wiki-parse, an egg for it would be much appreciated. wiki-parse is buggy and kind of slow IMO.

comment:5 in reply to:  description ; Changed 11 years ago by sjamaan

Replying to syn:

This ticket is supposed to act as a collection of parsing errors in wiki-parse.

Agreed

  • handle nowiki in some reasonable way

What's unreasonable about the way it's handled now? The chicken logo and ohloh factoids sheet on the index page are both nowiki and look fine to me.

comment:6 in reply to:  5 Changed 11 years ago by Moritz Heidkamp

Replying to sjamaan:

  • handle nowiki in some reasonable way

What's unreasonable about the way it's handled now? The chicken logo and ohloh factoids sheet on the index page are both nowiki and look fine to me.

I was just being dumb, of course. I was referring to tables like for example on http://209.172.49.65/portability and assumed that it would involve nowiki tags. Apparently it doesn't for whatever reason :) So let me rephrase: handle tables in some reasonable way.

comment:7 in reply to:  4 Changed 11 years ago by Mario Domenech Goulart

Replying to sjamaan:

I can't find the code for chickadee. It's not under eggs, and I don't see a repo link on http://3e8.org/zb/ or http://3e8.org/chickadee

Try git clone http://3e8.org/chickadee.git

comment:8 in reply to:  description Changed 11 years ago by Ivan Raikov

Definition lists: the problem you are describing is caused by the fact that there is a newline between every element of the list. The current wiki-parse is line- and regexp-based, and that means this issue can only be fixed by some heuristic matching, which would complicate the code. I should also note here that MediaWiki does the same thing if the elements of a definition list are separated by newlines.
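
To make the line-based limitation concrete, here is a minimal sketch in portable Scheme. It is not wiki-parse code: `item-line?`, `blank?` and `count-lists` are hypothetical names, and it assumes that a definition-list item is simply a line starting with `;`. Without lookahead every blank line closes the current list; the "heuristic" variant peeks at the next line before deciding.

```scheme
;; Assumption for this sketch: a definition-list item starts with ";".
(define (item-line? s)
  (and (> (string-length s) 0)
       (char=? (string-ref s 0) #\;)))

(define (blank? s) (string=? s ""))

;; Count how many separate lists a line-oriented pass would emit.
;; lookahead? = #f: any non-item line (including a blank one) closes the list.
;; lookahead? = #t: a blank line directly followed by an item keeps it open.
(define (count-lists lines lookahead?)
  (let loop ((ls lines) (in-list? #f) (n 0))
    (cond ((null? ls) n)
          ((item-line? (car ls))
           (loop (cdr ls) #t (if in-list? n (+ n 1))))
          ((and lookahead? in-list? (blank? (car ls))
                (pair? (cdr ls)) (item-line? (cadr ls)))
           (loop (cdr ls) #t n))
          (else (loop (cdr ls) #f n)))))

(define sample '("; foo : first" "" "; bar : second" "" "; baz : third"))

(count-lists sample #f) ; => 3 (three one-element lists)
(count-lists sample #t) ; => 1 (one three-element list)
```

The lookahead branch is exactly the kind of extra special-casing that a purely line-oriented design accumulates.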

Enscript tags: enscript is actually for syntax highlighting; preformatted code can be represented by indented text (same as svnwiki and MediaWiki). I suggest adopting a separate tag for syntax highlighting that can be handled by corresponding extensions in the rendering code, and maybe avoiding enscript altogether.

As for the other issues, I would be happy to switch away from the regexp-oriented parsing, which has led to difficulties in code maintenance, but I think we should also adopt more reasonable and well-defined syntax rules, such as RST or MediaWiki. svnwiki tables, in particular, are a mess, which is why I have chosen to use MediaWiki table syntax in wiki-parse.

comment:9 in reply to:  4 ; Changed 11 years ago by Ivan Raikov

Out of curiosity, have you done any concrete benchmarking of wiki-parse? My only test case besides the Chicken manual is quite small: about 2.3MB of wiki content, divided in about 160 files, which seem to be parsed in 2-3 seconds on my system. The Chicken manual is about 1.8MB currently.

Replying to sjamaan:

Replying to zbigniew:

wiki-parse has quite a few other deficiencies, such as inability to handle nested lists. Assuming you want to keep svnwiki syntax exactly the same, an option is to just scrap it and use svnwiki-sxml from chicken-doc. You can also look separately at chicken-doc-html.scm in chickadee for ideas of how to render the SXML to HTML.

I can't find the code for chickadee. It's not under eggs, and I don't see a repo link on http://3e8.org/zb/ or http://3e8.org/chickadee

If the svnwiki parser of chickadee is as complete as wiki-parse, an egg for it would be much appreciated. wiki-parse is buggy and kind of slow IMO.

comment:10 in reply to:  4 Changed 11 years ago by Jim Ursetto

Replying to sjamaan:

I can't find the code for chickadee. It's not under eggs, and I don't see a repo link on http://3e8.org/zb/ or http://3e8.org/chickadee

If the svnwiki parser of chickadee is as complete as wiki-parse, an egg for it would be much appreciated. wiki-parse is buggy and kind of slow IMO.

The svnwiki->sxml parser is part of chicken-doc-admin. (Node data is stored as sxml.) svnwiki-sxml->text is included in chicken-doc. svnwiki-sxml->html is part of chickadee. You want the first one, which I will break out into an egg. But you can look at the others if you want an example for html output.

You can see the advantages for yourself in the chickadee output, including the ability to parse lists that have embedded newlines. It's not the best parser, but it seems to do the job.

It's fine to improve the svnwiki syntax, but I would aim for backward compatibility, as Felix has recommended. If you do this, the changes can be integrated into svnwiki-sxml.
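
For readers unfamiliar with the SXML-to-HTML step mentioned above, the following is a deliberately tiny sketch of what such a renderer does. It is not the chickadee code: `escape-html` and `tiny-sxml->html` are hypothetical names, and attributes, void elements and svnwiki-specific nodes are not handled.

```scheme
;; Escape the characters that are significant in HTML text.
(define (escape-html str)
  (apply string-append
         (map (lambda (c)
                (case c
                  ((#\<) "&lt;")
                  ((#\>) "&gt;")
                  ((#\&) "&amp;")
                  (else (string c))))
              (string->list str))))

;; Render an attribute-free SXML tree as an HTML string.
(define (tiny-sxml->html node)
  (cond ((string? node) (escape-html node))
        ((and (pair? node) (symbol? (car node)))
         (let ((tag (symbol->string (car node))))
           (string-append "<" tag ">"
                          (apply string-append
                                 (map tiny-sxml->html (cdr node)))
                          "</" tag ">")))
        (else ""))) ; anything else is silently dropped in this sketch

(tiny-sxml->html '(p "a " (em "foo") " & " (strong "bar")))
;; => "<p>a <em>foo</em> &amp; <strong>bar</strong></p>"
```

A real renderer would dispatch on the element name, which is where rules for sections, definitions and highlighted code come in.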

comment:11 Changed 11 years ago by Jim Ursetto

I can't guarantee it's fast, though. It's probably not any faster than wiki-parse.

comment:12 Changed 11 years ago by Jim Ursetto

Sorry for the reply spam, but I forgot to mention that, if you install chicken-doc-admin, the svnwiki-sxml egg will be installed as well. It's fully eggified already, just "private" to chicken-doc-admin. So you can play around with it without any action on my part. In fact you can simply "chicken-install svnwiki-sxml" from within the chicken-doc-admin directory, if you don't want to install the latter.

comment:13 in reply to:  9 Changed 11 years ago by sjamaan

Replying to iraikov:

Out of curiosity, have you done any concrete benchmarking of wiki-parse? My only test case besides the Chicken manual is quite small: about 2.3MB of wiki content, divided in about 160 files, which seem to be parsed in 2-3 seconds on my system. The Chicken manual is about 1.8MB currently.

I've taken a random sample from the wiki contents to test qwiki with, and I noticed that especially the page for 9p (eggref/4/9p) is extremely slow. I just tested this and it turns out that it takes about 13 seconds to generate the full html page:

#;7> (time (qwiki-update-file! '("9p")))
  14.809 seconds elapsed
   5.268 seconds in (major) GC
 2603950 mutations
     212 minor GCs
     194 major GCs

I was wrong thinking it was all caused by wiki-parse, though!

#;2> (time (call-with-input-file "/tmp/qwiki/9p" wiki-parse))
   0.969 seconds elapsed
   0.025 seconds in (major) GC
   16028 mutations
    1522 minor GCs
       1 major GCs

I think one second for parsing is a bit on the slow side, but acceptable.

zbigniew: it turns out svnwiki->sxml is faster, not slower (on this particular file) than wiki-parse :)

#;4> (time (call-with-input-file "/tmp/qwiki/9p" svnwiki->sxml))
   0.24 seconds elapsed
   0.016 seconds in (major) GC
    7161 mutations
    1085 minor GCs
       1 major GCs

comment:14 Changed 11 years ago by sjamaan

Crap, it's the search index updater that takes so much time:

(time (qwiki-update-file! '("9p")))
   2.444 seconds elapsed
    0.87 seconds in (major) GC
  176608 mutations
     163 minor GCs
      28 major GCs
#;16> (search-install!)
#;17> (time (qwiki-update-file! '("9p")))
  17.597 seconds elapsed
   8.018 seconds in (major) GC
 2600848 mutations
     262 minor GCs
     271 major GCs

Next suspects: http-client, estraier-client or the estraier master.

comment:15 Changed 11 years ago by sjamaan

After disabling the actual request in estraier-client:

(time (qwiki-update-file! '("9p")))
  13.429 seconds elapsed
   3.484 seconds in (major) GC
 2732397 mutations
     155 minor GCs
     108 major GCs

After disabling the call to put-document in qwiki-search:

(time (qwiki-update-file! '("9p")))
   4.555 seconds elapsed
   2.262 seconds in (major) GC
  342063 mutations
     238 minor GCs
      69 major GCs

I guess that rules out http-client and qwiki-search and leaves estraier-client.

comment:16 Changed 11 years ago by sjamaan

Much better now! :)

Turns out that string-substitute is many, many times slower than irregex-replace/all.

(time (qwiki-update-file! '("9p")))
   2.269 seconds elapsed
   0.196 seconds in (major) GC
  327123 mutations
     184 minor GCs
       6 major GCs
#;24> (search-install!)
#;25> (time (qwiki-update-file! '("9p")))
     4.1 seconds elapsed
   0.409 seconds in (major) GC
  752288 mutations
     474 minor GCs
      13 major GCs

mario should just update estraier-client to version 0.2 and it'll be acceptably fast.
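
For reference, the two calls compared above look roughly like this. This is a sketch under assumptions: it presumes a CHICKEN 4 installation where `(use regex irregex)` makes both procedures available (which unit or egg provides them differs between CHICKEN 4 releases), and the pattern and input are made up, since the ticket does not say what estraier-client actually substitutes.

```scheme
;; Assumption: both libraries are loadable this way on your CHICKEN 4 setup.
(use regex irregex)

(define text "a  b   c") ; made-up input for illustration

;; regex library: (string-substitute REGEX SUBST STRING [MODE])
;; MODE #t asks for all matches to be replaced.
(define slow (string-substitute " +" " " text #t))

;; irregex: (irregex-replace/all IRX STRING REPLACEMENT ...)
(define fast (irregex-replace/all " +" text " "))

;; Both are intended to collapse runs of spaces; only their speed differs.
(print slow)
(print fast)
```

Wrapping each call in `(time ...)` on a large input, as in the transcripts above, is the simplest way to check the difference on a given system.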

comment:17 in reply to:  description Changed 11 years ago by Ivan Raikov

This link might be of interest:
http://www.wikicreole.org/

WikiCreole is an attempt to standardize wiki syntax. There is a complete BNF specification for WikiCreole, and it is not very far from that of svnwiki. I think a real grammar is much better than the error-prone heuristic parsing with regexes that is in use with most wiki engines.

Replying to syn:

This ticket is supposed to act as a collection of parsing errors in wiki-parse.

comment:18 Changed 11 years ago by sjamaan

There is no separate EBNF for the Creole parser. There's only the ANTLR code which implements that, published at http://wikicreole.sf.net

Unfortunately this code now falls under the AGPL, which I wouldn't want to touch with a ten-foot pole.

comment:19 Changed 11 years ago by sjamaan

Since this wikicreole thing seems to be going nowhere I decided to hack on multidoc & qwiki to replace wiki-parse with svnwiki-sxml. So far it seems to work better and more reliably. I've also changed the TOC to output a nested list structure instead of a flat list so this works more like svnwiki too.

You can find this code in qwiki/branches/svnwiki-sxml and multidoc/branches/svnwiki-sxml.
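
The flat-to-nested TOC change mentioned above can be pictured with a small sketch in portable Scheme. This is not the code from the branch: `nest` and `toc` are hypothetical names, and headings are assumed to arrive as `(level . title)` pairs in document order.

```scheme
;; Collect the entries that belong at `level` or deeper.
;; Returns two values: the nested entries and the unconsumed headings.
(define (nest headings level)
  (let loop ((hs headings) (acc '()))
    (cond ((or (null? hs) (< (caar hs) level))
           (values (reverse acc) hs))
          (else
           ;; The current heading becomes an entry here; everything
           ;; deeper than it that follows becomes its children.
           (call-with-values
               (lambda () (nest (cdr hs) (+ (caar hs) 1)))
             (lambda (children rest)
               (loop rest
                     (cons (if (null? children)
                               (list (cdar hs))
                               (list (cdar hs) children))
                           acc))))))))

;; Build the whole tree starting at the top heading level.
(define (toc headings)
  (call-with-values (lambda () (nest headings 1))
    (lambda (tree rest) tree)))

(toc '((1 . "Intro") (2 . "Usage") (2 . "API") (1 . "License")))
;; => (("Intro" (("Usage") ("API"))) ("License"))
```

Each entry is a list of the title followed by an optional list of child entries, which maps directly onto nested `ul`/`li` elements.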

Ivan: I'd appreciate it if you could have a look at the LaTeX and texinfo code. It seems to be broken (even for the wiki-parse version). I've hacked it a little to get it to at least output some stuff again with svnwiki-sxml, but it's not working properly (for example the document headers are not included)

mario: Could you update qwiki and multidoc on call-cc.org? You will have to remove all references to qwiki-nowiki, since svnwiki-sxml takes care of the nowiki parsing all by itself (@ivan & zbigniew: this may cause some trouble with the LaTeX and texinfo output)

comment:20 in reply to:  19 ; Changed 11 years ago by Ivan Raikov

Ok, I will look at that. I use the LaTeX backend with wiki-parse generated output on an almost daily basis, so it should be working, but maybe I am using some additional templates that you don't have. I will check and let you know.

Replying to sjamaan:

Ivan: I'd appreciate it if you could have a look at the LaTeX and texinfo code. It seems to be broken (even for the wiki-parse version). I've hacked it a little to get it to at least output some stuff again with svnwiki-sxml, but it's not working properly (for example the document headers are not included)

comment:21 Changed 11 years ago by sjamaan

Just for clarification: I think it's the qwiki-specific LaTeX and texinfo generation that's broken. The multidoc part is okay.

comment:22 in reply to:  20 ; Changed 11 years ago by Ivan Raikov

The LaTeX backend should be working now, I have committed a fix in the qwiki trunk. Let me know if you want me to merge this in your branch also.

Replying to iraikov:

Ok, I will look at that. I use the LaTeX backend with wiki-parse generated output on an almost daily basis, so it should be working, but maybe I am using some additional templates that you don't have. I will check and let you know.

Replying to sjamaan:

Ivan: I'd appreciate it if you could have a look at the LaTeX and texinfo code. It seems to be broken (even for the wiki-parse version). I've hacked it a little to get it to at least output some stuff again with svnwiki-sxml, but it's not working properly (for example the document headers are not included)

comment:23 in reply to:  22 Changed 11 years ago by sjamaan

Replying to iraikov:

The LaTeX backend should be working now, I have committed a fix in the qwiki trunk. Let me know if you want me to merge this in your branch also.

I'd appreciate that. I intend to merge the qwiki and multidoc "svnwiki-sxml" branches into their respective trunks sometime soon, unless there are any objections. svnwiki-sxml is a much better parser and I've already made some other important changes in that branch.

I think we should mark the wiki-parse egg as deprecated or even remove it from the egg repo altogether.

comment:24 Changed 11 years ago by sjamaan

I saw you've pulled the changes you made into the branch.
I'd like to reintegrate the multidoc & qwiki branches into their trunks again. Is that alright with you, Ivan?

comment:25 in reply to:  24 ; Changed 11 years ago by Ivan Raikov

Yes, I looked at the code and it seems ok. One minor thing is that the `section' rule starts with a lowercase letter in the HTML backend and with an uppercase letter in the LaTeX backend. We should probably choose a consistent capitalization for all rules.

Replying to sjamaan:

I saw you've pulled the changes you made into the branch.
I'd like to reintegrate the multidoc & qwiki branches into their trunks again. Is that alright with you, Ivan?

comment:26 in reply to:  25 ; Changed 11 years ago by sjamaan

Replying to iraikov:

Yes, I looked at the code and it seems ok. One minor thing is that the `section' rule starts with a lowercase letter in the HTML backend and with an uppercase letter in the LaTeX backend. We should probably choose a consistent capitalization for all rules.

That's odd, I don't see that. Are you sure you're looking at the branch code? svnwiki-sxml generates all-lowercase section elements, so if there's a capital somewhere it's probably broken.

comment:27 in reply to:  26 Changed 11 years ago by Ivan Raikov

Ok, I see now that I was looking at the trunk files instead of the svnwiki-sxml branch. I think that the examples and highlight rule definitely do not belong in the core multidoc, but should be instead part of svnwiki and its extensions. Particularly highlight, something like this would almost certainly be handled by external parsers.

Replying to sjamaan:

Replying to iraikov:

Yes, I looked at the code and it seems ok. One minor thing is that the `section' rule starts with a lowercase letter in the HTML backend and with an uppercase letter in the LaTeX backend. We should probably choose a consistent capitalization for all rules.

That's odd, I don't see that. Are you sure you're looking at the branch code? svnwiki-sxml generates all-lowercase section elements, so if there's a capital somewhere it's probably broken.

comment:28 Changed 11 years ago by sjamaan

Resolution: fixed
Status: new → closed

I've copied over the files from the svnwiki-sxml branches of multidoc and qwiki to their respective trunks and closed the branches.

We could remove or hide the wiki-parse egg. I'll leave that up to you, Ivan.

comment:29 Changed 11 years ago by sjamaan

PS: You're absolutely right about the highlight and examples. I've moved them to qwiki. The highlighting now uses my new "colorize" egg for highlighting, but that only handles html output. For LaTeX and Texinfo it still simply puts the code in a preformatted block.

I'm not sure how and IF we should add LaTeX and Texinfo highlighting. I'd prefer to keep the colorize egg as unmodified as possible since this eases synchronization between our code and Lisppaste's code.

comment:30 in reply to:  29 Changed 11 years ago by Ivan Raikov

Thanks for porting colorize to Chicken. Indeed, there is no easy solution to syntax highlighting in Texinfo, and while there are several packages for colorized highlighting in LaTeX, they don't quite have the flexibility necessary for Scheme code. It would be nice to port the PLT Scheme syntax highlighting module (which is based on combinator parsing), but that is quite low on my list of priorities. I will mark wiki-parse as obsolete in the next few days.

Replying to sjamaan:

PS: You're absolutely right about the highlight and examples. I've moved them to qwiki. The highlighting now uses my new "colorize" egg for highlighting, but that only handles html output. For LaTeX and Texinfo it still simply puts the code in a preformatted block.

I'm not sure how and IF we should add LaTeX and Texinfo highlighting. I'd prefer to keep the colorize egg as unmodified as possible since this eases synchronization between our code and Lisppaste's code.

comment:31 Changed 11 years ago by sjamaan

Do you have a link to that PLT code? I'd be interested to have a (quick) look.

Also, for LaTeX there's one pretty decent highlighter that I know of and that's Dorai Sitaram's slatex/tex2page highlighter. It produces pretty output (I think it's used in the html docs for r5rs and throughout PLT's online documentation) which works both in LaTeX and HTML. In fact, I started out by looking at that code, but decided against it because it's rather nasty, and Lisppaste supports many more languages which would be good for example code in C and the "chicken for programmers in other languages" pages.

But so that we at least have *some* highlighting (even if it's Scheme-only), we could use the slatex code. I don't know if that also works for texinfo. I don't even know if texinfo has color support at all ;)

comment:32 Changed 10 years ago by felix winkelmann

Milestone: 4.6.0

Milestone 4.6.0 deleted
