source: project/wiki/eggref/5/html-parser @ 35716

Last change on this file since 35716 was 35716, checked in by felix winkelmann, 2 years ago

[[tags: egg]]

== html-parser

[[toc:]]

=== Description

A permissive, scalable HTML parser.

=== Author

[[/users/alex-shinn|Alex Shinn]]

=== Documentation

{{html-parser}} is intended as a permissive HTML parser for people who
prefer the scalable interface described in Oleg Kiselyov's SSAX
parser, and it also provides simple convenience utilities.  It
correctly handles all invalid HTML, inserting "virtual" starting and
closing tags as needed to maintain the tree structure required by
the foldts down/up logic.  A major goal of this parser is
bug-for-bug compatibility with the way common web browsers parse HTML.

==== Main interface

===== make-html-parser

<procedure>(make-html-parser . keys)</procedure>

Returns a procedure of two arguments, an initial seed and an
optional input port, which parses the HTML document from the port
with the callbacks specified in the plist {{KEYS}} (using normal,
quoted symbols, for portability and to avoid making this a
macro).  The following callbacks are recognized:

  START: TAG ATTRS SEED VIRTUAL?
      fdown in foldts, called when a start-tag is encountered.
    TAG:         tag name
    ATTRS:       tag attributes as an alist
    SEED:        current seed value
    VIRTUAL?:    #t iff this start tag was inserted to fix the HTML tree

  END: TAG ATTRS PARENT-SEED SEED VIRTUAL?
      fup in foldts, called when an end-tag is encountered.
    TAG:         tag name
    ATTRS:       tag attributes of the corresponding start tag
    PARENT-SEED: parent seed value (i.e. seed passed to the start tag)
    SEED:        current seed value
    VIRTUAL?:    #t iff this end tag was inserted to fix the HTML tree

  TEXT: TEXT SEED
      fhere in foldts, called when any text is encountered.  May be
      called multiple times between a start and end tag, so you need
      to string-append yourself if desired.
    TEXT:        entity-decoded text
    SEED:        current seed value

  COMMENT: TEXT SEED
      fhere on comment data

  DECL: NAME ATTRS SEED
      fhere on declaration data

  PROCESS: LIST SEED
      fhere on process-instruction data

In addition, entity mappings may be overridden with the {{ENTITIES:}}
keyword.
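
As an illustration of how the seed is threaded through the callbacks, here is a minimal sketch of a parser that collects the {{href}} values of {{a}} tags.  The names {{attr-value}}, {{collect-links}} and {{links-in-string}} are hypothetical and not part of the egg.  The sketch assumes that tag and attribute names arrive as lowercase symbols and that each attribute entry looks like {{(name "value")}}, as the SXML produced by {{html->sxml}} suggests; the helper also tolerates dotted pairs and valueless attributes in case that assumption does not hold.

<enscript highlight=scheme>
(import html-parser (chicken port))

;; Hypothetical helper: look up an attribute value in the ATTRS alist.
;; Assumes entries of the form (name "value"); also accepts
;; (name . "value") and returns #f for missing or valueless attributes.
(define (attr-value name attrs)
  (let ((entry (assq name attrs)))
    (cond ((not entry) #f)
          ((string? (cdr entry)) (cdr entry))
          ((pair? (cdr entry)) (cadr entry))
          (else #f))))

;; The seed is the list of links found so far, most recent first.
(define collect-links
  (make-html-parser
   ;; On <a href=...>, push the link; the result seeds the children.
   'start: (lambda (tag attrs seed virtual?)
             (let ((href (and (eq? tag 'a) (attr-value 'href attrs))))
               (if href (cons href seed) seed)))
   ;; Keep everything the children accumulated.
   'end:   (lambda (tag attrs parent-seed seed virtual?) seed)
   ;; Text is irrelevant here, so pass the seed through unchanged.
   'text:  (lambda (text seed) seed)))

;; Parse a string via a string port and return links in document order.
(define (links-in-string str)
  (reverse
   (call-with-input-string str
     (lambda (port) (collect-links '() port)))))
</enscript>

Returning {{seed}} rather than {{parent-seed}} from the end callback is what preserves links found inside nested elements.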

==== Convenience functions

===== html->sxml

<procedure>(html->sxml [port])</procedure>

Returns the SXML representation of the document from {{PORT}}, using
the default parsing options.
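
A short usage sketch, assuming CHICKEN 5 with the {{(chicken port)}} and {{(chicken pretty-print)}} modules available; the names {{sample}} and {{doc}} are illustrative only.  It parses a document held in a string by wrapping it in a string port:

<enscript highlight=scheme>
(import html-parser (chicken port) (chicken pretty-print))

;; Illustrative input; any HTML text will do.
(define sample
  "<html><body><p class=\"intro\">Hello, <b>world</b>!</p></body></html>")

;; call-with-input-string opens a string port and passes it to html->sxml.
(define doc (call-with-input-string sample html->sxml))

;; Pretty-print the resulting SXML tree for inspection.
(pp doc)
</enscript>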

===== html-strip

<procedure>(html-strip [port])</procedure>

Returns a string representation of the document from {{PORT}} with all
tags removed.  No whitespace reduction or other rendering is done.
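
A brief usage sketch, again assuming the {{(chicken port)}} module for string ports; {{plain}} is an illustrative name.  Since no whitespace reduction is done, any line breaks and indentation in the source markup are expected to survive in the result:

<enscript highlight=scheme>
(import html-parser (chicken port))

;; Strip the tags from a small fragment supplied through a string port.
(define plain
  (call-with-input-string "<p>Hello, <b>world</b>!</p>" html-strip))

;; Show the extracted text.
(write plain)
(newline)
</enscript>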

=== Examples

This is the definition of the {{html->sxml}} convenience function
included in the egg:

<enscript highlight=scheme>
(define html->sxml
  (let ((parse
         (make-html-parser
          'start: (lambda (tag attrs seed virtual?) '())
          'end:   (lambda (tag attrs parent-seed seed virtual?)
                    `((,tag ,@(if (pair? attrs)
                                  `((@ ,@attrs) ,@(reverse seed))
                                  (reverse seed)))
                      ,@parent-seed))
          'decl:    (lambda (tag attrs seed) `((*DECL* ,tag ,@attrs) ,@seed))
          'process: (lambda (attrs seed) `((*PI* ,@attrs) ,@seed))
          'comment: (lambda (text seed) `((*COMMENT* ,text) ,@seed))
          'text:    (lambda (text seed) (cons text seed)))))
    (lambda o
      (reverse (apply parse '() o)))))
</enscript>

The parser for {{html-strip}} could be defined as:

<enscript highlight=scheme>
(make-html-parser
  'start: (lambda (tag attrs seed virtual?) seed)
  'end:   (lambda (tag attrs parent-seed seed virtual?) seed)
  'text:  (lambda (text seed) (display text)))
</enscript>
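
The parser above writes to the current output port, so turning it into a string-returning procedure requires capturing that output.  One possible way to do this, sketched here with {{with-output-to-string}} from {{(chicken port)}}, is shown below; {{strip-parser}} and {{my-html-strip}} are hypothetical names, this is not necessarily how the egg implements {{html-strip}}, and the text callback here additionally returns the seed so that a defined value is always threaded through.

<enscript highlight=scheme>
(import html-parser (chicken port))

;; Same callbacks as above, except that the text callback returns the
;; seed after printing.
(define strip-parser
  (make-html-parser
   'start: (lambda (tag attrs seed virtual?) seed)
   'end:   (lambda (tag attrs parent-seed seed virtual?) seed)
   'text:  (lambda (text seed) (display text) seed)))

;; Hypothetical wrapper: run the parser on PORT and collect everything
;; it displays into a string.  The seed is unused, so #f is passed.
(define (my-html-strip port)
  (with-output-to-string
    (lambda () (strip-parser #f port))))
</enscript>

It could then be called as {{(call-with-input-string "<p>hi</p>" my-html-strip)}}.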

=== Changelog

* 0.2 ported to CHICKEN 5 (by felix)
* 0.1 Import upstream as of 2009-01-25

=== License

BSD-style license: [[http://synthcode.com/license.txt]].