source: project/release/4/ssax/SSAX.scm @ 15175

Last change on this file since 15175 was 15175, checked in by Ivan Raikov, 12 years ago

removed definitions of fold from ssax (uses srfi-1)

File size: 121.7 KB
Line 
1;       Functional XML parsing framework: SAX/DOM and SXML parsers
2;             with support for XML Namespaces and validation
3;
4; This is a package of low-to-high level lexing and parsing procedures
5; that can be combined to yield a SAX, a DOM, a validating parsers, or
6; a parser intended for a particular document type. The procedures in
7; the package can be used separately to tokenize or parse various
8; pieces of XML documents. The package supports XML Namespaces,
9; internal and external parsed entities, user-controlled handling of
10; whitespace, and validation. This module therefore is intended to be
11; a framework, a set of "Lego blocks" you can use to build a parser
12; following any discipline and performing validation to any degree. As
13; an example of the parser construction, this file includes a
14; semi-validating SXML parser.
15
16; The present XML framework has a "sequential" feel of SAX yet a
17; "functional style" of DOM. Like a SAX parser, the framework scans
18; the document only once and permits incremental processing. An
19; application that handles document elements in order can run as
20; efficiently as possible. _Unlike_ a SAX parser, the framework does
21; not require an application register stateful callbacks and surrender
22; control to the parser. Rather, it is the application that can drive
23; the framework -- calling its functions to get the current lexical or
24; syntax element. These functions do not maintain or mutate any state
25; save the input port. Therefore, the framework permits parsing of XML
26; in a pure functional style, with the input port being a monad (or a
27; linear, read-once parameter).
28
29; Besides the PORT, there is another monad -- SEED. Most of the
30; middle- and high-level parsers are single-threaded through the
31; seed. The functions of this framework do not process or affect the
32; SEED in any way: they simply pass it around as an instance of an
33; opaque datatype.  User functions, on the other hand, can use the
34; seed to maintain user's state, to accumulate parsing results, etc. A
35; user can freely mix his own functions with those of the
36; framework. On the other hand, the user may wish to instantiate a
37; high-level parser: ssax:make-elem-parser or ssax:make-parser.  In
38; the latter case, the user must provide functions of specific
39; signatures, which are called at predictable moments during the
40; parsing: to handle character data, element data, or processing
41; instructions (PI). The functions are always given the SEED, among
42; other parameters, and must return the new SEED.
43
44; From a functional point of view, XML parsing is a combined
45; pre-post-order traversal of a "tree" that is the XML document
46; itself. This down-and-up traversal tells the user about an element
47; when its start tag is encountered. The user is notified about the
48; element once more, after all element's children have been
49; handled. The process of XML parsing therefore is a fold over the
50; raw XML document. Unlike a fold over trees defined in [1], the
51; parser is necessarily single-threaded -- obviously as elements
52; in a text XML document are laid down sequentially. The parser
53; therefore is a tree fold that has been transformed to accept an
54; accumulating parameter [1,2].
55
56; Formally, the denotational semantics of the parser can be expressed
57; as
58; parser:: (Start-tag -> Seed -> Seed) ->
59;          (Start-tag -> Seed -> Seed -> Seed) ->
60;          (Char-Data -> Seed -> Seed) ->
61;          XML-text-fragment -> Seed -> Seed
62; parser fdown fup fchar "<elem attrs> content </elem>" seed
63;  = fup "<elem attrs>" seed
64;       (parser fdown fup fchar "content" (fdown "<elem attrs>" seed))
65;
66; parser fdown fup fchar "char-data content" seed
67;  = parser fdown fup fchar "content" (fchar "char-data" seed)
68;
69; parser fdown fup fchar "elem-content content" seed
70;  = parser fdown fup fchar "content" (
71;       parser fdown fup fchar "elem-content" seed)
72
73; Compare the last two equations with the left fold
74; fold-left kons elem:list seed = fold-left kons list (kons elem seed)
75
76; The real parser created my ssax:make-parser is slightly more complicated,
77; to account for processing instructions, entity references, namespaces,
78; processing of document type declaration, etc.
79
80
81; The XML standard document referred to in this module is
82;       http://www.w3.org/TR/1998/REC-xml-19980210.html
83;
84; The present file also defines a procedure that parses the text of an
85; XML document or of a separate element into SXML, an
86; S-expression-based model of an XML Information Set. SXML is also an
87; Abstract Syntax Tree of an XML document. SXML is similar
88; but not identical to DOM; SXML is particularly suitable for
89; Scheme-based XML/HTML authoring, SXPath queries, and tree
90; transformations. See SXML.html for more details.
91; SXML is a term implementation of evaluation of the XML document [3].
92; The other implementation is context-passing.
93
94; The present frameworks fully supports the XML Namespaces Recommendation:
95;       http://www.w3.org/TR/REC-xml-names/
96; Other links:
97; [1] Jeremy Gibbons, Geraint Jones, "The Under-appreciated Unfold,"
98; Proc. ICFP'98, 1998, pp. 273-279.
99; [2] Richard S. Bird, The promotion and accumulation strategies in
100; transformational programming, ACM Trans. Progr. Lang. Systems,
101; 6(4):487-504, October 1984.
102; [3] Ralf Hinze, "Deriving Backtracking Monad Transformers,"
103; Functional Pearl. Proc ICFP'00, pp. 186-197.
104
105; IMPORT
106; parser-error ssax:warn, see Handling of errors, below
107; functions declared in files util.scm, input-parse.scm and look-for-str.scm
108; char-encoding.scm for various platform-specific character-encoding functions.
109; From SRFI-13: string-concatenate/shared and string-concatenate-reverse/shared
110; If a particular implementation lacks SRFI-13 support, please
111; include the file srfi-13-local.scm
112
113; Handling of errors
114; This package relies on a function parser-error, which must be defined
115; by a user of the package. The function has the following signature:
116;       parser-error PORT MESSAGE SPECIALISING-MSG*
117; Many procedures of this package call 'parser-error' whenever a
118; parsing, well-formedness or validation error is encountered. The
119; first argument is a port, which typically points to the offending
120; character or its neighborhood. Most of the Scheme systems let the
121; user query a PORT for the current position. The MESSAGE argument
122; indicates a failed XML production or a failed XML constraint. The
123; latter is referred to by its anchor name in the XML Recommendation
124; or XML Namespaces Recommendation. The parsing library (e.g.,
125; next-token, assert-curr-char) invoke 'parser-error' as well, in
126; exactly the same way.  See input-parse.scm for more details.
127; See
128;       http://pair.com/lisovsky/download/parse-error.scm
129; for an excellent example of such a redefined parser-error function.
130;
131; In addition, the present code invokes a function ssax:warn
132;   ssax:warn PORT MESSAGE SPECIALISING-MSG*
133; to notify the user about warnings that are NOT errors but still
134; may alert the user.
135;
136; Again, parser-error and ssax:warn are supposed to be defined by the
137; user. However, if a run-test macro below is set to include
138; self-tests, this present code does provide the definitions for these
139; functions to allow tests to run.
140
141; Misc notes
142; It seems it is highly desirable to separate tests out in a dedicated
143; file.
144;
145; Jim Bender wrote on Mon, 9 Sep 2002 20:03:42 EDT on the SSAX-SXML
146; mailing list (message A fine-grained "lego")
147; The task was to record precise source location information, as PLT
148; does with its current XML parser. That parser records the start and
149; end location (filepos, line#, column#) for pi, elements, attributes,
150; chuncks of "pcdata".
151; As suggested above, though, in some cases I needed to be able force
152; open an interface that did not yet exist. For instance, I added an
153; "end-char-data-hook", which would be called at the end of char-data
154; fragment. This returns a function of type (seed -> seed) which is
155; invoked on the current seed only if read-char-data has indeed reached
156; the end of a block of char data (after reading a new token.
157; But the deepest interface that I needed to expose was that of reading
158; attributes. In the official distribution, this is not even a separate
159; function. Instead, it is embedded within SSAX:read-attributes.  This
160; required some small re-structuring as well.
161; This definitely will not be to everyone's taste (nor needed by most).
162; Certainly, the existing make-parser interface addresses most custom
163; needs. And likely 80-90 lines of a "link specification" to create a
164; parser from many tiny little lego blocks may please only a few, while
165; appalling others.
166; The code is available at http://celtic.benderweb.net/ssax-lego.plt or
167; http://celtic.benderweb.net/ssax-lego.tar.gz
168; In the examples directory, I provide:
169; - a unit version of the make-parser interface,
170; - a simple SXML parser using that interface,
171; - an SXML parser which directly uses the "new lego",
172; - a pseudo-SXML parser, which records source location information
173; - and lastly a parser which returns the structures used in PLT's xml
174; collection, with source location information
175
176; $Id: SSAX.scm,v 5.4 2004/11/09 20:22:26 sperber Exp $
177;^^^^^^^^^
178
179
180        ; See the Makefile in the ../tests directory
181        ; (in particular, the rule vSSAX) for an example of how
182        ; to run this code on various Scheme systems.
183        ; See SSAX examples for many samples of using this code,
184        ; again, on a variety of Scheme systems.
185        ; See http://ssax.sf.net/
186
187
188; The following macro runs built-in test cases -- or does not run,
189; depending on which of the two cases below you commented out
190; Case 1: no tests:
191;(define-macro run-test (lambda body '(begin #f)))
192
193(define-syntax run-test (syntax-rules () ((run-test . args) (begin #f))))
194
195; Case 2: with tests.
196; The following macro could've been defined just as
197; (define-macro run-test (lambda body `(begin (display "\n-->Test\n") ,@body)))
198;
199; Instead, it's more involved, to make up for case-insensitivity of
200; symbols on some Scheme systems. In Gambit, symbols are case
201; sensitive: (eq? 'A 'a) is #f and (eq? 'Aa (string->symbol "Aa")) is
202; #t.  On some systems, symbols are case-insensitive and just the
203; opposite is true.  Therefore, we introduce a notation '"ASymbol" (a
204; quoted string) that stands for a case-_sensitive_ ASymbol -- on any
205; R5RS Scheme system. This notation is valid only within the body of
206; run-test.
207; The notation is implemented by scanning the run-test's
208; body and replacing every occurrence of (quote "str") with the result
209; of (string->symbol "str"). We can do such a replacement at macro-expand
210; time (rather than at run time).
211
212; Here's the previous version of run-test, implemented as a low-level
213; macro.
214; (define-macro run-test
215;   (lambda body
216;     (define (re-write body)
217;       (cond
218;        ((vector? body)
219;       (list->vector (re-write (vector->list body))))
220;        ((not (pair? body)) body)
221;        ((and (eq? 'quote (car body)) (pair? (cdr body))
222;            (string? (cadr body)))
223;       (string->symbol (cadr body)))
224;        (else (cons (re-write (car body)) (re-write (cdr body))))))
225;     (cons 'begin (re-write body))))
226;
227; For portability, it is re-written as syntax-rules. The syntax-rules
228; version is less powerful: for example, it can't handle
229; (case x (('"Foo") (do-on-Foo))) whereas the low-level macro
230; could correctly place a case-sensitive symbol at the right place.
231; We also do not scan vectors (because we don't use them here).
232; Twice-deep quasiquotes aren't handled either.
233; Still, the syntax-rules version satisfies our immediate needs.
234; Incidentally, I originally didn't believe that the macro below
235; was at all possible.
236;
237; The macro is written in a continuation-passing style. A continuation
238; typically has the following structure: (k-head ! . args)
239; When the continuation is invoked, we expand into
240; (k-head <computed-result> . arg). That is, the dedicated symbol !
241; is the placeholder for the result.
242;
243; It seems that the most modular way to write the run-test macro would
244; be the following
245;
246; (define-syntax run-test
247;  (syntax-rules ()
248;   ((run-test . ?body)
249;     (letrec-syntax
250;       ((scan-exp                      ; (scan-exp body k)
251;        (syntax-rules (quote quasiquote !)
252;          ((scan-exp (quote (hd . tl)) k)
253;            (scan-lit-lst (hd . tl) (do-wrap ! quasiquote k)))
254;          ((scan-exp (quote x) (k-head ! . args))
255;            (k-head
256;              (if (string? (quote x)) (string->symbol (quote x)) (quote x))
257;              . args))
258;          ((scan-exp (hd . tl) k)
259;            (scan-exp hd (do-tl ! scan-exp tl k)))
260;          ((scan-exp x (k-head ! . args))
261;            (k-head x . args))))
262;       (do-tl
263;         (syntax-rules (!)
264;           ((do-tl processed-hd fn () (k-head ! . args))
265;             (k-head (processed-hd) . args))
266;           ((do-tl processed-hd fn old-tl k)
267;             (fn old-tl (do-cons ! processed-hd k)))))
268;       ...
269;       (do-finish
270;         (syntax-rules ()
271;           ((do-finish (new-body)) new-body)
272;           ((do-finish new-body) (begin . new-body))))
273;       ...
274;       (scan-exp ?body (do-finish !))
275; ))))
276;
277; Alas, that doesn't work on all systems. We hit yet another dark
278; corner of the R5RS macros. The reason is that run-test is used in
279; the code below to introduce definitions. For example:
280; (run-test
281;  (define (ssax:warn port msg . other-msg)
282;    (apply cerr (cons* nl "Warning: " msg other-msg)))
283; )
284; This code expands to
285; (begin
286;    (define (ssax:warn port msg . other-msg) ...))
287; so the definition gets spliced in into the top level. Right?
288; Well, On Petite Chez Scheme it is so. However, many other systems
289; don't like this approach. The reason is that the invocation of
290; (run-test (define (ssax:warn port msg . other-msg) ...))
291; first expands into
292; (letrec-syntax (...)
293;   (scan-exp ((define (ssax:warn port msg . other-msg) ...)) ...))
294; because of the presence of (letrec-syntax ...), the begin form that
295; is generated eventually is no longer at the top level! The begin
296; form in Scheme is an overloading of two distinct forms: top-level
297; begin and the other begin. The forms have different rules: for example,
298; (begin (define x 1)) is OK for a top-level begin but not OK for
299; the other begin. Some Scheme systems see the that the macro
300; (run-test ...) expands into (letrec-syntax ...) and decide right there
301; that any further (begin ...) forms are NOT top-level begin forms.
302; The only way out is to make sure all our macros are top-level.
303; The best approach <sigh> seems to be to make run-test one huge
304; top-level macro.
305
306
307;; (define-syntax run-test
308;;  (syntax-rules (define)
309;;    ((run-test "scan-exp" (define vars body))
310;;     (define vars (run-test "scan-exp" body)))
311;;    ((run-test "scan-exp" ?body)
312;;     (letrec-syntax
313;;       ((scan-exp                     ; (scan-exp body k)
314;;       (syntax-rules (quote quasiquote !)
315;;         ((scan-exp '() (k-head ! . args))
316;;           (k-head '() . args))
317;;         ((scan-exp (quote (hd . tl)) k)
318;;           (scan-lit-lst (hd . tl) (do-wrap ! quasiquote k)))
319;;         ((scan-exp (quasiquote (hd . tl)) k)
320;;           (scan-lit-lst (hd . tl) (do-wrap ! quasiquote k)))
321;;         ((scan-exp (quote x) (k-head ! . args))
322;;           (k-head
323;;             (if (string? (quote x)) (string->symbol (quote x)) (quote x))
324;;             . args))
325;;         ((scan-exp (hd . tl) k)
326;;           (scan-exp hd (do-tl ! scan-exp tl k)))
327;;         ((scan-exp x (k-head ! . args))
328;;           (k-head x . args))))
329;;      (do-tl
330;;        (syntax-rules (!)
331;;          ((do-tl processed-hd fn () (k-head ! . args))
332;;            (k-head (processed-hd) . args))
333;;          ((do-tl processed-hd fn old-tl k)
334;;            (fn old-tl (do-cons ! processed-hd k)))))
335;;      (do-cons
336;;        (syntax-rules (!)
337;;          ((do-cons processed-tl processed-hd (k-head ! . args))
338;;            (k-head (processed-hd . processed-tl) . args))))
339;;      (do-wrap
340;;        (syntax-rules (!)
341;;          ((do-wrap val fn (k-head ! . args))
342;;            (k-head (fn val) . args))))
343;;      (do-finish
344;;        (syntax-rules ()
345;;          ((do-finish new-body) new-body)))
346
347;;      (scan-lit-lst                   ; scan literal list
348;;        (syntax-rules (quote unquote unquote-splicing !)
349;;         ((scan-lit-lst '() (k-head ! . args))
350;;           (k-head '() . args))
351;;         ((scan-lit-lst (quote (hd . tl)) k)
352;;           (do-tl quote scan-lit-lst ((hd . tl)) k))
353;;         ((scan-lit-lst (unquote x) k)
354;;           (scan-exp x (do-wrap ! unquote k)))
355;;         ((scan-lit-lst (unquote-splicing x) k)
356;;           (scan-exp x (do-wrap ! unquote-splicing k)))
357;;         ((scan-lit-lst (quote x) (k-head ! . args))
358;;           (k-head
359;;             ,(if (string? (quote x)) (string->symbol (quote x)) (quote x))
360;;             . args))
361;;          ((scan-lit-lst (hd . tl) k)
362;;            (scan-lit-lst hd (do-tl ! scan-lit-lst tl k)))
363;;          ((scan-lit-lst x (k-head ! . args))
364;;            (k-head x . args))))
365;;      )
366;;       (scan-exp ?body (do-finish !))))
367;;   ((run-test body ...)
368;;    (begin
369;;      (run-test "scan-exp" body) ...))
370;; ))
371
372;========================================================================
373;                               Data Types
374
375; TAG-KIND
376;       a symbol 'START, 'END, 'PI, 'DECL, 'COMMENT, 'CDSECT
377;               or 'ENTITY-REF that identifies a markup token
378
379; UNRES-NAME
380;       a name (called GI in the XML Recommendation) as given in an xml
381;       document for a markup token: start-tag, PI target, attribute name.
382;       If a GI is an NCName, UNRES-NAME is this NCName converted into
383;       a Scheme symbol. If a GI is a QName, UNRES-NAME is a pair of
384;       symbols: (PREFIX . LOCALPART)
385
386; RES-NAME
387;       An expanded name, a resolved version of an UNRES-NAME.
388;       For an element or an attribute name with a non-empty namespace URI,
389;       RES-NAME is a pair of symbols, (URI-SYMB . LOCALPART).
390;       Otherwise, it's a single symbol.
391
392; ELEM-CONTENT-MODEL
393;       A symbol:
394;       ANY       - anything goes, expect an END tag.
395;       EMPTY-TAG - no content, and no END-tag is coming
396;       EMPTY     - no content, expect the END-tag as the next token
397;       PCDATA    - expect character data only, and no children elements
398;       MIXED
399;       ELEM-CONTENT
400
401; URI-SYMB
402;       A symbol representing a namespace URI -- or other symbol chosen
403;       by the user to represent URI. In the former case,
404;       URI-SYMB is created by %-quoting of bad URI characters and
405;       converting the resulting string into a symbol.
406
407; NAMESPACES
408;       A list representing namespaces in effect. An element of the list
409;       has one of the following forms:
410;       (PREFIX URI-SYMB . URI-SYMB) or
411;       (PREFIX USER-PREFIX . URI-SYMB)
412;               USER-PREFIX is a symbol chosen by the user
413;               to represent the URI.
414;       (#f USER-PREFIX . URI-SYMB)
415;               Specification of the user-chosen prefix and a URI-SYMBOL.
416;       (*DEFAULT* USER-PREFIX . URI-SYMB)
417;               Declaration of the default namespace
418;       (*DEFAULT* #f . #f)
419;               Un-declaration of the default namespace. This notation
420;               represents overriding of the previous declaration
421;       A NAMESPACES list may contain several elements for the same PREFIX.
422;       The one closest to the beginning of the list takes effect.
423
424; ATTLIST
425;       An ordered collection of (NAME . VALUE) pairs, where NAME is
426;       a RES-NAME or an UNRES-NAME. The collection is an ADT
427
428; STR-HANDLER
429;       A procedure of three arguments: STRING1 STRING2 SEED
430;       returning a new SEED
431;       The procedure is supposed to handle a chunk of character data
432;       STRING1 followed by a chunk of character data STRING2.
433;       STRING2 is a short string, often "\n" and even ""
434
435; ENTITIES
436;       An assoc list of pairs:
437;          (named-entity-name . named-entity-body)
438;       where named-entity-name is a symbol under which the entity was
439;       declared, named-entity-body is either a string, or
440;       (for an external entity) a thunk that will return an
441;       input port (from which the entity can be read).
442;       named-entity-body may also be #f. This is an indication that a
443;       named-entity-name is currently being expanded. A reference to
444;       this named-entity-name will be an error: violation of the
445;       WFC nonrecursion.
446
447; XML-TOKEN -- a record
448
449; In Gambit, you can use the following declaration:
450; (define-structure xml-token kind head)
451; The following declaration is "standard" as it follows SRFI-9:
452; (define-record-type  xml-token  (make-xml-token kind head)  xml-token?
453;   (kind  xml-token-kind)
454;   (head  xml-token-head) )
455; No field mutators are declared as SSAX is a pure functional parser
456;
457; But to make the code more portable, we define xml-token simply as
458; a pair. It suffices for us. Furthermore, xml-token-kind and xml-token-head
459; can be defined as simple procedures. However, they are declared as
460; macros below for efficiency.
461
462; procedure: make-xml-token KIND HEAD
463; This creates an XML token.
464(define (make-xml-token kind head) (cons kind head))
465; procedure: xml-token? THING
466(define xml-token? pair?)
467; syntax: xml-token-kind XML-TOKEN
468(define-syntax xml-token-kind 
469  (syntax-rules () ((xml-token-kind token) (car token))))
470; syntax: xml-token-head XML-TOKEN
471(define-syntax xml-token-head 
472  (syntax-rules () ((xml-token-head token) (cdr token))))
473
474; (define-macro xml-token-kind (lambda (token) `(car ,token)))
475; (define-macro xml-token-head (lambda (token) `(cdr ,token)))
476
477; This record represents a markup, which is, according to the XML
478; Recommendation, "takes the form of start-tags, end-tags, empty-element tags,
479; entity references, character references, comments, CDATA section delimiters,
480; document type declarations, and processing instructions."
481;
482;       kind -- a TAG-KIND
483;       head -- an UNRES-NAME. For xml-tokens of kinds 'COMMENT and
484;               'CDSECT, the head is #f
485;
486; For example,
487;       <P>  => kind='START, head='P
488;       </P> => kind='END, head='P
489;       <BR/> => kind='EMPTY-EL, head='BR
490;       <!DOCTYPE OMF ...> => kind='DECL, head='DOCTYPE
491;       <?xml version="1.0"?> => kind='PI, head='xml
492;       &my-ent; => kind = 'ENTITY-REF, head='my-ent
493;
494; Character references are not represented by xml-tokens as these references
495; are transparently resolved into the corresponding characters.
496;
497
498
499
500; XML-DECL -- a record
501
502; The following is Gambit-specific, see below for a portable declaration
503;(define-structure xml-decl elems entities notations)
504
505; The record represents a datatype of an XML document: the list of
506; declared elements and their attributes, declared notations, list of
507; replacement strings or loading procedures for parsed general
508; entities, etc. Normally an xml-decl record is created from a DTD or
509; an XML Schema, although it can be created and filled in in many other
510; ways (e.g., loaded from a file).
511;
512; elems: an (assoc) list of decl-elem or #f. The latter instructs
513;       the parser to do no validation of elements and attributes.
514;
515; decl-elem: declaration of one element:
516;       (elem-name elem-content decl-attrs)
517;       elem-name is an UNRES-NAME for the element.
518;       elem-content is an ELEM-CONTENT-MODEL.
519;       decl-attrs is an ATTLIST, of (ATTR-NAME . VALUE) associations
520; !!!This element can declare a user procedure to handle parsing of an
521; element (e.g., to do a custom validation, or to build a hash of
522; IDs as they're encountered).
523;
524; decl-attr: an element of an ATTLIST, declaration of one attribute
525;       (attr-name content-type use-type default-value)
526;       attr-name is an UNRES-NAME for the declared attribute
527;       content-type is a symbol: CDATA, NMTOKEN, NMTOKENS, ...
528;               or a list of strings for the enumerated type.
529;       use-type is a symbol: REQUIRED, IMPLIED, FIXED
530;       default-value is a string for the default value, or #f if not given.
531;
532;
533
534; see a function make-empty-xml-decl to make a XML declaration entry
535; suitable for a non-validating parsing.
536
537
538;-------------------------
539; Utilities
540
541; The following is a function that is often used in validation tests,
542; to make sure that the computed result matches the expected one.
543; This function is a standard equal? predicate with one exception.
544; On Scheme systems where (string->symbol "A") and a symbol A
545; are the same, equal_? is precisely equal?
546; On other Scheme systems, we compare symbols disregarding their case.
547; Since this function is used only in tests, we don't have to
548; strive to make it efficient.
549(run-test
550 (define (equal_? e1 e2)
551   (if (eq? 'A (string->symbol "A")) (equal? e1 e2)
552       (cond
553        ((symbol? e1)
554         (and (symbol? e2) 
555              (string-ci=? (symbol->string e1) (symbol->string e2))))
556        ((pair? e1)
557         (and (pair? e2)
558              (equal_? (car e1) (car e2)) (equal_? (cdr e1) (cdr e2))))
559        ((vector? e1)
560         (and (vector? e2) (equal_? (vector->list e1) (vector->list e2))))
561        (else
562         (equal? e1 e2)))))
563)
564
565; The following function, which is often used in validation tests,
566; lets us conveniently enter newline, CR and tab characters in a character
567; string.
568;       unesc-string: ESC-STRING -> STRING
569; where ESC-STRING is a character string that may contain
570;    %n  -- for #\newline
571;    %r  -- for #\return
572;    %t  -- for #\tab
573;    %%  -- for #\%
574;
575; The result of unesc-string is a character string with all %-combinations
576; above replaced with their character equivalents
577
578(run-test
579 (define (unesc-string str)
580   (call-with-input-string str
581     (lambda (port)
582       (let loop ((frags '()))
583         (let* ((token (next-token '() '(#\% *eof*) "unesc-string" port))
584                (cterm (read-char port))
585                (frags (cons token frags)))
586           (if (eof-object? cterm) (string-concatenate-reverse/shared frags)
587             (let ((cchar (read-char port)))  ; char after #\%
588               (if (eof-object? cchar)
589                 (error "unexpected EOF after reading % in unesc-string:" str)
590                 (loop
591                   (cons
592                     (case cchar
593                       ((#\n) (string #\newline))
594                       ((#\r) (string char-return))
595                       ((#\t) (string char-tab))
596                       ((#\%) "%")
597                       (else (error "bad %-char in unesc-string:" cchar)))
598                     frags))))))))))
599)
600             
601
602; Test if a string is made of only whitespace
603; An empty string is considered made of whitespace as well
604(define (string-whitespace? str)
605  (let ((len (string-length str)))
606    (cond
607     ((zero? len) #t)
608     ((= 1 len) (char-whitespace? (string-ref str 0)))
609     ((= 2 len) (and (char-whitespace? (string-ref str 0))
610                     (char-whitespace? (string-ref str 1))))
611     (else
612      (let loop ((i 0))
613        (or (>= i len)
614            (and (char-whitespace? (string-ref str i))
615                 (loop (inc i)))))))))
616
617; Find val in alist
618; Return (values found-el remaining-alist) or
619;        (values #f alist)
620
621(define (assq-values val alist)
622  (let loop ((alist alist) (scanned '()))
623    (cond
624     ((null? alist) (values #f scanned))
625     ((equal? val (caar alist))
626      (values (car alist) (append scanned (cdr alist))))
627     (else
628      (loop (cdr alist) (cons (car alist) scanned))))))
629
630
631;========================================================================
632;               Lower-level parsers and scanners
633;
634; They deal with primitive lexical units (Names, whitespaces, tags)
635; and with pieces of more generic productions. Most of these parsers
636; must be called in appropriate context. For example, ssax:complete-start-tag
637; must be called only when the start-tag has been detected and its GI
638; has been read.
639
640;------------------------------------------------------------------------
641;                       Low-level parsing code
642
643; Skip the S (whitespace) production as defined by
644; [3] S ::= (#x20 | #x9 | #xD | #xA)
645; The procedure returns the first not-whitespace character it
646; encounters while scanning the PORT. This character is left
647; on the input stream.
648
649(define ssax:S-chars (map ascii->char '(32 10 9 13)))
650
651(define (ssax:skip-S port)
652  (skip-while ssax:S-chars port))
653
654
655; Read a Name lexem and return it as string
656; [4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':'
657;                  | CombiningChar | Extender
658; [5] Name ::= (Letter | '_' | ':') (NameChar)*
659;
660; This code supports the XML Namespace Recommendation REC-xml-names,
661; which modifies the above productions as follows:
662;
663; [4] NCNameChar ::= Letter | Digit | '.' | '-' | '_'
664;                       | CombiningChar | Extender
665; [5] NCName ::= (Letter | '_') (NCNameChar)*
666; As the Rec-xml-names says,
667; "An XML document conforms to this specification if all other tokens
668; [other than element types and attribute names] in the document which
669; are required, for XML conformance, to match the XML production for
670; Name, match this specification's production for NCName."
671; Element types and attribute names must match the production QName,
672; defined below.
673
674; Check to see if a-char may start a NCName
675(define (ssax:ncname-starting-char? a-char)
676  (and (char? a-char)
677    (or
678      (char-alphabetic? a-char)
679      (char=? #\_ a-char))))
680
681
682; Read a NCName starting from the current position in the PORT and
683; return it as a symbol.
684(define (ssax:read-NCName port)
685  (let ((first-char (peek-char port)))
686    (or (ssax:ncname-starting-char? first-char)
687      (parser-error port "XMLNS [4] for '" first-char "'")))
688  (string->symbol
689    (next-token-of
690      (lambda (c)
691        (cond
692          ((eof-object? c) #f)
693          ((char-alphabetic? c) c)
694          ((string-index "0123456789.-_" c) c)
695          (else #f)))
696      port)))
697
698; Read a (namespace-) Qualified Name, QName, from the current
699; position in the PORT.
700; From REC-xml-names:
701;       [6] QName ::= (Prefix ':')? LocalPart
702;       [7] Prefix ::= NCName
703;       [8] LocalPart ::= NCName
704; Return: an UNRES-NAME
705(define (ssax:read-QName port)
706  (let ((prefix-or-localpart (ssax:read-NCName port)))
707    (case (peek-char port)
708      ((#\:)                    ; prefix was given after all
709       (read-char port)         ; consume the colon
710       (cons prefix-or-localpart (ssax:read-NCName port)))
711      (else prefix-or-localpart) ; Prefix was omitted
712      )))
713
714; The prefix of the pre-defined XML namespace
715(define ssax:Prefix-XML (string->symbol "xml"))
716
717(run-test
718 (assert (eq? '_
719                 (call-with-input-string "_" ssax:read-NCName)))
720 (assert (eq? '_
721                 (call-with-input-string "_" ssax:read-QName)))
722 (assert (eq? (string->symbol "_abc_")
723              (call-with-input-string "_abc_;" ssax:read-NCName)))
724 (assert (eq? (string->symbol "_abc_")
725              (call-with-input-string "_abc_;" ssax:read-QName)))
726 (assert (eq? (string->symbol "_a.b")
727              (call-with-input-string "_a.b " ssax:read-QName)))
728 (assert (equal? (cons (string->symbol "_a.b") (string->symbol "d.1-ef-"))
729              (call-with-input-string "_a.b:d.1-ef-;" ssax:read-QName)))
730 (assert (equal? (cons (string->symbol "a") (string->symbol "b"))
731              (call-with-input-string "a:b:c" ssax:read-QName)))
732
733 (assert (failed? (call-with-input-string ":abc" ssax:read-NCName)))
734 (assert (failed? (call-with-input-string "1:bc" ssax:read-NCName)))
735)
736
737; Compare one RES-NAME or an UNRES-NAME with the other.
738; Return a symbol '<, '>, or '= depending on the result of
739; the comparison.
740; Names without PREFIX are always smaller than those with the PREFIX.
741(define name-compare
742  (letrec ((symbol-compare
743            (lambda (symb1 symb2)
744              (cond
745               ((eq? symb1 symb2) '=)
746               ((string<? (symbol->string symb1) (symbol->string symb2))
747                '<)
748               (else '>)))))
749    (lambda (name1 name2)
750      (cond
751       ((symbol? name1) (if (symbol? name2) (symbol-compare name1 name2)
752                            '<))
753       ((symbol? name2) '>)
754       ((eq? name2 ssax:largest-unres-name) '<)
755       ((eq? name1 ssax:largest-unres-name) '>)
756       ((eq? (car name1) (car name2))   ; prefixes the same
757        (symbol-compare (cdr name1) (cdr name2)))
758       (else (symbol-compare (car name1) (car name2)))))))
759
760; An UNRES-NAME that is postulated to be larger than anything that can occur in
761; a well-formed XML document.
762; name-compare enforces this postulate.
763(define ssax:largest-unres-name (cons
764                                  (string->symbol "#LARGEST-SYMBOL")
765                                  (string->symbol "#LARGEST-SYMBOL")))
766
767(run-test
768 (assert (eq? '= (name-compare 'ABC 'ABC)))
769 (assert (eq? '< (name-compare 'ABC 'ABCD)))
770 (assert (eq? '> (name-compare 'XB 'ABCD)))
771 (assert (eq? '> (name-compare '(HTML . PRE) 'PRE)))
772 (assert (eq? '< (name-compare 'HTML '(HTML . PRE))))
773 (assert (eq? '= (name-compare '(HTML . PRE) '(HTML . PRE))))
774 (assert (eq? '< (name-compare '(HTML . PRE) '(XML . PRE))))
775 (assert (eq? '> (name-compare '(HTML . PRE) '(HTML . P))))
776 (assert (eq? '< (name-compare '(HTML . PRE) ssax:largest-unres-name)))
777 (assert (eq? '< (name-compare '(ZZZZ . ZZZ) ssax:largest-unres-name)))
778 (assert (eq? '> (name-compare ssax:largest-unres-name '(ZZZZ . ZZZ) )))
779)
780
781
782
783; procedure: ssax:read-markup-token PORT
784; This procedure starts parsing of a markup token. The current position
785; in the stream must be #\<. This procedure scans enough of the input stream
786; to figure out what kind of a markup token it is seeing. The procedure returns
787; an xml-token structure describing the token. Note, generally reading
788; of the current markup is not finished! In particular, no attributes of
789; the start-tag token are scanned.
790;
791; Here's a detailed break out of the return values and the position in the PORT
792; when that particular value is returned:
793;       PI-token:       only PI-target is read.
794;                       To finish the Processing Instruction and disregard it,
795;                       call ssax:skip-pi. ssax:read-attributes may be useful
796;                       as well (for PIs whose content is attribute-value
797;                       pairs)
798;       END-token:      The end tag is read completely; the current position
799;                       is right after the terminating #\> character.   
800;       COMMENT         is read and skipped completely. The current position
801;                       is right after "-->" that terminates the comment.
802;       CDSECT          The current position is right after "<!CDATA["
803;                       Use ssax:read-cdata-body to read the rest.
804;       DECL            We have read the keyword (the one that follows "<!")
805;                       identifying this declaration markup. The current
806;                       position is after the keyword (usually a
807;                       whitespace character)
808;
809;       START-token     We have read the keyword (GI) of this start tag.
810;                       No attributes are scanned yet. We don't know if this
811;                       tag has an empty content either.
812;                       Use ssax:complete-start-tag to finish parsing of
813;                       the token.
814
815(define ssax:read-markup-token ; procedure ssax:read-markup-token port
816 (let ()
817                ; we have read "<!-". Skip through the rest of the comment
818                ; Return the 'COMMENT token as an indication we saw a comment
819                ; and skipped it.
820  (define (skip-comment port)
821    (assert-curr-char '(#\-) "XML [15], second dash" port)
822    (if (not (find-string-from-port? "-->" port))
823      (parser-error port "XML [15], no -->"))
824    (make-xml-token 'COMMENT #f))
825
826                ; we have read "<![" that must begin a CDATA section
827  (define (read-cdata port)
828    (assert (string=? "CDATA[" (read-string 6 port)))
829    (make-xml-token 'CDSECT #f))
830
831  (lambda (port)
832    (assert-curr-char '(#\<) "start of the token" port)
833    (case (peek-char port)
834      ((#\/) (read-char port)
835       (begin0 (make-xml-token 'END (ssax:read-QName port))
836               (ssax:skip-S port)
837               (assert-curr-char '(#\>) "XML [42]" port)))
838      ((#\?) (read-char port) (make-xml-token 'PI (ssax:read-NCName port)))
839      ((#\!)
840       (case (peek-next-char port)
841         ((#\-) (read-char port) (skip-comment port))
842         ((#\[) (read-char port) (read-cdata port))
843         (else (make-xml-token 'DECL (ssax:read-NCName port)))))
844      (else (make-xml-token 'START (ssax:read-QName port)))))
845))
846
847
848; The current position is inside a PI. Skip till the rest of the PI
849(define (ssax:skip-pi port)     
850  (if (not (find-string-from-port? "?>" port))
851    (parser-error port "Failed to find ?> terminating the PI")))
852
853
854; procedure: ssax:read-pi-body-as-string PORT
855; The current position is right after reading the PITarget. We read the
856; body of PI and return is as a string. The port will point to the
857; character right after '?>' combination that terminates PI.
858; [16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
859
860(define (ssax:read-pi-body-as-string port)
861  (ssax:skip-S port)            ; skip WS after the PI target name
862  (string-concatenate/shared
863    (let loop ()
864      (let ((pi-fragment
865             (next-token '() '(#\?) "reading PI content" port)))
866        (if (eqv? #\> (peek-next-char port))
867            (begin
868              (read-char port)
869              (cons pi-fragment '()))
870            (cons* pi-fragment "?" (loop)))))))
871
872(run-test
873 (assert (equal? "p1 content "
874    (call-with-input-string "<?pi1  p1 content ?>"
875      (lambda (port)
876        (ssax:read-markup-token port)
877        (ssax:read-pi-body-as-string port)))))
878 (assert (equal? "pi2? content? ?"
879    (call-with-input-string "<?pi2 pi2? content? ??>"
880      (lambda (port)
881        (ssax:read-markup-token port)
882        (ssax:read-pi-body-as-string port)))))
883)
884
885;(define (ssax:read-pi-body-as-name-values port)
886
887; procedure: ssax:skip-internal-dtd PORT
888; The current pos in the port is inside an internal DTD subset
889; (e.g., after reading #\[ that begins an internal DTD subset)
890; Skip until the "]>" combination that terminates this DTD
891(define (ssax:skip-internal-dtd port)     
892  (if (not (find-string-from-port? "]>" port))
893    (parser-error port
894                  "Failed to find ]> terminating the internal DTD subset")))
895
896
897; procedure+: ssax:read-cdata-body PORT STR-HANDLER SEED
898;
899; This procedure must be called after we have read a string "<![CDATA["
900; that begins a CDATA section. The current position must be the first
901; position of the CDATA body. This function reads _lines_ of the CDATA
902; body and passes them to a STR-HANDLER, a character data consumer.
903;
904; The str-handler is a STR-HANDLER, a procedure STRING1 STRING2 SEED.
905; The first STRING1 argument to STR-HANDLER never contains a newline.
906; The second STRING2 argument often will. On the first invocation of
907; the STR-HANDLER, the seed is the one passed to ssax:read-cdata-body
908; as the third argument. The result of this first invocation will be
909; passed as the seed argument to the second invocation of the line
910; consumer, and so on. The result of the last invocation of the
911; STR-HANDLER is returned by the ssax:read-cdata-body.  Note a
912; similarity to the fundamental 'fold' iterator.
913;
914; Within a CDATA section all characters are taken at their face value,
915; with only three exceptions:
916;       CR, LF, and CRLF are treated as line delimiters, and passed
917;       as a single #\newline to the STR-HANDLER
918;       "]]>" combination is the end of the CDATA section.
919;       &gt; is treated as an embedded #\> character
920; Note, &lt; and &amp; are not specially recognized (and are not expanded)!
921
922(define ssax:read-cdata-body 
923  (let ((cdata-delimiters (list char-return #\newline #\] #\&)))
924
925    (lambda (port str-handler seed)
926      (let loop ((seed seed))
927        (let ((fragment (next-token '() cdata-delimiters
928                                    "reading CDATA" port)))
929                        ; that is, we're reading the char after the 'fragment'
930     (case (read-char port)     
931       ((#\newline) (loop (str-handler fragment nl seed)))
932       ((#\])
933        (if (not (eqv? (peek-char port) #\]))
934            (loop (str-handler fragment "]" seed))
935            (let check-after-second-braket
936                ((seed (if (string-null? fragment) seed
937                           (str-handler fragment "" seed))))
938              (case (peek-next-char port)       ; after the second bracket
939                ((#\>) (read-char port) seed)   ; we have read "]]>"
940                ((#\]) (check-after-second-braket
941                        (str-handler "]" "" seed)))
942                (else (loop (str-handler "]]" "" seed)))))))
943       ((#\&)           ; Note that #\& within CDATA may stand for itself
944        (let ((ent-ref  ; it does not have to start an entity ref
945               (next-token-of (lambda (c) 
946                 (and (not (eof-object? c)) (char-alphabetic? c) c)) port)))
947          (cond         ; "&gt;" is to be replaced with #\>
948           ((and (string=? "gt" ent-ref) (eqv? (peek-char port) #\;))
949            (read-char port)
950            (loop (str-handler fragment ">" seed)))
951           (else
952            (loop 
953             (str-handler ent-ref ""
954                          (str-handler fragment "&" seed)))))))
955       (else            ; Must be CR: if the next char is #\newline, skip it
956         (if (eqv? (peek-char port) #\newline) (read-char port))
957         (loop (str-handler fragment nl seed)))
958       ))))))
959
960; a few lines of validation code
961(run-test (letrec
962  ((consumer (lambda (fragment foll-fragment seed)
963     (cons* (if (equal? foll-fragment (string #\newline))
964                " NL" foll-fragment) fragment seed)))
965   (test (lambda (str expected-result)
966           (newline) (display "body: ") (write str)
967           (newline) (display "Result: ")
968           (let ((result
969                   (reverse
970                     (call-with-input-string (unesc-string str)
971                       (lambda (port) (ssax:read-cdata-body port consumer '()))
972                       ))))
973             (write result)
974             (assert (equal? result expected-result)))))
975   )
976  (test "]]>" '())
977  (test "abcd]]>" '("abcd" ""))
978  (test "abcd]]]>" '("abcd" "" "]" ""))
979  (test "abcd]]]]>" '("abcd" "" "]" "" "]" ""))
980  (test "abcd]]]]]>" '("abcd" "" "]" "" "]" "" "]" ""))
981  (test "abcd]]]a]]>" '("abcd" "" "]" "" "]]" "" "a" ""))
982  (test "abc%r%ndef%n]]>" '("abc" " NL" "def" " NL"))
983  (test "%r%n%r%n]]>" '("" " NL" "" " NL"))
984  (test "%r%n%r%na]]>" '("" " NL" "" " NL" "a" ""))
985  (test "%r%r%r%na]]>" '("" " NL" "" " NL" "" " NL" "a" ""))
986  (test "abc&!!!]]>" '("abc" "&" "" "" "!!!" ""))
987  (test "abc]]&gt;&gt&amp;]]]&gt;and]]>"
988    '("abc" "" "]]" "" "" ">" "" "&" "gt" "" "" "&" "amp" "" ";" "" "]" ""
989      "]]" "" "" ">" "and" ""))
990))
991
992           
993; procedure+: ssax:read-char-ref PORT
994;
995; [66]  CharRef ::=  '&#' [0-9]+ ';'
996;                  | '&#x' [0-9a-fA-F]+ ';'
997;
998; This procedure must be called after we we have read "&#"
999; that introduces a char reference.
1000; The procedure reads this reference and returns the corresponding char
1001; The current position in PORT will be after ";" that terminates
1002; the char reference
1003; Faults detected:
1004;       WFC: XML-Spec.html#wf-Legalchar
1005;
1006; According to Section "4.1 Character and Entity References"
1007; of the XML Recommendation:
1008;  "[Definition: A character reference refers to a specific character
1009;   in the ISO/IEC 10646 character set, for example one not directly
1010;   accessible from available input devices.]"
1011; Therefore, we use a ucscode->char function to convert a character
1012; code into the character -- *regardless* of the current character
1013; encoding of the input stream.
1014
1015(define (ssax:read-char-ref port)
1016  (let* ((base
1017           (cond ((eqv? (peek-char port) #\x) (read-char port) 16)
1018                 (else 10)))
1019         (name (next-token '() '(#\;) "XML [66]" port))
1020         (char-code (string->number name base)))
1021    (read-char port)    ; read the terminating #\; char
1022    (if (integer? char-code) (ucscode->char char-code)
1023      (parser-error port "[wf-Legalchar] broken for '" name "'"))))
1024
1025
1026; procedure+: ssax:handle-parsed-entity PORT NAME ENTITIES
1027;               CONTENT-HANDLER STR-HANDLER SEED
1028;
1029; Expand and handle a parsed-entity reference
1030; port - a PORT
1031; name - the name of the parsed entity to expand, a symbol
1032; entities - see ENTITIES
1033; content-handler -- procedure PORT ENTITIES SEED
1034;       that is supposed to return a SEED
1035; str-handler - a STR-HANDLER. It is called if the entity in question
1036; turns out to be a pre-declared entity
1037;
1038; The result is the one returned by CONTENT-HANDLER or STR-HANDLER
1039; Faults detected:
1040;       WFC: XML-Spec.html#wf-entdeclared
1041;       WFC: XML-Spec.html#norecursion
1042
1043(define ssax:predefined-parsed-entities
1044  `(
1045    (,(string->symbol "amp") . "&")
1046    (,(string->symbol "lt") . "<")
1047    (,(string->symbol "gt") . ">")
1048    (,(string->symbol "apos") . "'")
1049    (,(string->symbol "quot") . "\"")))
1050
1051(define (ssax:handle-parsed-entity port name entities
1052                                   content-handler str-handler seed)
1053  (cond   ; First we check the list of the declared entities
1054   ((assq name entities) =>
1055    (lambda (decl-entity)
1056      (let ((ent-body (cdr decl-entity)) ; mark the list to prevent recursion
1057            (new-entities (cons (cons name #f) entities)))
1058        (cond
1059         ((string? ent-body)
1060          (call-with-input-string ent-body
1061             (lambda (port) (content-handler port new-entities seed))))
1062         ((procedure? ent-body)
1063          (let ((port (ent-body)))
1064            (begin0
1065             (content-handler port new-entities seed)
1066             (close-input-port port))))
1067         (else
1068          (parser-error port "[norecursion] broken for " name))))))
1069    ((assq name ssax:predefined-parsed-entities)
1070     => (lambda (decl-entity)
1071          (str-handler (cdr decl-entity) "" seed)))
1072    (else (parser-error port "[wf-entdeclared] broken for " name))))
1073
1074
1075
1076; procedure: make-empty-attlist
1077; The ATTLIST Abstract Data Type
1078; Currently is implemented as an assoc list sorted in the ascending
1079; order of NAMES.
1080
1081(define (make-empty-attlist) '())
1082
1083; procedure: attlist-add ATTLIST NAME-VALUE-PAIR
1084; Add a name-value pair to the existing attlist preserving the order
1085; Return the new list, in the sorted ascending order.
1086; Return #f if a pair with the same name already exists in the attlist
1087
1088(define (attlist-add attlist name-value)
1089  (if (null? attlist) (cons name-value attlist)
1090      (case (name-compare (car name-value) (caar attlist))
1091        ((=) #f)
1092        ((<) (cons name-value attlist))
1093        (else (cons (car attlist) (attlist-add (cdr attlist) name-value)))
1094        )))
1095
1096; procedure: attlist-null? ATTLIST
1097(define attlist-null? null?)
1098
1099; procedure: attlist-remove-top ATTLIST
1100; Given an non-null attlist, return a pair of values: the top and the rest
1101(define (attlist-remove-top attlist)
1102  (values (car attlist) (cdr attlist)))
1103
1104; procedure: attliast->alist
1105(define (attlist->alist attlist) attlist)
1106; procedure: attlist-fold
1107(define attlist-fold fold)
1108
1109; procedure+:   ssax:read-attributes PORT ENTITIES
1110;
1111; This procedure reads and parses a production Attribute*
1112; [41] Attribute ::= Name Eq AttValue
1113; [10] AttValue ::=  '"' ([^<&"] | Reference)* '"'
1114;                 | "'" ([^<&'] | Reference)* "'"
1115; [25] Eq ::= S? '=' S?
1116;
1117;
1118; The procedure returns an ATTLIST, of Name (as UNRES-NAME), Value (as string)
1119; pairs. The current character on the PORT is a non-whitespace character
1120; that is not an ncname-starting character.
1121;
1122; Note the following rules to keep in mind when reading an 'AttValue'
1123; "Before the value of an attribute is passed to the application
1124; or checked for validity, the XML processor must normalize it as follows:
1125; - a character reference is processed by appending the referenced
1126;   character to the attribute value
1127; - an entity reference is processed by recursively processing the
1128;   replacement text of the entity [see ENTITIES]
1129;   [named entities amp lt gt quot apos are assumed pre-declared]
1130; - a whitespace character (#x20, #xD, #xA, #x9) is processed by appending #x20
1131;   to the normalized value, except that only a single #x20 is appended for a
1132;   "#xD#xA" sequence that is part of an external parsed entity or the
1133;   literal entity value of an internal parsed entity
1134; - other characters are processed by appending them to the normalized value "
1135;
1136;
1137; Faults detected:
1138;       WFC: XML-Spec.html#CleanAttrVals
1139;       WFC: XML-Spec.html#uniqattspec
1140
1141(define ssax:read-attributes  ; ssax:read-attributes port entities
1142 (let ((value-delimeters (append ssax:S-chars '(#\< #\&))))
1143                ; Read the AttValue from the PORT up to the delimiter
1144                ; (which can be a single or double-quote character,
1145                ; or even a symbol *eof*)
1146                ; 'prev-fragments' is the list of string fragments, accumulated
1147                ; so far, in reverse order.
1148                ; Return the list of fragments with newly read fragments
1149                ; prepended.
1150  (define (read-attrib-value delimiter port entities prev-fragments)
1151    (let* ((new-fragments
1152            (cons (next-token '() (cons delimiter value-delimeters)
1153                              "XML [10]" port)
1154             prev-fragments))
1155           (cterm (read-char port)))
1156      (cond
1157        ((or (eof-object? cterm) (eqv? cterm delimiter))
1158          new-fragments)
1159        ((eqv? cterm char-return)       ; treat a CR and CRLF as a LF
1160          (if (eqv? (peek-char port) #\newline) (read-char port))
1161          (read-attrib-value delimiter port entities
1162                             (cons " " new-fragments)))
1163        ((memv cterm ssax:S-chars)
1164          (read-attrib-value delimiter port entities
1165                             (cons " " new-fragments)))
1166        ((eqv? cterm #\&)
1167          (cond
1168            ((eqv? (peek-char port) #\#)
1169              (read-char port)
1170              (read-attrib-value delimiter port entities
1171                (cons (string (ssax:read-char-ref port)) new-fragments)))
1172            (else
1173              (read-attrib-value delimiter port entities
1174                (read-named-entity port entities new-fragments)))))
1175        (else (parser-error port "[CleanAttrVals] broken")))))
1176
1177                ; we have read "&" that introduces a named entity reference.
1178                ; read this reference and return the result of
1179                ; normalizing of the corresponding string
1180                ; (that is, read-attrib-value is applied to the replacement
1181                ; text of the entity)
1182                ; The current position will be after ";" that terminates
1183                ; the entity reference
1184  (define (read-named-entity port entities fragments)
1185    (let ((name (ssax:read-NCName port)))
1186      (assert-curr-char '(#\;) "XML [68]" port)
1187      (ssax:handle-parsed-entity port name entities
1188        (lambda (port entities fragments)
1189          (read-attrib-value '*eof* port entities fragments))
1190        (lambda (str1 str2 fragments)
1191          (if (equal? "" str2) (cons str1 fragments)
1192              (cons* str2 str1 fragments)))
1193        fragments)))
1194
1195  (lambda (port entities)
1196    (let loop ((attr-list (make-empty-attlist)))
1197      (if (not (ssax:ncname-starting-char? (ssax:skip-S port))) attr-list
1198          (let ((name (ssax:read-QName port)))
1199            (ssax:skip-S port)
1200            (assert-curr-char '(#\=) "XML [25]" port)
1201            (ssax:skip-S port)
1202            (let ((delimiter 
1203                   (assert-curr-char '(#\' #\" ) "XML [10]" port)))
1204              (loop 
1205               (or (attlist-add attr-list 
1206                     (cons name 
1207                           (string-concatenate-reverse/shared
1208                             (read-attrib-value delimiter port entities
1209                                                      '()))))
1210                   (parser-error port "[uniqattspec] broken for " name))))))))
1211))
1212
1213; a few lines of validation code
1214(run-test (letrec
1215    ((test (lambda (str decl-entities expected-res)
1216             (newline) (display "input: ") (write str)
1217             (newline) (display "Result: ")
1218             (let ((result
1219                     (call-with-input-string (unesc-string str)
1220                       (lambda (port)
1221                         (ssax:read-attributes port decl-entities)))))
1222               (write result) (newline)
1223               (assert (equal? result expected-res))))))
1224    (test "" '() '())
1225    (test "href='http://a%tb%r%n%r%n%nc'" '()
1226          `((,(string->symbol "href") . "http://a b   c")))
1227    (test "href='http://a%tb%r%r%n%rc'" '()
1228          `((,(string->symbol "href") . "http://a b   c")))
1229    (test "_1 ='12&amp;' _2= \"%r%n%t12&#10;3\">" '()
1230          `((_1 . "12&") (_2 . ,(unesc-string "  12%n3"))))
1231    (test "%tAbc='&lt;&amp;&gt;&#x0A;'%nNext='12&ent;34' />" 
1232          '((ent . "&lt;xx&gt;"))
1233          `((,(string->symbol "Abc") . ,(unesc-string "<&>%n"))
1234            (,(string->symbol "Next") . "12<xx>34")))
1235    (test "%tAbc='&lt;&amp;&gt;&#x0d;'%nNext='12&ent;34' />" 
1236          '((ent . "&lt;xx&gt;"))
1237          `((,(string->symbol "Abc") . ,(unesc-string "<&>%r"))
1238            (,(string->symbol "Next") . "12<xx>34")))
1239    (test "%tAbc='&lt;&amp;&gt;&#x0A;'%nNext='12&en;34' />" 
1240          `((en . ,(lambda () (open-input-string "&quot;xx&apos;"))))
1241          `((,(string->symbol "Abc") . ,(unesc-string "<&>%n"))
1242            (,(string->symbol "Next") . "12\"xx'34")))
1243    (test "%tAbc='&lt;&amp;&gt;&#x0A;'%nNext='12&ent;34' />" 
1244          '((ent . "&lt;&ent1;T;&gt;") (ent1 . "&amp;"))
1245          `((,(string->symbol "Abc") . ,(unesc-string "<&>%n"))
1246            (,(string->symbol "Next") . "12<&T;>34")))
1247    (assert (failed?
1248        (test "%tAbc='&lt;&amp;&gt;&#x0A;'%nNext='12&ent;34' />" 
1249          '((ent . "<&ent1;T;&gt;") (ent1 . "&amp;")) '())))
1250    (assert (failed?
1251        (test "%tAbc='&lt;&amp;&gt;&#x0A;'%nNext='12&ent;34' />" 
1252          '((ent . "&lt;&ent;T;&gt;") (ent1 . "&amp;")) '())))
1253    (assert (failed?
1254        (test "%tAbc='&lt;&amp;&gt;&#x0A;'%nNext='12&ent;34' />" 
1255          '((ent . "&lt;&ent1;T;&gt;") (ent1 . "&ent;")) '())))
1256    (test "html:href='http://a%tb%r%n%r%n%nc'" '()
1257          `(((,(string->symbol "html") . ,(string->symbol "href"))
1258             . "http://a b   c")))
1259    (test "html:href='ref1' html:src='ref2'" '()
1260          `(((,(string->symbol "html") . ,(string->symbol "href"))
1261             . "ref1")
1262            ((,(string->symbol "html") . ,(string->symbol "src"))
1263             . "ref2")))
1264    (test "html:href='ref1' xml:html='ref2'" '()
1265          `(((,(string->symbol "html") . ,(string->symbol "href"))
1266             . "ref1")
1267            ((,ssax:Prefix-XML . ,(string->symbol "html"))
1268             . "ref2")))
1269    (assert (failed? (test "html:href='ref1' html:href='ref2'" '() '())))
1270    (assert (failed? (test "html:href='<' html:href='ref2'" '() '())))
1271    (assert (failed? (test "html:href='ref1' html:href='&ref2;'" '() '())))
1272))
1273
1274; ssax:resolve-name PORT UNRES-NAME NAMESPACES apply-default-ns?
1275;
1276; Convert an UNRES-NAME to a RES-NAME given the appropriate NAMESPACES
1277; declarations.
1278; the last parameter apply-default-ns? determines if the default
1279; namespace applies (for instance, it does not for attribute names)
1280;
1281; Per REC-xml-names/#nsc-NSDeclared, "xml" prefix is considered pre-declared
1282; and bound to the namespace name "http://www.w3.org/XML/1998/namespace".
1283;
1284; This procedure tests for the namespace constraints:
1285; http://www.w3.org/TR/REC-xml-names/#nsc-NSDeclared
1286
1287(define (ssax:resolve-name port unres-name namespaces apply-default-ns?)
1288  (cond
1289   ((pair? unres-name)          ; it's a QNAME
1290    (cons
1291     (cond
1292     ((assq (car unres-name) namespaces) => cadr)
1293     ((eq? (car unres-name) ssax:Prefix-XML) ssax:Prefix-XML)
1294     (else
1295      (parser-error port "[nsc-NSDeclared] broken; prefix " (car unres-name))))
1296     (cdr unres-name)))
1297   (apply-default-ns?           ; Do apply the default namespace, if any
1298    (let ((default-ns (assq '*DEFAULT* namespaces)))
1299      (if (and default-ns (cadr default-ns))
1300          (cons (cadr default-ns) unres-name)
1301          unres-name)))         ; no default namespace declared
1302   (else unres-name)))          ; no prefix, don't apply the default-ns
1303           
1304         
1305(run-test
1306 (let* ((namespaces
1307        '((HTML UHTML . URN-HTML)
1308          (HTML UHTML-1 . URN-HTML)
1309          (A    UHTML . URN-HTML)))
1310        (namespaces-def
1311         (cons
1312          '(*DEFAULT* DEF . URN-DEF) namespaces))
1313        (namespaces-undef
1314         (cons
1315          '(*DEFAULT* #f . #f) namespaces-def))
1316        (port (current-input-port)))
1317
1318   (assert (equal? 'ABC 
1319                   (ssax:resolve-name port 'ABC namespaces #t)))
1320   (assert (equal? '(DEF . ABC)
1321                   (ssax:resolve-name port 'ABC namespaces-def #t)))
1322   (assert (equal? 'ABC
1323                   (ssax:resolve-name port 'ABC namespaces-def #f)))
1324   (assert (equal? 'ABC
1325                   (ssax:resolve-name port 'ABC namespaces-undef #t)))
1326   (assert (equal? '(UHTML . ABC)
1327                   (ssax:resolve-name port '(HTML . ABC) namespaces-def #t)))
1328   (assert (equal? '(UHTML . ABC)
1329                   (ssax:resolve-name port '(HTML . ABC) namespaces-def #f)))
1330   (assert (equal? `(,ssax:Prefix-XML . space)
1331                   (ssax:resolve-name port 
1332                       `(,(string->symbol "xml") . space) namespaces-def #f)))
1333   (assert (failed?
1334                   (ssax:resolve-name port '(XXX . ABC) namespaces-def #f)))
1335))
1336
1337
1338; procedure+: ssax:uri-string->symbol URI-STR
1339; Convert a URI-STR to an appropriate symbol
1340(define (ssax:uri-string->symbol uri-str)
1341  (string->symbol uri-str))
1342
1343; procedure+: ssax:complete-start-tag TAG PORT ELEMS ENTITIES NAMESPACES
1344;
1345; This procedure is to complete parsing of a start-tag markup. The
1346; procedure must be called after the start tag token has been
1347; read. TAG is an UNRES-NAME. ELEMS is an instance of xml-decl::elems;
1348; it can be #f to tell the function to do _no_ validation of elements
1349; and their attributes.
1350;
1351; This procedure returns several values:
1352;  ELEM-GI: a RES-NAME.
1353;  ATTRIBUTES: element's attributes, an ATTLIST of (RES-NAME . STRING)
1354;       pairs. The list does NOT include xmlns attributes.
1355;  NAMESPACES: the input list of namespaces amended with namespace
1356;       (re-)declarations contained within the start-tag under parsing
1357;  ELEM-CONTENT-MODEL
1358
1359; On exit, the current position in PORT will be the first character after
1360; #\> that terminates the start-tag markup.
1361;
1362; Faults detected:
1363;       VC: XML-Spec.html#enum
1364;       VC: XML-Spec.html#RequiredAttr
1365;       VC: XML-Spec.html#FixedAttr
1366;       VC: XML-Spec.html#ValueType
1367;       WFC: XML-Spec.html#uniqattspec (after namespaces prefixes are resolved)
1368;       VC: XML-Spec.html#elementvalid
1369;       WFC: REC-xml-names/#dt-NSName
1370
1371; Note, although XML Recommendation does not explicitly say it,
1372; xmlns and xmlns: attributes don't have to be declared (although they
1373; can be declared, to specify their default value)
1374
1375; Procedure:  ssax:complete-start-tag tag-head port elems entities namespaces
1376(define ssax:complete-start-tag
1377
1378 (let ((xmlns (string->symbol "xmlns"))
1379       (largest-dummy-decl-attr (list ssax:largest-unres-name #f #f #f)))
1380
1381  ; Scan through the attlist and validate it, against decl-attrs
1382  ; Return an assoc list with added fixed or implied attrs.
1383  ; Note that both attlist and decl-attrs are ATTLISTs, and therefore,
1384  ; sorted
1385  (define (validate-attrs port attlist decl-attrs)
1386
1387    ; Check to see decl-attr is not of use type REQUIRED. Add
1388    ; the association with the default value, if any declared
1389    (define (add-default-decl decl-attr result)
1390      (let*-values
1391         (((attr-name content-type use-type default-value)
1392           (apply values decl-attr)))
1393         (and (eq? use-type 'REQUIRED)
1394              (parser-error port "[RequiredAttr] broken for" attr-name))
1395         (if default-value
1396             (cons (cons attr-name default-value) result)
1397             result)))
1398
1399    (let loop ((attlist attlist) (decl-attrs decl-attrs) (result '()))
1400      (if (attlist-null? attlist)
1401          (attlist-fold add-default-decl result decl-attrs)
1402          (let*-values
1403           (((attr attr-others)
1404             (attlist-remove-top attlist))
1405            ((decl-attr other-decls)
1406             (if (attlist-null? decl-attrs)
1407                 (values largest-dummy-decl-attr decl-attrs)
1408                 (attlist-remove-top decl-attrs)))
1409            )
1410           (case (name-compare (car attr) (car decl-attr))
1411             ((<) 
1412              (if (or (eq? xmlns (car attr))
1413                      (and (pair? (car attr)) (eq? xmlns (caar attr))))
1414                  (loop attr-others decl-attrs (cons attr result))
1415                  (parser-error port "[ValueType] broken for " attr)))
1416             ((>) 
1417              (loop attlist other-decls 
1418                    (add-default-decl decl-attr result)))
1419             (else      ; matched occurrence of an attr with its declaration
1420              (let*-values
1421               (((attr-name content-type use-type default-value)
1422                 (apply values decl-attr)))
1423               ; Run some tests on the content of the attribute
1424               (cond
1425                ((eq? use-type 'FIXED)
1426                 (or (equal? (cdr attr) default-value)
1427                     (parser-error port "[FixedAttr] broken for " attr-name)))
1428                ((eq? content-type 'CDATA) #t) ; everything goes
1429                ((pair? content-type)
1430                 (or (member (cdr attr) content-type)
1431                     (parser-error port "[enum] broken for " attr-name "="
1432                            (cdr attr))))
1433                (else
1434                 (ssax:warn port "declared content type " content-type
1435                       " not verified yet")))
1436               (loop attr-others other-decls (cons attr result)))))
1437           ))))
1438           
1439
1440  ; Add a new namespace declaration to namespaces.
1441  ; First we convert the uri-str to a uri-symbol and search namespaces for
1442  ; an association (_ user-prefix . uri-symbol).
1443  ; If found, we return the argument namespaces with an association
1444  ; (prefix user-prefix . uri-symbol) prepended.
1445  ; Otherwise, we prepend (prefix uri-symbol . uri-symbol)
1446  (define (add-ns port prefix uri-str namespaces)
1447    (and (equal? "" uri-str)
1448         (parser-error port "[dt-NSName] broken for " prefix))
1449    (let ((uri-symbol (ssax:uri-string->symbol uri-str)))
1450      (let loop ((nss namespaces))
1451        (cond
1452         ((null? nss)
1453          (cons (cons* prefix uri-symbol uri-symbol) namespaces))
1454         ((eq? uri-symbol (cddar nss))
1455          (cons (cons* prefix (cadar nss) uri-symbol) namespaces))
1456         (else (loop (cdr nss)))))))
1457     
1458  ; partition attrs into proper attrs and new namespace declarations
1459  ; return two values: proper attrs and the updated namespace declarations
1460  (define (adjust-namespace-decl port attrs namespaces)
1461    (let loop ((attrs attrs) (proper-attrs '()) (namespaces namespaces))
1462      (cond
1463       ((null? attrs) (values proper-attrs namespaces))
1464       ((eq? xmlns (caar attrs))        ; re-decl of the default namespace
1465        (loop (cdr attrs) proper-attrs 
1466              (if (equal? "" (cdar attrs))      ; un-decl of the default ns
1467                  (cons (cons* '*DEFAULT* #f #f) namespaces)
1468                  (add-ns port '*DEFAULT* (cdar attrs) namespaces))))
1469       ((and (pair? (caar attrs)) (eq? xmlns (caaar attrs)))
1470        (loop (cdr attrs) proper-attrs
1471              (add-ns port (cdaar attrs) (cdar attrs) namespaces)))
1472       (else
1473        (loop (cdr attrs) (cons (car attrs) proper-attrs) namespaces)))))
1474
1475    ; The body of the function
1476 (lambda (tag-head port elems entities namespaces)
1477  (let*-values
1478   (((attlist) (ssax:read-attributes port entities))
1479    ((empty-el-tag?)
1480     (begin
1481       (ssax:skip-S port)
1482       (and
1483        (eqv? #\/ 
1484              (assert-curr-char '(#\> #\/) "XML [40], XML [44], no '>'" port))
1485        (assert-curr-char '(#\>) "XML [44], no '>'" port))))
1486    ((elem-content decl-attrs)  ; see xml-decl for their type
1487     (if elems                  ; elements declared: validate!
1488         (cond
1489          ((assoc tag-head elems) =>
1490           (lambda (decl-elem)          ; of type xml-decl::decl-elem
1491             (values
1492              (if empty-el-tag? 'EMPTY-TAG (cadr decl-elem))
1493              (caddr decl-elem))))
1494          (else
1495           (parser-error port "[elementvalid] broken, no decl for " tag-head)))
1496         (values                ; non-validating parsing
1497          (if empty-el-tag? 'EMPTY-TAG 'ANY)
1498          #f)                   ; no attributes declared
1499         ))
1500    ((merged-attrs) (if decl-attrs (validate-attrs port attlist decl-attrs)
1501                      (attlist->alist attlist)))
1502    ((proper-attrs namespaces)
1503     (adjust-namespace-decl port merged-attrs namespaces))
1504    )
1505   ;(cerr "proper attrs: " proper-attrs nl)
1506   ; build the return value
1507   (values
1508    (ssax:resolve-name port tag-head namespaces #t)
1509    (fold-right
1510     (lambda (name-value attlist)
1511       (or
1512        (attlist-add attlist
1513           (cons (ssax:resolve-name port (car name-value) namespaces #f)
1514                 (cdr name-value)))
1515        (parser-error port "[uniqattspec] after NS expansion broken for " 
1516               name-value)))
1517     (make-empty-attlist)
1518     proper-attrs)
1519    namespaces
1520    elem-content)))))
1521
1522(run-test
1523 (let* ((urn-a (string->symbol "urn:a"))
1524        (urn-b (string->symbol "urn:b"))
1525        (urn-html (string->symbol "http://w3c.org/html"))
1526        (namespaces
1527         `((#f '"UHTML" . ,urn-html)
1528           ('"A"  '"UA" . ,urn-a)))
1529          (test
1530           (lambda (tag-head-name elems str)
1531             (call-with-input-string str
1532                (lambda (port)
1533                  (call-with-values
1534                      (lambda ()
1535                              (ssax:complete-start-tag
1536                               (call-with-input-string tag-head-name
1537                                      (lambda (port) (ssax:read-QName port)))
1538                               port
1539                               elems '() namespaces))
1540                    list))))))
1541
1542   ; First test with no validation of elements
1543   ;(test "TAG1" #f "")
1544   (assert (equal? `('"TAG1" () ,namespaces ANY)
1545                   (test "TAG1" #f ">")))
1546   (assert (equal? `('"TAG1" () ,namespaces EMPTY-TAG)
1547                   (test "TAG1" #f "/>")))
1548   (assert (equal? `('"TAG1" (('"HREF" . "a")) ,namespaces EMPTY-TAG)
1549                   (test "TAG1" #f "HREF='a'/>")))
1550   (assert (equal? `(('"UA" . '"TAG1") (('"HREF" . "a"))
1551                     ,(cons `(*DEFAULT* '"UA" . ,urn-a) namespaces) ANY)
1552                   (test "TAG1" #f "HREF='a' xmlns='urn:a'>")))
1553   (assert (equal? `('"TAG1" (('"HREF" . "a"))
1554                     ,(cons '(*DEFAULT* #f . #f) namespaces) ANY)
1555                   (test "TAG1" #f "HREF='a' xmlns=''>")))
1556   (assert (failed? (test "UA:TAG1" #f "HREF='a' xmlns=''/>")))
1557   (assert (equal? `(('"UA" . '"TAG1") ((('"UA" . '"HREF") . "a"))
1558                     ,(cons '(*DEFAULT* #f . #f) namespaces) ANY)
1559                   (test "A:TAG1" #f "A:HREF='a' xmlns=''>")))
1560   (assert (equal? `(('"UA" . '"TAG1") ((('"UA" . '"HREF") . "a"))
1561                     ,(cons `(*DEFAULT* ,urn-b . ,urn-b) namespaces) ANY)
1562                   (test "A:TAG1" #f "A:HREF='a' xmlns='urn:b'>")))
1563   (assert (failed? (test "B:TAG1" #f "A:HREF='a' xmlns:b=''/>")))
1564   (assert (equal? `((,urn-b . '"TAG1") ((('"UA" . '"HREF") . "a"))
1565                     ,(cons `('"B" ,urn-b . ,urn-b) namespaces) ANY)
1566                   (test "B:TAG1" #f "A:HREF='a' xmlns:B='urn:b'>")))
1567   (assert (equal? `((,urn-b . '"TAG1") ((('"UA" . '"HREF") . "a")
1568                                         ((,urn-b . '"SRC") . "b"))
1569                     ,(cons `('"B" ,urn-b . ,urn-b) namespaces) ANY)
1570                   (test "B:TAG1" #f 
1571                         "B:SRC='b' A:HREF='a' xmlns:B='urn:b'>")))
1572   (assert (equal? `((,urn-b . '"TAG1") ((('"UA" . '"HREF") . "a")
1573                                         ((,urn-b . '"HREF") . "b"))
1574                          ,(cons `('"B" ,urn-b . ,urn-b) namespaces) ANY)
1575                   (test "B:TAG1" #f 
1576                         "B:HREF=\"b\" A:HREF='a' xmlns:B='urn:b'>")))
1577   ; must be an error! Duplicate attr
1578   (assert (failed? (test "B:TAG1" #f
1579                          "HREF=\"b\" HREF='a' xmlns:B='urn:a'/>")))
1580   ; must be an error! Duplicate attr after ns expansion
1581   (assert (failed? (test "B:TAG1" #f 
1582                          "B:HREF=\"b\" A:HREF='a' xmlns:B='urn:a'/>")))
1583   (assert (equal? `(('"UA" . '"TAG1") (('"HREF" . "a")
1584                                        (('"UA" . '"HREF") . "b"))
1585                     ,(cons `(*DEFAULT* '"UA" . ,urn-a) namespaces) ANY)
1586                   (test "TAG1" #f 
1587                         "A:HREF=\"b\" HREF='a' xmlns='urn:a'>")))
1588   (assert (equal? `('"TAG1" ((('"UHTML" . '"HREF") . "a")
1589                              ((,urn-b . '"HREF") . "b"))
1590                     ,(append `(
1591                         ('"HTML" '"UHTML" . ,urn-html)
1592                         ('"B" ,urn-b . ,urn-b))
1593                              namespaces) ANY)
1594                   (test "TAG1" #f 
1595                         "B:HREF=\"b\" xmlns:B='urn:b' xmlns:HTML='http://w3c.org/html' HTML:HREF='a' >")))
1596
1597   ; Now test the validating parsing
1598   ; No decl for tag1
1599   (assert (failed? (test "TAG1" '((TAG2 ANY ()))
1600                          "B:HREF='b' xmlns:B='urn:b'>")))
1601   ; No decl for HREF elem
1602;;   (cond-expand
1603;;    ((not (or scm mit-scheme))        ; Regretfully, SCM treats '() as #f
1604;;     (assert (failed?
1605;;            (test "TAG1" '(('"TAG1" ANY ()))
1606;;                  "B:HREF='b' xmlns:B='urn:b'>"))))
1607;;    (else #t))
1608   ; No decl for HREF elem
1609   (assert (failed?
1610            (test "TAG1" '(('"TAG1" ANY (('"HREF1" CDATA IMPLIED #f))))
1611            "B:HREF='b' xmlns:B='urn:b'>")))
1612   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces EMPTY-TAG)
1613       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA REQUIRED #f))))
1614             "HREF='b'/>")))
1615   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces PCDATA)
1616       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA REQUIRED #f))))
1617             "HREF='b'>")))
1618   ; Req'd attribute not given error
1619   (assert (failed? 
1620            (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA REQUIRED #f))))
1621                  ">")))
1622   ; Wrong content-type of the attribute
1623   (assert (failed? 
1624       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" ("c") REQUIRED #f))))
1625             "HREF='b'>")))
1626   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces PCDATA)
1627       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" ("c" "b") IMPLIED #f))))
1628             "HREF='b'>")))
1629   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces PCDATA)
1630       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA IMPLIED "c"))))
1631             "HREF='b'>")))
1632   ; Bad fixed attribute
1633   (assert (failed? 
1634         (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA FIXED "c"))))
1635               "HREF='b'>")))
1636   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces PCDATA)
1637       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA FIXED "b"))))
1638             "HREF='b'>")))
1639   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces PCDATA)
1640       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA FIXED "b")))) ">")))
1641   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces PCDATA)
1642       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA IMPLIED "b")))) ">")))
1643   (assert (equal? `('"TAG1" () ,namespaces PCDATA)
1644       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA IMPLIED #f)))) ">")))
1645   ; Undeclared attr
1646   (assert (failed? 
1647        (test "TAG1"
1648              '(('"TAG1" PCDATA ((('"A" . '"HREF") CDATA IMPLIED "c"))))
1649              "HREF='b'>")))
1650   (assert (equal? `('"TAG1" (('"HREF" . "b") (('"UA" . '"HREF") . "c"))
1651                          ,namespaces PCDATA)
1652       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA REQUIRED #f)
1653                                       (('"A" . '"HREF") CDATA IMPLIED "c"))))
1654             "HREF='b'>")))
1655   (assert (equal? `(('"UA" . '"TAG1")
1656                     (('"HREF" . "b") (('"UA" . '"HREF") . "c"))
1657                     ,namespaces PCDATA)
1658       (test "A:TAG1" '((('"A" . '"TAG1") PCDATA
1659                         (('"HREF" NMTOKEN REQUIRED #f)
1660                          (('"A" . '"HREF") CDATA IMPLIED "c"))))
1661             "HREF='b'>")))
1662   (assert (equal? `((,urn-b . '"TAG1") (('"HREF" . "b"))
1663                     ,(cons `('"B" ,urn-b . ,urn-b) namespaces) PCDATA)
1664       (test "B:TAG1" '((('"B" . '"TAG1") PCDATA (('"HREF" CDATA REQUIRED #f)
1665                           (('"xmlns" . '"B") CDATA IMPLIED "urn:b"))))
1666             "HREF='b'>")))
1667   (assert (equal? `((,urn-b . '"TAG1") (((,urn-b . '"HREF") . "b"))
1668                          ,(cons `('"B" ,urn-b . ,urn-b) namespaces) PCDATA)
1669       (test "B:TAG1" '((('"B" . '"TAG1") PCDATA
1670                         ((('"B" . '"HREF") CDATA REQUIRED #f)
1671                          (('"xmlns" . '"B") CDATA IMPLIED "urn:b"))))
1672             "B:HREF='b'>")))
1673   (assert (equal? `((,urn-b . '"TAG1") (('"HREF" . "b"))
1674                     ,(cons `(*DEFAULT* ,urn-b . ,urn-b) namespaces) PCDATA)
1675       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA REQUIRED #f)
1676                           ('"xmlns" CDATA IMPLIED "urn:b"))))
1677             "HREF='b'>")))
1678   ; xmlns not declared
1679   (assert (equal? `((,urn-b . '"TAG1") (('"HREF" . "b"))
1680                     ,(cons `(*DEFAULT* ,urn-b . ,urn-b) namespaces) PCDATA)
1681       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA REQUIRED #f)
1682                           )))
1683             "HREF='b' xmlns='urn:b'>")))
1684   ; xmlns:B not declared
1685   (assert (equal? `((,urn-b . '"TAG1") (((,urn-b . '"HREF") . "b"))
1686                     ,(cons `('"B" ,urn-b . ,urn-b) namespaces) PCDATA)
1687       (test "B:TAG1" '((('"B" . '"TAG1") PCDATA
1688                         ((('"B" . '"HREF") CDATA REQUIRED #f)
1689                           )))
1690             "B:HREF='b' xmlns:B='urn:b'>")))
1691))
1692
1693; procedure+: ssax:read-external-id PORT
1694;
1695; This procedure parses an ExternalID production:
1696; [75] ExternalID ::= 'SYSTEM' S SystemLiteral
1697;               | 'PUBLIC' S PubidLiteral S SystemLiteral
1698; [11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
1699; [12] PubidLiteral ::=  '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
1700; [13] PubidChar ::=  #x20 | #xD | #xA | [a-zA-Z0-9]
1701;                         | [-'()+,./:=?;!*#@$_%]
1702;
1703; This procedure is supposed to be called when an ExternalID is expected;
1704; that is, the current character must be either #\S or #\P that start
1705; correspondingly a SYSTEM or PUBLIC token. This procedure returns the
1706; SystemLiteral as a string. A PubidLiteral is disregarded if present.
1707 
1708(define (ssax:read-external-id port)
1709  (let ((discriminator (ssax:read-NCName port)))
1710    (assert-curr-char ssax:S-chars "space after SYSTEM or PUBLIC" port)
1711    (ssax:skip-S port)
1712    (let ((delimiter 
1713          (assert-curr-char '(#\' #\" ) "XML [11], XML [12]" port)))
1714      (cond
1715        ((eq? discriminator (string->symbol "SYSTEM"))
1716          (begin0
1717            (next-token '() (list delimiter) "XML [11]" port)
1718            (read-char port)    ; reading the closing delim
1719            ))
1720         ((eq? discriminator (string->symbol "PUBLIC"))
1721           (skip-until (list delimiter) port)
1722           (assert-curr-char ssax:S-chars "space after PubidLiteral" port)
1723           (ssax:skip-S port)
1724           (let* ((delimiter 
1725                  (assert-curr-char '(#\' #\" ) "XML [11]" port))
1726                  (systemid
1727                    (next-token '() (list delimiter) "XML [11]" port)))
1728                (read-char port)        ; reading the closing delim
1729                systemid))
1730         (else
1731           (parser-error port "XML [75], " discriminator 
1732                  " rather than SYSTEM or PUBLIC"))))))
1733
1734
1735;-----------------------------------------------------------------------------
1736;                       Higher-level parsers and scanners
1737;
1738; They parse productions corresponding to the whole (document) entity
1739; or its higher-level pieces (prolog, root element, etc).
1740
1741
1742; Scan the Misc production in the context
1743; [1]  document ::=  prolog element Misc*
1744; [22] prolog ::= XMLDecl? Misc* (doctypedec l Misc*)?
1745; [27] Misc ::= Comment | PI |  S
1746;
1747; The following function should be called in the prolog or epilog contexts.
1748; In these contexts, whitespaces are completely ignored.
1749; The return value from ssax:scan-Misc is either a PI-token,
1750; a DECL-token, a START token, or EOF.
1751; Comments are ignored and not reported.
1752
1753(define (ssax:scan-Misc port)
1754  (let loop ((c (ssax:skip-S port)))
1755    (cond
1756      ((eof-object? c) c)
1757      ((not (char=? c #\<))
1758        (parser-error port "XML [22], char '" c "' unexpected"))
1759      (else
1760        (let ((token (ssax:read-markup-token port)))
1761          (case (xml-token-kind token)
1762            ((COMMENT) (loop (ssax:skip-S port)))
1763            ((PI DECL START) token)
1764            (else
1765              (parser-error port "XML [22], unexpected token of kind "
1766                     (xml-token-kind token)
1767                     ))))))))
1768
1769; procedure+: ssax:read-char-data PORT EXPECT-EOF? STR-HANDLER SEED
1770;
1771; This procedure is to read the character content of an XML document
1772; or an XML element.
1773; [43] content ::=
1774;       (element | CharData | Reference | CDSect | PI
1775;       | Comment)*
1776; To be more precise, the procedure reads CharData, expands CDSect
1777; and character entities, and skips comments. The procedure stops
1778; at a named reference, EOF, at the beginning of a PI or a start/end tag.
1779;
1780; port
1781;       a PORT to read
1782; expect-eof?
1783;       a boolean indicating if EOF is normal, i.e., the character
1784;       data may be terminated by the EOF. EOF is normal
1785;       while processing a parsed entity.
1786; str-handler
1787;       a STR-HANDLER
1788; seed
1789;       an argument passed to the first invocation of STR-HANDLER.
1790;
1791; The procedure returns two results: SEED and TOKEN.
1792; The SEED is the result of the last invocation of STR-HANDLER, or the
1793; original seed if STR-HANDLER was never called.
1794;
1795; TOKEN can be either an eof-object (this can happen only if
1796; expect-eof? was #t), or:
1797;     - an xml-token describing a START tag or an END-tag;
1798;       For a start token, the caller has to finish reading it.
1799;     - an xml-token describing the beginning of a PI. It's up to an
1800;       application to read or skip through the rest of this PI;
1801;     - an xml-token describing a named entity reference.
1802;
1803; CDATA sections and character references are expanded inline and
1804; never returned. Comments are silently disregarded.
1805;
1806; As the XML Recommendation requires, all whitespace in character data
1807; must be preserved. However, a CR character (#xD) must be disregarded
1808; if it appears before a LF character (#xA), or replaced by a #xA character
1809; otherwise. See Secs. 2.10 and 2.11 of the XML Recommendation. See also
1810; the canonical XML Recommendation.
1811
1812        ; ssax:read-char-data port expect-eof? str-handler seed
1813(define ssax:read-char-data
1814 (let
1815     ((terminators-usual (list #\< #\& char-return))
1816      (terminators-usual-eof (list #\< '*eof* #\& char-return))
1817
1818      (handle-fragment
1819       (lambda (fragment str-handler seed)
1820         (if (string-null? fragment) seed
1821             (str-handler fragment "" seed))))
1822      )
1823
1824   (lambda (port expect-eof? str-handler seed)
1825
1826     ; Very often, the first character we encounter is #\<
1827     ; Therefore, we handle this case in a special, fast path
1828     (if (eqv? #\< (peek-char port))
1829
1830         ; The fast path
1831         (let ((token (ssax:read-markup-token port)))
1832           (case (xml-token-kind token)
1833             ((START END)       ; The most common case
1834              (values seed token))
1835             ((CDSECT)
1836              (let ((seed (ssax:read-cdata-body port str-handler seed)))
1837                (ssax:read-char-data port expect-eof? str-handler seed)))
1838             ((COMMENT) (ssax:read-char-data port expect-eof?
1839                                             str-handler seed))
1840             (else
1841              (values seed token))))
1842
1843
1844         ; The slow path
1845         (let ((char-data-terminators
1846                (if expect-eof? terminators-usual-eof terminators-usual)))
1847
1848           (let loop ((seed seed))
1849             (let* ((fragment
1850                     (next-token '() char-data-terminators 
1851                                 "reading char data" port))
1852                    (term-char (peek-char port)) ; one of char-data-terminators
1853                    )
1854               (if (eof-object? term-char)
1855                   (values
1856                    (handle-fragment fragment str-handler seed)
1857                    term-char)
1858                   (case term-char
1859                     ((#\<)
1860                      (let ((token (ssax:read-markup-token port)))
1861                        (case (xml-token-kind token)
1862                          ((CDSECT)
1863                           (loop
1864                            (ssax:read-cdata-body port str-handler
1865                                (handle-fragment fragment str-handler seed))))
1866                          ((COMMENT)
1867                           (loop (handle-fragment fragment str-handler seed)))
1868                          (else
1869                           (values
1870                            (handle-fragment fragment str-handler seed)
1871                            token)))))
1872                     ((#\&)
1873                      (case (peek-next-char port)
1874                        ((#\#) (read-char port) 
1875                         (loop (str-handler fragment
1876                                       (string (ssax:read-char-ref port))
1877                                       seed)))
1878                        (else
1879                         (let ((name (ssax:read-NCName port)))
1880                           (assert-curr-char '(#\;) "XML [68]" port)
1881                           (values
1882                            (handle-fragment fragment str-handler seed)
1883                            (make-xml-token 'ENTITY-REF name))))))
1884                     (else              ; This must be a CR character
1885                      (if (eqv? (peek-next-char port) #\newline)
1886                          (read-char port))
1887                      (loop (str-handler fragment (string #\newline) seed))))
1888                   ))))))))
1889
1890
1891; a few lines of validation code
1892(run-test (letrec
1893  ((a-tag (make-xml-token 'START (string->symbol "BR")))
1894   (a-ref (make-xml-token 'ENTITY-REF (string->symbol "lt")))
1895   (eof-object (lambda () eof-object)) ; a unique value
1896   (str-handler (lambda (fragment foll-fragment seed)
1897     (if (string-null? foll-fragment) (cons fragment seed)
1898         (cons* foll-fragment fragment seed))))
1899   (test (lambda (str expect-eof? expected-data expected-token)
1900           (newline) (display "body: ") (write str)
1901           (newline) (display "Result: ")
1902          (let*-values
1903           (((seed token)
1904             (call-with-input-string (unesc-string str)
1905                (lambda (port)
1906                 (ssax:read-char-data port expect-eof? str-handler '()))))
1907            ((result) (reverse seed)))
1908           (write result)
1909           (display " ")
1910           (display token)
1911           (assert (equal? result (map unesc-string expected-data))
1912                   (if (eq? expected-token eof-object)
1913                     (eof-object? token)
1914                     (equal? token expected-token))))))
1915   )
1916  (test "" #t '() eof-object)
1917  (assert (failed? (test "" #f '() eof-object)))
1918  (test "  " #t '("  ") eof-object)
1919  (test "<BR/>" #f '() a-tag)
1920  (test " <BR  />" #f '(" ") a-tag)
1921
1922  (test " &lt;" #f '(" ") a-ref)
1923  (test " a&lt;" #f '(" a") a-ref)
1924  (test " a &lt;" #f '(" a ") a-ref)
1925
1926  (test " <!-- comment--> a  a<BR/>" #f '(" " " a  a") a-tag)
1927  (test " <!-- comment-->%ra  a<BR/>" #f '(" " "" "%n" "a  a") a-tag)
1928  (test " <!-- comment-->%r%na  a<BR/>" #f '(" " "" "%n" "a  a") a-tag)
1929  (test " <!-- comment-->%r%na%t%r%r%na<BR/>" #f
1930        '(" " "" "%n" "a%t" "%n" "" "%n" "a") a-tag)
1931  (test "a<!-- comment--> a  a<BR/>" #f '("a" " a  a") a-tag)
1932  (test "&#x21;<BR/>" #f '("" "!") a-tag)
1933  (test "&#x21;%n<BR/>" #f '("" "!" "%n") a-tag)
1934  (test "%t&#x21;%n<BR/>" #f '("%t" "!" "%n") a-tag)
1935  (test "%t&#x21;%na a<BR/>" #f '("%t" "!" "%na a") a-tag)
1936  (test "%t&#x21;%ra a<BR/>" #f '("%t" "!" "" "%n" "a a") a-tag)
1937  (test "%t&#x21;%r%na a<BR/>" #f '("%t" "!" "" "%n" "a a") a-tag)
1938
1939  (test " %ta &#x21;   b <BR/>" #f '(" %ta " "!" "   b ") a-tag)
1940  (test " %ta &#x20;   b <BR/>" #f '(" %ta " " " "   b ") a-tag)
1941
1942  (test "<![CDATA[<]]><BR/>" #f '("<") a-tag)
1943  (test "<![CDATA[]]]><BR/>" #f '("]") a-tag)
1944  (test "%t<![CDATA[<]]><BR/>" #f '("%t" "<") a-tag)
1945  (test "%t<![CDATA[<]]>a b<BR/>" #f '("%t" "<" "a b") a-tag)
1946  (test "%t<![CDATA[<]]>  a b<BR/>" #f '("%t" "<" "  a b") a-tag)
1947
1948  (test "%td <![CDATA[  <%r%r%n]]>  a b<BR/>" #f 
1949        '("%td " "  <" "%n" "" "%n" "  a b") a-tag)
1950))
1951
1952
1953
1954; procedure+: ssax:assert-token TOKEN KIND GI
1955; Make sure that TOKEN is of anticipated KIND and has anticipated GI
1956; Note GI argument may actually be a pair of two symbols, Namespace
1957; URI or the prefix, and of the localname.
1958; If the assertion fails, error-cont is evaluated by passing it
1959; three arguments: token kind gi. The result of error-cont is returned.
1960(define (ssax:assert-token token kind gi error-cont)
1961  (or
1962    (and (xml-token? token)
1963      (eq? kind (xml-token-kind token))
1964      (equal? gi (xml-token-head token)))
1965    (error-cont token kind gi)))
1966
1967;========================================================================
1968;               Highest-level parsers: XML to SXML
1969
1970; These parsers are a set of syntactic forms to instantiate a SSAX parser.
1971; A user can instantiate the parser to do the full validation, or
1972; no validation, or any particular validation. The user specifies
1973; which PI he wants to be notified about. The user tells what to do
1974; with the parsed character and element data. The latter handlers
1975; determine if the parsing follows a SAX or a DOM model.
1976
1977; syntax: ssax:make-pi-parser my-pi-handlers
1978; Create a parser to parse and process one Processing Element (PI).
1979
1980; my-pi-handlers
1981;       An assoc list of pairs (PI-TAG . PI-HANDLER)
1982;       where PI-TAG is an NCName symbol, the PI target, and
1983;       PI-HANDLER is a procedure PORT PI-TAG SEED
1984;       where PORT points to the first symbol after the PI target.
1985;       The handler should read the rest of the PI up to and including
1986;       the combination '?>' that terminates the PI. The handler should
1987;       return a new seed.
1988;       One of the PI-TAGs may be the symbol *DEFAULT*. The corresponding
1989;       handler will handle PIs that no other handler will. If the
1990;       *DEFAULT* PI-TAG is not specified, ssax:make-pi-parser will assume
1991;       the default handler that skips the body of the PI
1992;       
1993; The output of the ssax:make-pi-parser is a procedure
1994;       PORT PI-TAG SEED
1995; that will parse the current PI according to the user-specified handlers.
1996;
1997; The previous version of ssax:make-pi-parser was a low-level macro:
1998; (define-macro ssax:make-pi-parser
1999;   (lambda (my-pi-handlers)
2000;   `(lambda (port target seed)
2001;     (case target
2002;       ; Generate the body of the case statement
2003;       ,@(let loop ((pi-handlers my-pi-handlers) (default #f))
2004;        (cond
2005;         ((null? pi-handlers)
2006;          (if default `((else (,default port target seed)))
2007;              '((else
2008;                 (ssax:warn port "Skipping PI: " target nl)
2009;                 (ssax:skip-pi port)
2010;                 seed))))
2011;         ((eq? '*DEFAULT* (caar pi-handlers))
2012;          (loop (cdr pi-handlers) (cdar pi-handlers)))
2013;         (else
2014;          (cons
2015;           `((,(caar pi-handlers)) (,(cdar pi-handlers) port target seed))
2016;           (loop (cdr pi-handlers) default)))))))))
2017
2018(define-syntax ssax:make-pi-parser
2019  (syntax-rules ()
2020    ((ssax:make-pi-parser orig-handlers)
2021     (letrec-syntax
2022        ; Generate the clauses of the case statement
2023      ((loop
2024         (syntax-rules (*DEFAULT*)
2025           ((loop () #f accum port target seed)         ; no default
2026            (make-case 
2027              ((else
2028                 (ssax:warn port "Skipping PI: " target nl)
2029                 (ssax:skip-pi port)
2030                 seed)
2031                . accum)
2032              () target))
2033           ((loop () default accum port target seed)
2034            (make-case 
2035              ((else (default port target seed)) . accum)
2036              () target))
2037           ((loop ((*DEFAULT* . default) . handlers) old-def accum
2038              port target seed)
2039            (loop handlers default accum port target seed))
2040           ((loop ((tag . handler) . handlers) default accum port target seed)
2041            (loop handlers default
2042              (((tag) (handler port target seed)) . accum)
2043              port target seed))
2044           ))
2045        (make-case                      ; Reverse the clauses, make the 'case'
2046          (syntax-rules ()
2047            ((make-case () clauses target)
2048             (case target . clauses))
2049            ((make-case (clause . clauses) accum target)
2050             (make-case clauses (clause . accum) target)))
2051          ))
2052      (lambda (port target seed)
2053        (loop orig-handlers #f () port target seed))
2054       ))))
2055
2056(run-test
2057 (pp (ssax:make-pi-parser ()))
2058 (pp (ssax:make-pi-parser ((xml . (lambda (port target seed) seed)))))
2059 (pp (ssax:make-pi-parser ((xml . (lambda (port target seed) seed))
2060                           (html . list)
2061                           (*DEFAULT* . ssax:warn))))
2062)
2063
2064; syntax: ssax:make-elem-parser my-new-level-seed my-finish-element
2065;                               my-char-data-handler my-pi-handlers
2066
2067; Create a parser to parse and process one element, including its
2068; character content or children elements. The parser is typically
2069; applied to the root element of a document.
2070
2071; my-new-level-seed
2072;       procedure ELEM-GI ATTRIBUTES NAMESPACES EXPECTED-CONTENT SEED
2073;               where ELEM-GI is a RES-NAME of the element
2074;               about to be processed.
2075;       This procedure is to generate the seed to be passed
2076;       to handlers that process the content of the element.
2077;       This is the function identified as 'fdown' in the denotational
2078;       semantics of the XML parser given in the title comments to this
2079;       file.
2080;
2081; my-finish-element
2082;       procedure ELEM-GI ATTRIBUTES NAMESPACES PARENT-SEED SEED
2083;       This procedure is called when parsing of ELEM-GI is finished.
2084;       The SEED is the result from the last content parser (or
2085;       from my-new-level-seed if the element has the empty content).
2086;       PARENT-SEED is the same seed as was passed to my-new-level-seed.
2087;       The procedure is to generate a seed that will be the result
2088;       of the element parser.
2089;       This is the function identified as 'fup' in the denotational
2090;       semantics of the XML parser given in the title comments to this
2091;       file.
2092;
2093; my-char-data-handler
2094;       A STR-HANDLER
2095;
2096; my-pi-handlers
2097;       See ssax:make-pi-handler above
2098;
2099
2100; The generated parser is a
2101;       procedure START-TAG-HEAD PORT ELEMS ENTITIES
2102;       NAMESPACES PRESERVE-WS? SEED
2103; The procedure must be called after the start tag token has been
2104; read. START-TAG-HEAD is an UNRES-NAME from the start-element tag.
2105; ELEMS is an instance of xml-decl::elems.
2106; See ssax:complete-start-tag::preserve-ws?
2107
2108; Faults detected:
2109;       VC: XML-Spec.html#elementvalid
2110;       WFC: XML-Spec.html#GIMatch
2111
2112
2113(define-syntax ssax:make-elem-parser
2114  (syntax-rules ()
2115    ((ssax:make-elem-parser my-new-level-seed my-finish-element
2116                            my-char-data-handler my-pi-handlers)
2117 
2118   (lambda (start-tag-head port elems entities namespaces
2119                           preserve-ws? seed)
2120
2121     (define xml-space-gi (cons ssax:Prefix-XML
2122                                (string->symbol "space")))
2123
2124     (let handle-start-tag ((start-tag-head start-tag-head)
2125                            (port port) (entities entities)
2126                            (namespaces namespaces)
2127                            (preserve-ws? preserve-ws?) (parent-seed seed))
2128       (let*-values
2129        (((elem-gi attributes namespaces expected-content)
2130          (ssax:complete-start-tag start-tag-head port elems
2131                                   entities namespaces))
2132         ((seed)
2133          (my-new-level-seed elem-gi attributes
2134                              namespaces expected-content parent-seed)))
2135        (case expected-content
2136          ((EMPTY-TAG)
2137           (my-finish-element
2138            elem-gi attributes namespaces parent-seed seed))
2139          ((EMPTY)              ; The end tag must immediately follow
2140           (ssax:assert-token 
2141            (and (eqv? #\< (ssax:skip-S port)) (ssax:read-markup-token port))
2142            'END  start-tag-head
2143            (lambda (token exp-kind exp-head)
2144              (parser-error port "[elementvalid] broken for " token 
2145                     " while expecting "
2146                     exp-kind exp-head)))
2147           (my-finish-element
2148            elem-gi attributes namespaces parent-seed seed))
2149          (else         ; reading the content...
2150           (let ((preserve-ws?  ; inherit or set the preserve-ws? flag
2151                  (cond
2152                   ((assoc xml-space-gi attributes) =>
2153                    (lambda (name-value)
2154                      (equal? "preserve" (cdr name-value))))
2155                   (else preserve-ws?))))
2156             (let loop ((port port) (entities entities)
2157                        (expect-eof? #f) (seed seed))
2158               (let*-values
2159                (((seed term-token)
2160                  (ssax:read-char-data port expect-eof?
2161                                       my-char-data-handler seed)))
2162                (if (eof-object? term-token)
2163                    seed
2164                    (case (xml-token-kind term-token)
2165                      ((END)
2166                       (ssax:assert-token term-token 'END  start-tag-head
2167                          (lambda (token exp-kind exp-head)
2168                            (parser-error port "[GIMatch] broken for "
2169                                   term-token " while expecting "
2170                                   exp-kind exp-head)))
2171                       (my-finish-element
2172                        elem-gi attributes namespaces parent-seed seed))
2173                      ((PI)
2174                       (let ((seed 
2175                          ((ssax:make-pi-parser my-pi-handlers)
2176                           port (xml-token-head term-token) seed)))
2177                         (loop port entities expect-eof? seed)))
2178                      ((ENTITY-REF)
2179                       (let ((seed
2180                              (ssax:handle-parsed-entity
2181                               port (xml-token-head term-token)
2182                               entities
2183                               (lambda (port entities seed)
2184                                 (loop port entities #t seed))
2185                               my-char-data-handler
2186                               seed))) ; keep on reading the content after ent
2187                         (loop port entities expect-eof? seed)))
2188                      ((START)          ; Start of a child element
2189                       (if (eq? expected-content 'PCDATA)
2190                           (parser-error port "[elementvalid] broken for "
2191                                  elem-gi
2192                                  " with char content only; unexpected token "
2193                                  term-token))
2194                           ; Do other validation of the element content
2195                           (let ((seed
2196                                  (handle-start-tag
2197                                     (xml-token-head term-token)
2198                                     port entities namespaces
2199                                     preserve-ws? seed)))
2200                             (loop port entities expect-eof? seed)))
2201                      (else
2202                       (parser-error port "XML [43] broken for "
2203                                     term-token))))))))
2204          )))
2205))))
2206
2207
2208; syntax: ssax:make-parser user-handler-tag user-handler-proc ...
2209;
2210; Create an XML parser, an instance of the XML parsing framework.
2211; This will be a SAX, a DOM, or a specialized parser depending
2212; on the supplied user-handlers.
2213
2214; user-handler-tag is a symbol that identifies a procedural expression
2215; that follows the tag. Given below are tags and signatures of the
2216; corresponding procedures. Not all tags have to be specified. If some
2217; are omitted, reasonable defaults will apply.
2218;
2219
2220; tag: DOCTYPE
2221; handler-procedure: PORT DOCNAME SYSTEMID INTERNAL-SUBSET? SEED
2222; If internal-subset? is #t, the current position in the port
2223; is right after we have read #\[ that begins the internal DTD subset.
2224; We must finish reading of this subset before we return
2225; (or must call skip-internal-subset if we aren't interested in reading it).
2226; The port at exit must be at the first symbol after the whole
2227; DOCTYPE declaration.
2228; The handler-procedure must generate four values:
2229;       ELEMS ENTITIES NAMESPACES SEED
2230; See xml-decl::elems for ELEMS. It may be #f to switch off the validation.
2231; NAMESPACES will typically contain USER-PREFIXes for selected URI-SYMBs.
2232; The default handler-procedure skips the internal subset,
2233; if any, and returns (values #f '() '() seed)
2234
2235; tag: UNDECL-ROOT
2236; handler-procedure: ELEM-GI SEED
2237; where ELEM-GI is an UNRES-NAME of the root element. This procedure
2238; is called when an XML document under parsing contains _no_ DOCTYPE
2239; declaration.
2240; The handler-procedure, as a DOCTYPE handler procedure above,
2241; must generate four values:
2242;       ELEMS ENTITIES NAMESPACES SEED
2243; The default handler-procedure returns (values #f '() '() seed)
2244
2245; tag: DECL-ROOT
2246; handler-procedure: ELEM-GI SEED
2247; where ELEM-GI is an UNRES-NAME of the root element. This procedure
2248; is called when an XML document under parsing does contains the DOCTYPE
2249; declaration.
2250; The handler-procedure must generate a new SEED (and verify
2251; that the name of the root element matches the doctype, if the handler
2252; so wishes).
2253; The default handler-procedure is the identity function.
2254
2255; tag: NEW-LEVEL-SEED
2256; handler-procedure: see ssax:make-elem-parser, my-new-level-seed
2257
2258; tag: FINISH-ELEMENT
2259; handler-procedure: see ssax:make-elem-parser, my-finish-element
2260
2261; tag: CHAR-DATA-HANDLER
2262; handler-procedure: see ssax:make-elem-parser, my-char-data-handler
2263
2264; tag: PI
2265; handler-procedure: see ssax:make-pi-parser
2266; The default value is '()
2267 
2268; The generated parser is a
2269;       procedure PORT SEED
2270
2271; This procedure parses the document prolog and then exits to
2272; an element parser (created by ssax:make-elem-parser) to handle
2273; the rest.
2274;
2275; [1]  document ::=  prolog element Misc*
2276; [22] prolog ::= XMLDecl? Misc* (doctypedec | Misc*)?
2277; [27] Misc ::= Comment | PI |  S
2278;
2279; [28] doctypedecl ::=  '<!DOCTYPE' S Name (S ExternalID)? S?
2280;                       ('[' (markupdecl | PEReference | S)* ']' S?)? '>'
2281; [29] markupdecl ::= elementdecl | AttlistDecl
2282;                      | EntityDecl
2283;                      | NotationDecl | PI
2284;                      | Comment
2285;
2286
2287
2288; This is ssax:make-parser with all the (specialization) handlers given
2289; as positional arguments. It is called by ssax:make-parser, see below
2290(define-syntax ssax:make-parser/positional-args
2291  (syntax-rules ()
2292    ((ssax:make-parser/positional-args
2293       *handler-DOCTYPE
2294       *handler-UNDECL-ROOT
2295       *handler-DECL-ROOT
2296       *handler-NEW-LEVEL-SEED
2297       *handler-FINISH-ELEMENT
2298       *handler-CHAR-DATA-HANDLER
2299       *handler-PI)
2300  (lambda (port seed)
2301
2302     ; We must've just scanned the DOCTYPE token
2303     ; Handle the doctype declaration and exit to
2304     ; scan-for-significant-prolog-token-2, and eventually, to the
2305     ; element parser.
2306     (define (handle-decl port token-head seed)
2307       (or (eq? (string->symbol "DOCTYPE") token-head)
2308           (parser-error port "XML [22], expected DOCTYPE declaration, found "
2309                  token-head))
2310       (assert-curr-char ssax:S-chars "XML [28], space after DOCTYPE" port)
2311       (ssax:skip-S port)
2312       (let*-values
2313        (((docname) (ssax:read-QName port))
2314         ((systemid)
2315          (and (ssax:ncname-starting-char? (ssax:skip-S port))
2316               (ssax:read-external-id port)))
2317         ((internal-subset?)
2318          (begin (ssax:skip-S port)
2319            (eqv? #\[ (assert-curr-char '(#\> #\[)
2320                                        "XML [28], end-of-DOCTYPE" port))))
2321         ((elems entities namespaces seed)
2322          (*handler-DOCTYPE port docname systemid
2323                            internal-subset? seed))
2324         )
2325        (scan-for-significant-prolog-token-2 port elems entities namespaces
2326                                             seed)))
2327
2328
2329     ; Scan the leading PIs until we encounter either a doctype declaration
2330     ; or a start token (of the root element)
2331     ; In the latter two cases, we exit to the appropriate continuation
2332     (define (scan-for-significant-prolog-token-1 port seed)
2333       (let ((token (ssax:scan-Misc port)))
2334         (if (eof-object? token)
2335             (parser-error port "XML [22], unexpected EOF")
2336             (case (xml-token-kind token)
2337               ((PI)
2338                (let ((seed 
2339                       ((ssax:make-pi-parser *handler-PI)
2340                        port (xml-token-head token) seed)))
2341                  (scan-for-significant-prolog-token-1 port seed)))
2342               ((DECL) (handle-decl port (xml-token-head token) seed))
2343               ((START)
2344                (let*-values
2345                 (((elems entities namespaces seed)
2346                   (*handler-UNDECL-ROOT (xml-token-head token) seed)))
2347                 (element-parser (xml-token-head token) port elems
2348                                 entities namespaces #f seed)))
2349               (else (parser-error port "XML [22], unexpected markup "
2350                                   token))))))
2351
2352
2353     ; Scan PIs after the doctype declaration, till we encounter
2354     ; the start tag of the root element. After that we exit
2355     ; to the element parser
2356     (define (scan-for-significant-prolog-token-2 port elems entities
2357                                                  namespaces seed)
2358       (let ((token (ssax:scan-Misc port)))
2359         (if (eof-object? token)
2360             (parser-error port "XML [22], unexpected EOF")
2361             (case (xml-token-kind token)
2362               ((PI)
2363                (let ((seed 
2364                       ((ssax:make-pi-parser *handler-PI)
2365                        port (xml-token-head token) seed)))
2366                  (scan-for-significant-prolog-token-2 port elems entities
2367                                                       namespaces seed)))
2368               ((START)
2369                (element-parser (xml-token-head token) port elems
2370                  entities namespaces #f
2371                  (*handler-DECL-ROOT (xml-token-head token) seed)))
2372               (else (parser-error port "XML [22], unexpected markup "
2373                                   token))))))
2374
2375
2376     ; A procedure start-tag-head port elems entities namespaces
2377     ;           preserve-ws? seed
2378     (define element-parser
2379       (ssax:make-elem-parser *handler-NEW-LEVEL-SEED
2380                              *handler-FINISH-ELEMENT
2381                              *handler-CHAR-DATA-HANDLER
2382                              *handler-PI))
2383
2384     ; Get the ball rolling ...
2385     (scan-for-significant-prolog-token-1 port seed)
2386))))
2387
2388
2389
2390; The following meta-macro turns a regular macro (with positional
2391; arguments) into a form with keyword (labeled) arguments.  We later
2392; use the meta-macro to convert ssax:make-parser/positional-args into
2393; ssax:make-parser. The latter provides a prettier (with labeled
2394; arguments and defaults) interface to
2395; ssax:make-parser/positional-args
2396;
2397; ssax:define-labeled-arg-macro LABELED-ARG-MACRO-NAME
2398;               (POS-MACRO-NAME ARG-DESCRIPTOR ...)
2399; expands into the definition of a macro
2400;       LABELED-ARG-MACRO-NAME KW-NAME KW-VALUE KW-NAME1 KW-VALUE1 ...
2401; which, in turn, expands into
2402;       POS-MACRO-NAME ARG1 ARG2 ...
2403; where each ARG1 etc. comes either from KW-VALUE or from
2404; the deafult part of ARG-DESCRIPTOR. ARG1 corresponds to the first
2405; ARG-DESCRIPTOR, ARG2 corresponds to the second descriptor, etc.
2406; Here ARG-DESCRIPTOR describes one argument of the positional macro.
2407; It has the form
2408;       (ARG-NAME DEFAULT-VALUE)
2409; or
2410;       (ARG-NAME)
2411; In the latter form, the default value is not given, so that the invocation of
2412; LABELED-ARG-MACRO-NAME must mention the corresponding parameter.
2413; ARG-NAME can be anything: an identifier, a string, or even a number.
2414
2415
2416(define-syntax ssax:define-labeled-arg-macro
2417  (syntax-rules ()
2418    ((ssax:define-labeled-arg-macro
2419       labeled-arg-macro-name
2420       (positional-macro-name
2421         (arg-name . arg-def) ...))
2422      (define-syntax labeled-arg-macro-name
2423        (syntax-rules ()
2424          ((labeled-arg-macro-name . kw-val-pairs)
2425            (letrec-syntax
2426              ((find 
2427                 (syntax-rules (arg-name ...)
2428                   ((find k-args (arg-name . default) arg-name
2429                      val . others)        ; found arg-name among kw-val-pairs
2430                    (next val . k-args)) ...
2431                   ((find k-args key arg-no-match-name val . others)
2432                     (find k-args key . others))
2433                   ((find k-args (arg-name default)) ; default must be here
2434                     (next default . k-args)) ...
2435                   ))
2436                (next                   ; pack the continuation to find
2437                  (syntax-rules ()
2438                    ((next val vals key . keys)
2439                      (find ((val . vals) . keys) key . kw-val-pairs))
2440                    ((next val vals)    ; processed all arg-descriptors
2441                      (rev-apply (val) vals))))
2442                (rev-apply
2443                  (syntax-rules ()
2444                    ((rev-apply form (x . xs))
2445                      (rev-apply (x . form) xs))
2446                    ((rev-apply form ()) form))))
2447              (next positional-macro-name () 
2448                (arg-name . arg-def) ...))))))))
2449
2450
2451; The definition of ssax:make-parser
2452(ssax:define-labeled-arg-macro ssax:make-parser
2453  (ssax:make-parser/positional-args
2454    (DOCTYPE
2455      (lambda (port docname systemid internal-subset? seed)
2456        (when internal-subset?
2457          (ssax:warn port "Internal DTD subset is not currently handled ")
2458          (ssax:skip-internal-dtd port))
2459        (ssax:warn port "DOCTYPE DECL " docname " " 
2460          systemid " found and skipped")
2461        (values #f '() '() seed)
2462        ))
2463    (UNDECL-ROOT
2464      (lambda (elem-gi seed) (values #f '() '() seed)))
2465    (DECL-ROOT
2466      (lambda (elem-gi seed) seed))
2467    (NEW-LEVEL-SEED)            ; required
2468    (FINISH-ELEMENT)            ; required
2469    (CHAR-DATA-HANDLER)         ; required
2470    (PI ())
2471    ))
2472
2473(run-test
2474 (letrec ((simple-parser
2475           (lambda (str doctype-fn)
2476             (call-with-input-string str
2477                 (lambda (port)
2478                   ((ssax:make-parser
2479                     NEW-LEVEL-SEED 
2480                     (lambda (elem-gi attributes namespaces
2481                                      expected-content seed)
2482                       '())
2483   
2484                     FINISH-ELEMENT
2485                     (lambda (elem-gi attributes namespaces parent-seed seed)
2486                       (let
2487                           ((seed (if (null? namespaces) (reverse seed)
2488                                      (cons (list '*NAMESPACES* namespaces)
2489                                            (reverse seed)))))
2490                         (let ((seed (if (attlist-null? attributes) seed
2491                                         (cons
2492                                          (cons '@ 
2493                                           (map (lambda (attr)
2494                                              (list (car attr) (cdr attr)))
2495                                                (attlist->alist attributes)))
2496                                          seed))))
2497                           (cons (cons elem-gi seed) parent-seed))))
2498
2499                     CHAR-DATA-HANDLER
2500                     (lambda (string1 string2 seed)
2501                       (if (string-null? string2) (cons string1 seed)
2502                           (cons* string2 string1 seed)))
2503
2504                     DOCTYPE
2505                     (lambda (port docname systemid internal-subset? seed)
2506                       (when internal-subset?
2507                          (ssax:warn port
2508                            "Internal DTD subset is not currently handled ")
2509                          (ssax:skip-internal-dtd port))
2510                       (ssax:warn port "DOCTYPE DECL " docname " "
2511                             systemid " found and skipped")
2512                       (doctype-fn docname seed))
2513
2514                     UNDECL-ROOT
2515                     (lambda (elem-gi seed)
2516                       (doctype-fn elem-gi seed))
2517                     )
2518                    port '())))))
2519
2520          (dummy-doctype-fn (lambda (elem-gi seed) (values #f '() '() seed)))
2521          (test
2522           (lambda (str doctype-fn expected)
2523             (cout nl "Parsing: " str nl)
2524             (let ((result (simple-parser (unesc-string str) doctype-fn)))
2525               (write result)
2526               (assert (equal? result expected)))))
2527          )
2528
2529   (test "<BR/>" dummy-doctype-fn '(('"BR")))
2530   (assert (failed? (test "<BR>" dummy-doctype-fn '())))
2531   (test "<BR></BR>" dummy-doctype-fn '(('"BR")))
2532   (assert (failed? (test "<BR></BB>" dummy-doctype-fn '())))
2533
2534   (test "   <A HREF='URL'> link <I>itlink </I> &amp;amp;</A>"
2535         dummy-doctype-fn 
2536         '(('"A" (@ ('"HREF" "URL")) " link " ('"I" "itlink ")
2537            " " "&" "amp;")))
2538
2539   (test
2540      "   <A HREF='URL' xml:space='preserve'> link <I>itlink </I> &amp;amp;</A>" dummy-doctype-fn 
2541      '(('"A" (@ ('"HREF" "URL") (('"xml" . '"space") "preserve"))
2542           " link " ('"I" "itlink ") " " "&" "amp;")))
2543
2544   (test "   <A HREF='URL' xml:space='preserve'> link <I xml:space='default'>itlink </I> &amp;amp;</A>" dummy-doctype-fn
2545         '(('"A" (@ ('"HREF" "URL") (('"xml" . '"space") "preserve"))
2546              " link "
2547              ('"I" (@ (('"xml" . '"space") "default")) "itlink ")
2548              " " "&" "amp;")))
2549   (test "<itemize><item>This   is item 1 </item>%n<!-- Just:a comment --><item>Item 2</item>%n </itemize>" dummy-doctype-fn 
2550         `(('"itemize" ('"item" "This   is item 1 ")
2551            ,(unesc-string "%n") ('"item" "Item 2") ,(unesc-string "%n "))))
2552  (test " <P><![CDATA[<BR>%n<![CDATA[<BR>]]&gt;]]></P>"
2553        dummy-doctype-fn  `(('"P" "<BR>" ,nl "<![CDATA[<BR>" "]]" "" ">")))
2554
2555  (test " <P><![CDATA[<BR>%r<![CDATA[<BR>]]&gt;]]></P>"
2556        dummy-doctype-fn `(('"P" "<BR>" ,nl "<![CDATA[<BR>" "]]" "" ">")))
2557
2558  (test "<?xml version='1.0'?>%n%n<Reports TStamp='1'></Reports>"
2559        dummy-doctype-fn '(('"Reports" (@ ('"TStamp" "1")))))
2560  (test "%n<?PI xxx?><!-- Comment %n -%r-->%n<?PI1 zzz?><T/>" 
2561        dummy-doctype-fn '(('"T")))
2562  (test "<!DOCTYPE T SYSTEM 'system1' ><!-- comment -->%n<T/>"
2563        (lambda (elem-gi seed) (assert (equal? elem-gi ''"T"))
2564                (values #f '() '() seed))
2565        '(('"T")))
2566  (test "<!DOCTYPE T PUBLIC '//EN/T' \"system1\" [ <!ELEMENT a 'aa'> ]>%n<?pi?><T/>" 
2567        (lambda (elem-gi seed) (assert (equal? elem-gi ''"T"))
2568                (values #f '() '() seed))
2569        '(('"T")))
2570  (test "<BR/>"
2571        (lambda (elem-gi seed)
2572          (values '(('"BR" EMPTY ())) '() '() seed)) '(('"BR")))
2573  (test "<BR></BR>"
2574        (lambda (elem-gi seed)
2575          (values '(('"BR" EMPTY ())) '() '() seed)) '(('"BR")))
2576  (assert (failed? (test "<BR>aa</BR>"
2577        (lambda (elem-gi seed)
2578          (values '(('"BR" EMPTY ())) '() '() seed)) '())))
2579  (test "<BR>aa</BR>"
2580        (lambda (elem-gi seed)
2581          (values '(('"BR" PCDATA ())) '() '() seed)) '(('"BR" "aa")))
2582  (assert (failed? (test "<BR>a<I>a</I></BR>"
2583        (lambda (elem-gi seed)
2584          (values '(('"BR" PCDATA ())) '() '() seed)) '())))
2585  (test "<BR>a<I>a</I></BR>"
2586        (lambda (elem-gi seed)
2587          (values '(('"BR" ANY ()) ('"I" PCDATA ())) '() '() seed))
2588          '(('"BR" "a" ('"I" "a"))))
2589
2590
2591  (test "<DIV>Example: \"&example;\"</DIV>"
2592        (lambda (elem-gi seed)
2593          (values #f '((example . "<P>An    ampersand (&#38;) may   be escaped numerically (&#38;#38;) or with a general entity (&amp;amp;).</P>")) '() seed))
2594        '(('"DIV" "Example: \""
2595           ('"P" "An    ampersand (" "&" ") may   be escaped numerically (" "&" "#38;) or with a general entity (" "&" "amp;).") "\"")))
2596 (test "<DIV>Example: \"&example;\" <P/></DIV>"
2597        (lambda (elem-gi seed)
2598          (values #f '(('"quote" . "<I>example:</I> ex")
2599                       ('"example" . "<Q>&quote;!</Q>?")) '() seed))
2600          '(('"DIV" "Example: \"" ('"Q" ('"I" "example:") " ex" "!") "?"
2601                 "\" "  ('"P"))))
2602 (assert (failed?
2603   (test "<DIV>Example: \"&example;\" <P/></DIV>"
2604        (lambda (elem-gi seed)
2605          (values #f '(('"quote" . "<I>example:")
2606                       ('"example" . "<Q>&quote;</I>!</Q>?")) '() seed))
2607        '())))
2608
2609 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2610        (lambda (elem-gi seed)
2611          (values #f '() '() seed))
2612       '((('"URI1" . '"DIV") (@ ('"B" "B") (('"URI1" . '"B") "A"))
2613          (*NAMESPACES* (('"A" '"URI1" . '"URI1")
2614                         (*DEFAULT* '"URI1" . '"URI1")))
2615          (('"URI1" . '"P")
2616           (*NAMESPACES* ((*DEFAULT* #f . #f) ('"A" '"URI1" . '"URI1")
2617                          (*DEFAULT* '"URI1" . '"URI1")))
2618           ('"BR"
2619            (*NAMESPACES* ((*DEFAULT* #f . #f)
2620                           ('"A" '"URI1" . '"URI1")
2621                           (*DEFAULT* '"URI1" . '"URI1"))))))))
2622 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2623        (lambda (elem-gi seed)
2624          (values #f '() '((#f '"UA" . '"URI1")) seed))
2625       '((('"UA" . '"DIV") (@ ('"B" "B") (('"UA" . '"B") "A"))
2626          (*NAMESPACES* (('"A" '"UA" . '"URI1")
2627                         (*DEFAULT* '"UA" . '"URI1") (#f '"UA" . '"URI1")))
2628          (('"UA" . '"P")
2629           (*NAMESPACES* ((*DEFAULT* #f . #f) ('"A" '"UA" . '"URI1")
2630                          (*DEFAULT* '"UA" . '"URI1") (#f '"UA" . '"URI1")))
2631           ('"BR"
2632            (*NAMESPACES* ((*DEFAULT* #f . #f) ('"A" '"UA" . '"URI1")
2633                           (*DEFAULT* '"UA" . '"URI1")
2634                           (#f '"UA" . '"URI1"))))))))
2635 ; uniqattr should fail
2636 (assert (failed?
2637 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2638        (lambda (elem-gi seed)
2639          (values
2640           `(('"DIV" ANY (('"B" CDATA IMPLIED #f)
2641                       (('"A" . '"B") CDATA IMPLIED #f)
2642                       (('"C" . '"B") CDATA IMPLIED "xx")
2643                       (('"xmlns" . '"C") CDATA IMPLIED "URI1")
2644                       ))
2645             (('"A" . '"P") ANY ()) ('"BR" '"EMPTY" ()))
2646           '() '((#f '"UA" . '"URI1")) seed))
2647        '())))
2648 ; prefix C undeclared
2649 (assert (failed?
2650 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2651        (lambda (elem-gi seed)
2652          (values
2653           '(('"DIV" ANY (('"B" CDATA IMPLIED #f)
2654                       ('"xmlns"  CDATA IMPLIED "URI1")
2655                       (('"A" . '"B") CDATA IMPLIED #f)
2656                       (('"C" . '"B") CDATA IMPLIED "xx")
2657                       ))
2658             (('"A" . '"P") ANY ()) ('"BR" EMPTY ()))
2659           '() '((#f '"UA" . '"URI1")) seed))
2660        '())))
2661
2662 ; contradiction to xmlns declaration
2663 (assert (failed?
2664 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2665        (lambda (elem-gi seed)
2666          (values
2667           '(('"DIV" ANY (('"B" CDATA IMPLIED #f)
2668                       ('"xmlns"  CDATA FIXED "URI2")
2669                       (('"A" . '"B") CDATA IMPLIED #f)
2670                       ))
2671             (('"A" . '"P") ANY ()) ('"BR" EMPTY ()))
2672           '() '((#f '"UA" . '"URI1")) seed))
2673        '())))
2674
2675 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2676        (lambda (elem-gi seed)
2677          (values
2678           '(('"DIV" ANY (('"B" CDATA IMPLIED #f)
2679                       ('"xmlns"  CDATA FIXED "URI1")
2680                       (('"A" . '"B") CDATA IMPLIED #f)
2681                       ))
2682             (('"A" . '"P") ANY ()) ('"BR" EMPTY ()))
2683           '() '((#f '"UA" . '"URI1")) seed))
2684       '((('"UA" . '"DIV") (@ ('"B" "B") (('"UA" . '"B") "A"))
2685          (*NAMESPACES* ((*DEFAULT* '"UA" . '"URI1")
2686                         ('"A" '"UA" . '"URI1") (#f '"UA" . '"URI1")))
2687          (('"UA" . '"P")
2688           (*NAMESPACES* ((*DEFAULT* #f . #f) 
2689                          (*DEFAULT* '"UA" . '"URI1")
2690                          ('"A" '"UA" . '"URI1") (#f '"UA" . '"URI1")))
2691           ('"BR"
2692            (*NAMESPACES* ((*DEFAULT* #f . #f) (*DEFAULT* '"UA" . '"URI1")
2693                           ('"A" '"UA" . '"URI1") (#f '"UA" . '"URI1"))))))))
2694
2695 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2696        (lambda (elem-gi seed)
2697          (values
2698           '(('"DIV" ANY (('"B" CDATA IMPLIED #f)
2699                          (('"A" . '"B") CDATA IMPLIED #f)
2700                          (('"C" . '"B") CDATA IMPLIED "xx")
2701                          (('"xmlns" . '"C") CDATA IMPLIED "URI2")
2702                       ))
2703             (('"A" . '"P") ANY ()) ('"BR" EMPTY ()))
2704           '() '((#f '"UA" . '"URI1")) seed))
2705        '((('"UA" . '"DIV") (@ ('"B" "B") (('"UA" . '"B") "A")
2706                               (('"URI2" . '"B") "xx"))
2707           (*NAMESPACES* ((*DEFAULT* '"UA" . '"URI1")
2708                          ('"A" '"UA" . '"URI1")
2709                          ('"C" '"URI2" . '"URI2")
2710                          (#f '"UA" . '"URI1")))
2711           (('"UA" . '"P")
2712            (*NAMESPACES* ((*DEFAULT* #f . #f) (*DEFAULT* '"UA" . '"URI1")
2713                           ('"A" '"UA" . '"URI1")
2714                           ('"C" '"URI2" . '"URI2") (#f '"UA" . '"URI1")))
2715            ('"BR" 
2716             (*NAMESPACES* ((*DEFAULT* #f . #f)
2717                            (*DEFAULT* '"UA" . '"URI1")
2718                            ('"A" '"UA" . '"URI1")
2719                            ('"C" '"URI2" . '"URI2")
2720                            (#f '"UA" . '"URI1"))))))))
2721))
2722
2723   
2724
2725;========================================================================
2726;               Highest-level parsers: XML to SXML
2727;
2728
2729; First, a few utility procedures that turned out useful
2730
2731; procedure: ssax:reverse-collect-str LIST-OF-FRAGS -> LIST-OF-FRAGS
2732; given the list of fragments (some of which are text strings)
2733; reverse the list and concatenate adjacent text strings.
2734; We can prove from the general case below that if LIST-OF-FRAGS
2735; has zero or one element, the result of the procedure is equal?
2736; to its argument. This fact justifies the shortcut evaluation below.
2737(define (ssax:reverse-collect-str fragments)
2738  (cond
2739    ((null? fragments) '())     ; a shortcut
2740    ((null? (cdr fragments)) fragments) ; see the comment above
2741    (else
2742      (let loop ((fragments fragments) (result '()) (strs '()))
2743        (cond
2744          ((null? fragments)
2745            (if (null? strs) result
2746              (cons (string-concatenate/shared strs) result)))
2747          ((string? (car fragments))
2748            (loop (cdr fragments) result (cons (car fragments) strs)))
2749          (else
2750            (loop (cdr fragments)
2751              (cons
2752                (car fragments)
2753                (if (null? strs) result
2754                  (cons (string-concatenate/shared strs) result)))
2755              '())))))))
2756
2757
2758;     ssax:reverse-collect-str-drop-ws LIST-OF-FRAGS -> LIST-OF-FRAGS
2759; given the list of fragments (some of which are text strings)
2760; reverse the list and concatenate adjacent text strings.
2761; We also drop "unsignificant" whitespace, that is, whitespace
2762; in front, behind and between elements. The whitespace that
2763; is included in character data is not affected.
2764; We use this procedure to "intelligently" drop "insignificant"
2765; whitespace in the parsed SXML. If the strict compliance with
2766; the XML Recommendation regarding the whitespace is desired, please
2767; use the ssax:reverse-collect-str procedure instead.
2768
2769(define (ssax:reverse-collect-str-drop-ws fragments)
2770  (cond
2771    ((null? fragments) '())             ; a shortcut
2772    ((null? (cdr fragments))            ; another shortcut
2773     (if (and (string? (car fragments)) (string-whitespace? (car fragments)))
2774       '() fragments))                  ; remove trailing ws
2775    (else
2776      (let loop ((fragments fragments) (result '()) (strs '())
2777                  (all-whitespace? #t))
2778        (cond
2779          ((null? fragments)
2780            (if all-whitespace? result  ; remove leading ws
2781              (cons (string-concatenate/shared strs) result)))
2782          ((string? (car fragments))
2783            (loop (cdr fragments) result (cons (car fragments) strs)
2784              (and all-whitespace?
2785                (string-whitespace? (car fragments)))))
2786          (else
2787            (loop (cdr fragments)
2788              (cons
2789                (car fragments)
2790                (if all-whitespace? result
2791                  (cons (string-concatenate/shared strs) result)))
2792              '() #t)))))))
2793
2794
2795; procedure: ssax:xml->sxml PORT NAMESPACE-PREFIX-ASSIG
2796;
2797; This is an instance of a SSAX parser above that returns an SXML
2798; representation of the XML document to be read from PORT.
2799; NAMESPACE-PREFIX-ASSIG is a list of (USER-PREFIX . URI-STRING)
2800; that assigns USER-PREFIXes to certain namespaces identified by
2801; particular URI-STRINGs. It may be an empty list.
2802; The procedure returns an SXML tree. The port points out to the
2803; first character after the root element.
2804
2805(define (ssax:xml->sxml port namespace-prefix-assig)
2806  (letrec
2807      ((namespaces
2808        (map (lambda (el)
2809               (cons* #f (car el) (ssax:uri-string->symbol (cdr el))))
2810             namespace-prefix-assig))
2811
2812       (RES-NAME->SXML
2813        (lambda (res-name)
2814          (string->symbol
2815           (string-append
2816            (symbol->string (car res-name))
2817            ":"
2818            (symbol->string (cdr res-name))))))
2819
2820       )
2821    (let ((result
2822           (reverse
2823            ((ssax:make-parser
2824             NEW-LEVEL-SEED 
2825             (lambda (elem-gi attributes namespaces
2826                              expected-content seed)
2827               '())
2828   
2829             FINISH-ELEMENT
2830             (lambda (elem-gi attributes namespaces parent-seed seed)
2831               (let ((seed (ssax:reverse-collect-str-drop-ws seed))
2832                     (attrs
2833                      (attlist-fold
2834                       (lambda (attr accum)
2835                         (cons (list
2836                                (if (symbol? (car attr)) (car attr)
2837                                    (RES-NAME->SXML (car attr)))
2838                                (cdr attr)) accum))
2839                       '() attributes)))
2840                 (cons
2841                  (cons
2842                   (if (symbol? elem-gi) elem-gi
2843                       (RES-NAME->SXML elem-gi))
2844                   (if (null? attrs) seed
2845                       (cons (cons '@ attrs) seed)))
2846                  parent-seed)))
2847
2848             CHAR-DATA-HANDLER
2849             (lambda (string1 string2 seed)
2850               (if (string-null? string2) (cons string1 seed)
2851                   (cons* string2 string1 seed)))
2852
2853             DOCTYPE
2854             (lambda (port docname systemid internal-subset? seed)
2855               (when internal-subset?
2856                     (ssax:warn port
2857                           "Internal DTD subset is not currently handled ")
2858                     (ssax:skip-internal-dtd port))
2859               (ssax:warn port "DOCTYPE DECL " docname " "
2860                     systemid " found and skipped")
2861               (values #f '() namespaces seed))
2862
2863             UNDECL-ROOT
2864             (lambda (elem-gi seed)
2865               (values #f '() namespaces seed))
2866
2867             PI
2868             ((*DEFAULT* .
2869                (lambda (port pi-tag seed)
2870                  (cons
2871                   (list '*PI* pi-tag (ssax:read-pi-body-as-string port))
2872                   seed))))
2873             )
2874            port '()))))
2875      (cons '*TOP*
2876            (if (null? namespace-prefix-assig) result
2877                (cons
2878                 (list '@ (cons '*NAMESPACES* 
2879                                 (map (lambda (ns) (list (car ns) (cdr ns)))
2880                                      namespace-prefix-assig)))
2881                      result)))
2882)))
2883
2884; a few lines of validation code
2885(run-test (letrec
2886    ((test (lambda (str namespace-assig expected-res)
2887          (newline) (display "input: ")
2888          (write (unesc-string str)) (newline) (display "Result: ")
2889          (let ((result
2890                 (call-with-input-string (unesc-string str)
2891                     (lambda (port)
2892                       (ssax:xml->sxml port namespace-assig)))))
2893            (pp result)
2894            (assert (equal_? result expected-res))))))
2895
2896    (test " <BR/>" '() '(*TOP* (BR)))
2897    (test "<BR></BR>" '() '(*TOP* (BR)))
2898    (test " <BR CLEAR='ALL'%nCLASS='Class1'/>" '()
2899          '(*TOP* (BR (@ (CLEAR "ALL") (CLASS "Class1")))))
2900    (test "   <A HREF='URL'>  link <I>itlink </I> &amp;amp;</A>" '()
2901          '(*TOP* (A (@ (HREF "URL")) "  link " (I "itlink ") " &amp;")))
2902    (test "   <A HREF='URL' xml:space='preserve'>  link <I>itlink </I> &amp;amp;</A>" '()
2903          '(*TOP* (A (@ (xml:space "preserve") (HREF "URL"))
2904                     "  link " (I "itlink ") " &amp;")))
2905    (test "   <A HREF='URL' xml:space='preserve'>  link <I xml:space='default'>itlink </I> &amp;amp;</A>" '()
2906          '(*TOP* (A (@ (xml:space "preserve") (HREF "URL"))
2907                     "  link " (I (@ (xml:space "default"))
2908                                  "itlink ") " &amp;")))
2909    (test " <P><?pi1  p1 content ?>?<?pi2 pi2? content? ??></P>" '()
2910          '(*TOP* (P (*PI* pi1 "p1 content ") "?"
2911                     (*PI* pi2 "pi2? content? ?"))))
2912    (test " <P>some text <![CDATA[<]]>1%n&quot;<B>strong</B>&quot;%r</P>"
2913          '()
2914          `(*TOP* (P ,(unesc-string "some text <1%n\"")
2915                      (B "strong") ,(unesc-string "\"%n"))))
2916    (test " <P><![CDATA[<BR>%n<![CDATA[<BR>]]&gt;]]></P>" '()
2917          `(*TOP* (P ,(unesc-string "<BR>%n<![CDATA[<BR>]]>"))))
2918;    (test "<T1><T2>it&apos;s%r%nand   that%n</T2>%r%n%r%n%n</T1>" '()
2919;         '(*TOP* (T1 (T2 "it's%nand   that%n") "%n%n%n")))
2920    (test "<T1><T2>it&apos;s%r%nand   that%n</T2>%r%n%r%n%n</T1>" '()
2921          `(*TOP* (T1 (T2 ,(unesc-string "it's%nand   that%n")))))
2922    (test "<T1><T2>it&apos;s%rand   that%n</T2>%r%n%r%n%n</T1>" '()
2923          `(*TOP* (T1 (T2 ,(unesc-string "it's%nand   that%n")))))
2924    (test "<!DOCTYPE T SYSTEM 'system1' ><!-- comment -->%n<T/>" '()
2925          '(*TOP* (T)))
2926    (test "<?xml version='1.0'?>%n<WEIGHT unit=\"pound\">%n<NET certified='certified'> 67 </NET>%n<GROSS> 95 </GROSS>%n</WEIGHT>" '()
2927          '(*TOP* (*PI* xml "version='1.0'") (WEIGHT (@ (unit "pound"))
2928                (NET (@ (certified "certified")) " 67 ")
2929                (GROSS " 95 "))
2930                  ))
2931;     (test "<?xml version='1.0'?>%n<WEIGHT unit=\"pound\">%n<NET certified='certified'> 67 </NET>%n<GROSS> 95 </GROSS>%n</WEIGHT>" '()
2932;         '(*TOP* (*PI* xml "version='1.0'") (WEIGHT (@ (unit "pound"))
2933;                "%n" (NET (@ (certified "certified")) " 67 ")
2934;                "%n" (GROSS " 95 ") "%n")
2935;                 ))
2936    (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>" '()
2937          '(*TOP* (URI1:DIV (@ (URI1:B "A") (B "B")) (URI1:P (BR)))))
2938    (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>" '((UA . "URI1"))
2939          '(*TOP* (@ (*NAMESPACES* (UA "URI1")))
2940                  (UA:DIV (@ (UA:B "A") (B "B")) (UA:P (BR)))))
2941
2942    ; A few tests from XML Namespaces Recommendation
2943    (test (string-append
2944           "<x xmlns:edi='http://ecommerce.org/schema'>"
2945           "<!-- the 'taxClass' attribute's  ns http://ecommerce.org/schema -->"
2946           "<lineItem edi:taxClass='exempt'>Baby food</lineItem>" nl
2947           "</x>") '()
2948           '(*TOP* 
2949             (x (lineItem
2950                 (@ (http://ecommerce.org/schema:taxClass "exempt"))
2951            "Baby food"))))
2952    (test (string-append
2953           "<x xmlns:edi='http://ecommerce.org/schema'>"
2954           "<!-- the 'taxClass' attribute's  ns http://ecommerce.org/schema -->"
2955           "<lineItem edi:taxClass='exempt'>Baby food</lineItem>"
2956           "</x>") '((EDI . "http://ecommerce.org/schema"))
2957           '(*TOP*
2958             (@ (*NAMESPACES* (EDI "http://ecommerce.org/schema")))
2959             (x (lineItem
2960                 (@ (EDI:taxClass "exempt"))
2961            "Baby food"))))
2962
2963    (test (string-append
2964           "<bk:book xmlns:bk='urn:loc.gov:books' "
2965                     "xmlns:isbn='urn:ISBN:0-395-36341-6'>"
2966           "<bk:title>Cheaper by the Dozen</bk:title>"
2967           "<isbn:number>1568491379</isbn:number></bk:book>")
2968          '()
2969          '(*TOP* (urn:loc.gov:books:book
2970                   (urn:loc.gov:books:title "Cheaper by the Dozen")
2971                   (urn:ISBN:0-395-36341-6:number "1568491379"))))
2972
2973    (test (string-append
2974           "<!-- initially, the default namespace is 'books' -->"
2975           "<book xmlns='urn:loc.gov:books' "
2976           "xmlns:isbn='urn:ISBN:0-395-36341-6'>"
2977           "<title>Cheaper by the Dozen</title>"
2978           "<isbn:number>1568491379</isbn:number>"
2979           "<notes>"
2980           "<!-- make HTML the default namespace for some commentary -->"
2981           "<p xmlns='urn:w3-org-ns:HTML'>"
2982           "This is a <i>funny</i> book!"
2983            "</p>"
2984            "</notes>"
2985            "</book>") '()
2986            '(*TOP* (urn:loc.gov:books:book
2987                   (urn:loc.gov:books:title "Cheaper by the Dozen")
2988                   (urn:ISBN:0-395-36341-6:number "1568491379")
2989                   (urn:loc.gov:books:notes
2990                    (urn:w3-org-ns:HTML:p 
2991                     "This is a " (urn:w3-org-ns:HTML:i "funny")
2992                     " book!")))))
2993
2994    (test (string-append
2995           "<Beers>"
2996           "<!-- the default namespace is now that of HTML -->"
2997           "<table xmlns='http://www.w3.org/TR/REC-html40'>"
2998           "<th><td>Name</td><td>Origin</td><td>Description</td></th>"
2999           "<tr>"
3000           "<!-- no default namespace inside table cells -->"
3001           "<td><brandName xmlns=\"\">Huntsman</brandName></td>"
3002           "<td><origin xmlns=''>Bath, UK</origin></td>"
3003           "<td>"
3004              "<details xmlns=''><class>Bitter</class><hop>Fuggles</hop>"
3005              "<pro>Wonderful hop, light alcohol, good summer beer</pro>"
3006              "<con>Fragile; excessive variance pub to pub</con>"
3007              "</details>"
3008           "</td>"
3009           "</tr>"
3010           "</table>"
3011           "</Beers>")
3012              '((html . "http://www.w3.org/TR/REC-html40"))
3013              '(*TOP*
3014                (@ (*NAMESPACES* (html "http://www.w3.org/TR/REC-html40")))
3015                (Beers (html:table
3016                (html:th (html:td "Name")
3017                         (html:td "Origin")
3018                         (html:td "Description"))
3019                (html:tr (html:td (brandName "Huntsman"))
3020                         (html:td (origin "Bath, UK"))
3021                         (html:td 
3022                          (details 
3023                           (class "Bitter")
3024                        (hop "Fuggles")
3025                        (pro "Wonderful hop, light alcohol, good summer beer")
3026                        (con "Fragile; excessive variance pub to pub"))))))))
3027
3028    (test (string-append
3029       "<!-- 1 --><RESERVATION xmlns:HTML='http://www.w3.org/TR/REC-html40'>"
3030       "<!-- 2 --><NAME HTML:CLASS=\"largeSansSerif\">Layman, A</NAME>"
3031       "<!-- 3 --><SEAT CLASS='Y' HTML:CLASS=\"largeMonotype\">33B</SEAT>"
3032       "<!-- 4 --><HTML:A HREF='/cgi-bin/ResStatus'>Check Status</HTML:A>"
3033       "<!-- 5 --><DEPARTURE>1997-05-24T07:55:00+1</DEPARTURE></RESERVATION>")
3034          '((HTML . "http://www.w3.org/TR/REC-html40"))
3035          '(*TOP*
3036            (@ (*NAMESPACES* (HTML "http://www.w3.org/TR/REC-html40")))
3037             (RESERVATION
3038              (NAME (@ (HTML:CLASS "largeSansSerif")) "Layman, A")
3039              (SEAT (@ (HTML:CLASS "largeMonotype") (CLASS "Y")) "33B")
3040              (HTML:A (@ (HREF "/cgi-bin/ResStatus")) "Check Status")
3041              (DEPARTURE "1997-05-24T07:55:00+1"))))
3042    ; Part of RDF from the XML Infoset
3043        (test (string-concatenate/shared (list-intersperse '(
3044   "<?xml version='1.0' encoding='utf-8' standalone='yes'?>"
3045   "<!-- this can be decoded as US-ASCII or iso-8859-1 as well,"
3046   "  since it contains no characters outside the US-ASCII repertoire -->"
3047   "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
3048   "         xmlns:rdfs='http://www.w3.org/2000/01/rdf-schema#'"
3049   "          xmlns='http://www.w3.org/2001/02/infoset#'>"
3050   "<rdfs:Class ID='Boolean'/>"
3051   "<Boolean ID='Boolean.true'/>"
3052   "<Boolean ID='Boolean.false'/>"
3053   "<!--Info item classes-->"
3054   "<rdfs:Class ID='InfoItem'/>"
3055   "<rdfs:Class ID='Document' rdfs:subClassOf='#InfoItem'/>"
3056   "<rdfs:Class ID='Element' rdfs:subClassOf='#InfoItem'/>"
3057   "<rdfs:Class ID='Attribute' rdfs:subClassOf='#InfoItem'/>"
3058   "<rdfs:Class ID='InfoItemSet'
3059      rdfs:subClassOf='http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag'/>"
3060   "<rdfs:Class ID='AttributeSet' rdfs:subClassOf='#InfoItemSet'/>"
3061   "<!--Info item properties-->"
3062   "<rdfs:Property ID='allDeclarationsProcessed'>"
3063   "<rdfs:domain resource='#Document'/>"
3064   "<rdfs:range resource='#Boolean'/></rdfs:Property>"
3065   "<rdfs:Property ID='attributes'>"
3066   "<rdfs:domain resource='#Element'/>"
3067   "<rdfs:range resource='#AttributeSet'/>"
3068   "</rdfs:Property>"
3069   "</rdf:RDF>")
3070   (string #\newline)))
3071   '((RDF . "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
3072     (RDFS . "http://www.w3.org/2000/01/rdf-schema#")
3073     (ISET . "http://www.w3.org/2001/02/infoset#"))
3074   '(*TOP* (@ (*NAMESPACES*
3075         (RDF "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
3076         (RDFS "http://www.w3.org/2000/01/rdf-schema#")
3077         (ISET "http://www.w3.org/2001/02/infoset#")))
3078       (*PI* xml "version='1.0' encoding='utf-8' standalone='yes'")
3079       (RDF:RDF
3080        (RDFS:Class (@ (ID "Boolean")))
3081        (ISET:Boolean (@ (ID "Boolean.true")))
3082        (ISET:Boolean (@ (ID "Boolean.false")))
3083        (RDFS:Class (@ (ID "InfoItem")))
3084        (RDFS:Class (@ (RDFS:subClassOf "#InfoItem") (ID "Document")))
3085        (RDFS:Class (@ (RDFS:subClassOf "#InfoItem") (ID "Element")))
3086        (RDFS:Class (@ (RDFS:subClassOf "#InfoItem") (ID "Attribute")))
3087        (RDFS:Class
3088         (@ (RDFS:subClassOf
3089             "http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag")
3090            (ID "InfoItemSet")))
3091        (RDFS:Class
3092         (@ (RDFS:subClassOf "#InfoItemSet") (ID "AttributeSet")))
3093        (RDFS:Property
3094         (@ (ID "allDeclarationsProcessed"))
3095         (RDFS:domain (@ (resource "#Document")))
3096         (RDFS:range (@ (resource "#Boolean"))))
3097        (RDFS:Property
3098         (@ (ID "attributes"))
3099         (RDFS:domain (@ (resource "#Element")))
3100         (RDFS:range (@ (resource "#AttributeSet")))))))
3101         
3102    ; Part of RDF from RSS of the Daemon News Mall
3103        (test (string-concatenate/shared (list-intersperse '(
3104  "<?xml version='1.0'?><rdf:RDF "
3105    "xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' "
3106     "xmlns='http://my.netscape.com/rdf/simple/0.9/'>"
3107     "<channel>"
3108     "<title>Daemon News Mall</title>"
3109     "<link>http://mall.daemonnews.org/</link>"
3110     "<description>Central source for all your BSD needs</description>"
3111     "</channel>"
3112     "<item>"
3113     "<title>Daemon News Jan/Feb Issue NOW Available! Subscribe $24.95</title>"
3114     "<link>http://mall.daemonnews.org/?page=shop/flypage&amp;product_id=880</link>"
3115     "</item>"
3116     "<item>"
3117     "<title>The Design and Implementation of the 4.4BSD Operating System $54.95</title>"
3118     "<link>http://mall.daemonnews.org/?page=shop/flypage&amp;product_id=912&amp;category_id=1761</link>"
3119     "</item>"
3120     "</rdf:RDF>")
3121   (string #\newline)
3122   ))
3123   '((RDF . "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
3124     (RSS . "http://my.netscape.com/rdf/simple/0.9/")
3125     (ISET . "http://www.w3.org/2001/02/infoset#"))
3126   '(*TOP* (@ (*NAMESPACES*
3127         (RDF "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
3128         (RSS "http://my.netscape.com/rdf/simple/0.9/")
3129         (ISET "http://www.w3.org/2001/02/infoset#")))
3130       (*PI* xml "version='1.0'")
3131       (RDF:RDF (RSS:channel
3132                  (RSS:title "Daemon News Mall")
3133                  (RSS:link "http://mall.daemonnews.org/")
3134                  (RSS:description "Central source for all your BSD needs"))
3135                (RSS:item
3136                  (RSS:title
3137                    "Daemon News Jan/Feb Issue NOW Available! Subscribe $24.95")
3138                  (RSS:link
3139                    "http://mall.daemonnews.org/?page=shop/flypage&product_id=880"))
3140                (RSS:item
3141                  (RSS:title
3142                    "The Design and Implementation of the 4.4BSD Operating System $54.95")
3143                  (RSS:link
3144                    "http://mall.daemonnews.org/?page=shop/flypage&product_id=912&category_id=1761")))))
3145
3146    (test (string-concatenate/shared (list-intersperse 
3147       '("<Forecasts TStamp='958082142'>"
3148         "<TAF TStamp='958066200' LatLon='36.583, -121.850' BId='724915'"
3149         "  SName='KMRY, MONTEREY PENINSULA'>"
3150         "<VALID TRange='958068000, 958154400'>111730Z 111818</VALID>"
3151         "<PERIOD TRange='958068000, 958078800'>"
3152         "<PREVAILING>31010KT P6SM FEW030</PREVAILING>"
3153         "</PERIOD>"
3154         "<PERIOD TRange='958078800, 958104000' Title='FM2100'>"
3155         "<PREVAILING>29016KT P6SM FEW040</PREVAILING>"
3156         "</PERIOD>"
3157         "<PERIOD TRange='958104000, 958154400' Title='FM0400'>"
3158         "<PREVAILING>29010KT P6SM SCT200</PREVAILING>"
3159         "<VAR Title='BECMG 0708' TRange='958114800, 958118400'>VRB05KT</VAR>"
3160         "</PERIOD></TAF>"
3161         "</Forecasts>")
3162       (string #\newline)
3163       ))
3164          '()
3165          '(*TOP* (Forecasts
3166                   (@ (TStamp "958082142"))
3167                   (TAF (@ (TStamp "958066200")
3168                           (SName "KMRY, MONTEREY PENINSULA")
3169                           (LatLon "36.583, -121.850")
3170                           (BId "724915"))
3171              (VALID (@ (TRange "958068000, 958154400")) "111730Z 111818")
3172              (PERIOD (@ (TRange "958068000, 958078800"))
3173                      (PREVAILING "31010KT P6SM FEW030"))
3174              (PERIOD (@ (Title "FM2100") (TRange "958078800, 958104000"))
3175                      (PREVAILING "29016KT P6SM FEW040"))
3176              (PERIOD (@ (Title "FM0400") (TRange "958104000, 958154400"))
3177                      (PREVAILING "29010KT P6SM SCT200")
3178                      (VAR (@ (Title "BECMG 0708")
3179                              (TRange "958114800, 958118400"))
3180                           "VRB05KT"))))))
3181))
3182
3183(run-test
3184 (newline)
3185 (display "All tests passed")
3186 (newline)
3187)
Note: See TracBrowser for help on using the repository browser.