source: project/release/4/ssax/SSAX.scm @ 15197

Last change on this file since 15197 was 15197, checked in by Ivan Raikov, 11 years ago

applied ssax patch from ticket #58

File size: 122.1 KB
Line 
1;       Functional XML parsing framework: SAX/DOM and SXML parsers
2;             with support for XML Namespaces and validation
3;
4; This is a package of low-to-high level lexing and parsing procedures
5; that can be combined to yield a SAX, a DOM, a validating parsers, or
6; a parser intended for a particular document type. The procedures in
7; the package can be used separately to tokenize or parse various
8; pieces of XML documents. The package supports XML Namespaces,
9; internal and external parsed entities, user-controlled handling of
10; whitespace, and validation. This module therefore is intended to be
11; a framework, a set of "Lego blocks" you can use to build a parser
12; following any discipline and performing validation to any degree. As
13; an example of the parser construction, this file includes a
14; semi-validating SXML parser.
15
16; The present XML framework has a "sequential" feel of SAX yet a
17; "functional style" of DOM. Like a SAX parser, the framework scans
18; the document only once and permits incremental processing. An
19; application that handles document elements in order can run as
20; efficiently as possible. _Unlike_ a SAX parser, the framework does
21; not require an application register stateful callbacks and surrender
22; control to the parser. Rather, it is the application that can drive
23; the framework -- calling its functions to get the current lexical or
24; syntax element. These functions do not maintain or mutate any state
25; save the input port. Therefore, the framework permits parsing of XML
26; in a pure functional style, with the input port being a monad (or a
27; linear, read-once parameter).
28
29; Besides the PORT, there is another monad -- SEED. Most of the
30; middle- and high-level parsers are single-threaded through the
31; seed. The functions of this framework do not process or affect the
32; SEED in any way: they simply pass it around as an instance of an
33; opaque datatype.  User functions, on the other hand, can use the
34; seed to maintain user's state, to accumulate parsing results, etc. A
35; user can freely mix his own functions with those of the
36; framework. On the other hand, the user may wish to instantiate a
37; high-level parser: ssax:make-elem-parser or ssax:make-parser.  In
38; the latter case, the user must provide functions of specific
39; signatures, which are called at predictable moments during the
40; parsing: to handle character data, element data, or processing
41; instructions (PI). The functions are always given the SEED, among
42; other parameters, and must return the new SEED.
43
44; From a functional point of view, XML parsing is a combined
45; pre-post-order traversal of a "tree" that is the XML document
46; itself. This down-and-up traversal tells the user about an element
47; when its start tag is encountered. The user is notified about the
48; element once more, after all element's children have been
49; handled. The process of XML parsing therefore is a fold over the
50; raw XML document. Unlike a fold over trees defined in [1], the
51; parser is necessarily single-threaded -- obviously as elements
52; in a text XML document are laid down sequentially. The parser
53; therefore is a tree fold that has been transformed to accept an
54; accumulating parameter [1,2].
55
56; Formally, the denotational semantics of the parser can be expressed
57; as
58; parser:: (Start-tag -> Seed -> Seed) ->
59;          (Start-tag -> Seed -> Seed -> Seed) ->
60;          (Char-Data -> Seed -> Seed) ->
61;          XML-text-fragment -> Seed -> Seed
62; parser fdown fup fchar "<elem attrs> content </elem>" seed
63;  = fup "<elem attrs>" seed
64;       (parser fdown fup fchar "content" (fdown "<elem attrs>" seed))
65;
66; parser fdown fup fchar "char-data content" seed
67;  = parser fdown fup fchar "content" (fchar "char-data" seed)
68;
69; parser fdown fup fchar "elem-content content" seed
70;  = parser fdown fup fchar "content" (
71;       parser fdown fup fchar "elem-content" seed)
72
73; Compare the last two equations with the left fold
74; fold-left kons elem:list seed = fold-left kons list (kons elem seed)
75
76; The real parser created my ssax:make-parser is slightly more complicated,
77; to account for processing instructions, entity references, namespaces,
78; processing of document type declaration, etc.
79
80
81; The XML standard document referred to in this module is
82;       http://www.w3.org/TR/1998/REC-xml-19980210.html
83;
84; The present file also defines a procedure that parses the text of an
85; XML document or of a separate element into SXML, an
86; S-expression-based model of an XML Information Set. SXML is also an
87; Abstract Syntax Tree of an XML document. SXML is similar
88; but not identical to DOM; SXML is particularly suitable for
89; Scheme-based XML/HTML authoring, SXPath queries, and tree
90; transformations. See SXML.html for more details.
91; SXML is a term implementation of evaluation of the XML document [3].
92; The other implementation is context-passing.
93
94; The present frameworks fully supports the XML Namespaces Recommendation:
95;       http://www.w3.org/TR/REC-xml-names/
96; Other links:
97; [1] Jeremy Gibbons, Geraint Jones, "The Under-appreciated Unfold,"
98; Proc. ICFP'98, 1998, pp. 273-279.
99; [2] Richard S. Bird, The promotion and accumulation strategies in
100; transformational programming, ACM Trans. Progr. Lang. Systems,
101; 6(4):487-504, October 1984.
102; [3] Ralf Hinze, "Deriving Backtracking Monad Transformers,"
103; Functional Pearl. Proc ICFP'00, pp. 186-197.
104
105; IMPORT
106; parser-error ssax:warn, see Handling of errors, below
107; functions declared in files util.scm, input-parse.scm and look-for-str.scm
108; char-encoding.scm for various platform-specific character-encoding functions.
109; From SRFI-13: string-concatenate/shared and string-concatenate-reverse/shared
110; If a particular implementation lacks SRFI-13 support, please
111; include the file srfi-13-local.scm
112
113; Handling of errors
114; This package relies on a function parser-error, which must be defined
115; by a user of the package. The function has the following signature:
116;       parser-error PORT MESSAGE SPECIALISING-MSG*
117; Many procedures of this package call 'parser-error' whenever a
118; parsing, well-formedness or validation error is encountered. The
119; first argument is a port, which typically points to the offending
120; character or its neighborhood. Most of the Scheme systems let the
121; user query a PORT for the current position. The MESSAGE argument
122; indicates a failed XML production or a failed XML constraint. The
123; latter is referred to by its anchor name in the XML Recommendation
124; or XML Namespaces Recommendation. The parsing library (e.g.,
125; next-token, assert-curr-char) invoke 'parser-error' as well, in
126; exactly the same way.  See input-parse.scm for more details.
127; See
128;       http://pair.com/lisovsky/download/parse-error.scm
129; for an excellent example of such a redefined parser-error function.
130;
131; In addition, the present code invokes a function ssax:warn
132;   ssax:warn PORT MESSAGE SPECIALISING-MSG*
133; to notify the user about warnings that are NOT errors but still
134; may alert the user.
135;
136; Again, parser-error and ssax:warn are supposed to be defined by the
137; user. However, if a run-test macro below is set to include
138; self-tests, this present code does provide the definitions for these
139; functions to allow tests to run.
140
141; Misc notes
142; It seems it is highly desirable to separate tests out in a dedicated
143; file.
144;
145; Jim Bender wrote on Mon, 9 Sep 2002 20:03:42 EDT on the SSAX-SXML
146; mailing list (message A fine-grained "lego")
147; The task was to record precise source location information, as PLT
148; does with its current XML parser. That parser records the start and
149; end location (filepos, line#, column#) for pi, elements, attributes,
150; chuncks of "pcdata".
151; As suggested above, though, in some cases I needed to be able force
152; open an interface that did not yet exist. For instance, I added an
153; "end-char-data-hook", which would be called at the end of char-data
154; fragment. This returns a function of type (seed -> seed) which is
155; invoked on the current seed only if read-char-data has indeed reached
156; the end of a block of char data (after reading a new token.
157; But the deepest interface that I needed to expose was that of reading
158; attributes. In the official distribution, this is not even a separate
159; function. Instead, it is embedded within SSAX:read-attributes.  This
160; required some small re-structuring as well.
161; This definitely will not be to everyone's taste (nor needed by most).
162; Certainly, the existing make-parser interface addresses most custom
163; needs. And likely 80-90 lines of a "link specification" to create a
164; parser from many tiny little lego blocks may please only a few, while
165; appalling others.
166; The code is available at http://celtic.benderweb.net/ssax-lego.plt or
167; http://celtic.benderweb.net/ssax-lego.tar.gz
168; In the examples directory, I provide:
169; - a unit version of the make-parser interface,
170; - a simple SXML parser using that interface,
171; - an SXML parser which directly uses the "new lego",
172; - a pseudo-SXML parser, which records source location information
173; - and lastly a parser which returns the structures used in PLT's xml
174; collection, with source location information
175
176; $Id: SSAX.scm,v 5.4 2004/11/09 20:22:26 sperber Exp $
177;^^^^^^^^^
178
179
180        ; See the Makefile in the ../tests directory
181        ; (in particular, the rule vSSAX) for an example of how
182        ; to run this code on various Scheme systems.
183        ; See SSAX examples for many samples of using this code,
184        ; again, on a variety of Scheme systems.
185        ; See http://ssax.sf.net/
186
187
188; The following macro runs built-in test cases -- or does not run,
189; depending on which of the two cases below you commented out
190; Case 1: no tests:
191;(define-macro run-test (lambda body '(begin #f)))
192
193(define-syntax run-test (syntax-rules () ((run-test . args) (begin #f))))
194
195; Case 2: with tests.
196; The following macro could've been defined just as
197; (define-macro run-test (lambda body `(begin (display "\n-->Test\n") ,@body)))
198;
199; Instead, it's more involved, to make up for case-insensitivity of
200; symbols on some Scheme systems. In Gambit, symbols are case
201; sensitive: (eq? 'A 'a) is #f and (eq? 'Aa (string->symbol "Aa")) is
202; #t.  On some systems, symbols are case-insensitive and just the
203; opposite is true.  Therefore, we introduce a notation '"ASymbol" (a
204; quoted string) that stands for a case-_sensitive_ ASymbol -- on any
205; R5RS Scheme system. This notation is valid only within the body of
206; run-test.
207; The notation is implemented by scanning the run-test's
208; body and replacing every occurrence of (quote "str") with the result
209; of (string->symbol "str"). We can do such a replacement at macro-expand
210; time (rather than at run time).
211
212; Here's the previous version of run-test, implemented as a low-level
213; macro.
214; (define-macro run-test
215;   (lambda body
216;     (define (re-write body)
217;       (cond
218;        ((vector? body)
219;       (list->vector (re-write (vector->list body))))
220;        ((not (pair? body)) body)
221;        ((and (eq? 'quote (car body)) (pair? (cdr body))
222;            (string? (cadr body)))
223;       (string->symbol (cadr body)))
224;        (else (cons (re-write (car body)) (re-write (cdr body))))))
225;     (cons 'begin (re-write body))))
226;
227; For portability, it is re-written as syntax-rules. The syntax-rules
228; version is less powerful: for example, it can't handle
229; (case x (('"Foo") (do-on-Foo))) whereas the low-level macro
230; could correctly place a case-sensitive symbol at the right place.
231; We also do not scan vectors (because we don't use them here).
232; Twice-deep quasiquotes aren't handled either.
233; Still, the syntax-rules version satisfies our immediate needs.
234; Incidentally, I originally didn't believe that the macro below
235; was at all possible.
236;
237; The macro is written in a continuation-passing style. A continuation
238; typically has the following structure: (k-head ! . args)
239; When the continuation is invoked, we expand into
240; (k-head <computed-result> . arg). That is, the dedicated symbol !
241; is the placeholder for the result.
242;
243; It seems that the most modular way to write the run-test macro would
244; be the following
245;
246; (define-syntax run-test
247;  (syntax-rules ()
248;   ((run-test . ?body)
249;     (letrec-syntax
250;       ((scan-exp                      ; (scan-exp body k)
251;        (syntax-rules (quote quasiquote !)
252;          ((scan-exp (quote (hd . tl)) k)
253;            (scan-lit-lst (hd . tl) (do-wrap ! quasiquote k)))
254;          ((scan-exp (quote x) (k-head ! . args))
255;            (k-head
256;              (if (string? (quote x)) (string->symbol (quote x)) (quote x))
257;              . args))
258;          ((scan-exp (hd . tl) k)
259;            (scan-exp hd (do-tl ! scan-exp tl k)))
260;          ((scan-exp x (k-head ! . args))
261;            (k-head x . args))))
262;       (do-tl
263;         (syntax-rules (!)
264;           ((do-tl processed-hd fn () (k-head ! . args))
265;             (k-head (processed-hd) . args))
266;           ((do-tl processed-hd fn old-tl k)
267;             (fn old-tl (do-cons ! processed-hd k)))))
268;       ...
269;       (do-finish
270;         (syntax-rules ()
271;           ((do-finish (new-body)) new-body)
272;           ((do-finish new-body) (begin . new-body))))
273;       ...
274;       (scan-exp ?body (do-finish !))
275; ))))
276;
277; Alas, that doesn't work on all systems. We hit yet another dark
278; corner of the R5RS macros. The reason is that run-test is used in
279; the code below to introduce definitions. For example:
280; (run-test
281;  (define (ssax:warn port msg . other-msg)
282;    (apply cerr (cons* nl "Warning: " msg other-msg)))
283; )
284; This code expands to
285; (begin
286;    (define (ssax:warn port msg . other-msg) ...))
287; so the definition gets spliced in into the top level. Right?
288; Well, On Petite Chez Scheme it is so. However, many other systems
289; don't like this approach. The reason is that the invocation of
290; (run-test (define (ssax:warn port msg . other-msg) ...))
291; first expands into
292; (letrec-syntax (...)
293;   (scan-exp ((define (ssax:warn port msg . other-msg) ...)) ...))
294; because of the presence of (letrec-syntax ...), the begin form that
295; is generated eventually is no longer at the top level! The begin
296; form in Scheme is an overloading of two distinct forms: top-level
297; begin and the other begin. The forms have different rules: for example,
298; (begin (define x 1)) is OK for a top-level begin but not OK for
299; the other begin. Some Scheme systems see the that the macro
300; (run-test ...) expands into (letrec-syntax ...) and decide right there
301; that any further (begin ...) forms are NOT top-level begin forms.
302; The only way out is to make sure all our macros are top-level.
303; The best approach <sigh> seems to be to make run-test one huge
304; top-level macro.
305
306
307;; (define-syntax run-test
308;;  (syntax-rules (define)
309;;    ((run-test "scan-exp" (define vars body))
310;;     (define vars (run-test "scan-exp" body)))
311;;    ((run-test "scan-exp" ?body)
312;;     (letrec-syntax
313;;       ((scan-exp                     ; (scan-exp body k)
314;;       (syntax-rules (quote quasiquote !)
315;;         ((scan-exp '() (k-head ! . args))
316;;           (k-head '() . args))
317;;         ((scan-exp (quote (hd . tl)) k)
318;;           (scan-lit-lst (hd . tl) (do-wrap ! quasiquote k)))
319;;         ((scan-exp (quasiquote (hd . tl)) k)
320;;           (scan-lit-lst (hd . tl) (do-wrap ! quasiquote k)))
321;;         ((scan-exp (quote x) (k-head ! . args))
322;;           (k-head
323;;             (if (string? (quote x)) (string->symbol (quote x)) (quote x))
324;;             . args))
325;;         ((scan-exp (hd . tl) k)
326;;           (scan-exp hd (do-tl ! scan-exp tl k)))
327;;         ((scan-exp x (k-head ! . args))
328;;           (k-head x . args))))
329;;      (do-tl
330;;        (syntax-rules (!)
331;;          ((do-tl processed-hd fn () (k-head ! . args))
332;;            (k-head (processed-hd) . args))
333;;          ((do-tl processed-hd fn old-tl k)
334;;            (fn old-tl (do-cons ! processed-hd k)))))
335;;      (do-cons
336;;        (syntax-rules (!)
337;;          ((do-cons processed-tl processed-hd (k-head ! . args))
338;;            (k-head (processed-hd . processed-tl) . args))))
339;;      (do-wrap
340;;        (syntax-rules (!)
341;;          ((do-wrap val fn (k-head ! . args))
342;;            (k-head (fn val) . args))))
343;;      (do-finish
344;;        (syntax-rules ()
345;;          ((do-finish new-body) new-body)))
346
347;;      (scan-lit-lst                   ; scan literal list
348;;        (syntax-rules (quote unquote unquote-splicing !)
349;;         ((scan-lit-lst '() (k-head ! . args))
350;;           (k-head '() . args))
351;;         ((scan-lit-lst (quote (hd . tl)) k)
352;;           (do-tl quote scan-lit-lst ((hd . tl)) k))
353;;         ((scan-lit-lst (unquote x) k)
354;;           (scan-exp x (do-wrap ! unquote k)))
355;;         ((scan-lit-lst (unquote-splicing x) k)
356;;           (scan-exp x (do-wrap ! unquote-splicing k)))
357;;         ((scan-lit-lst (quote x) (k-head ! . args))
358;;           (k-head
359;;             ,(if (string? (quote x)) (string->symbol (quote x)) (quote x))
360;;             . args))
361;;          ((scan-lit-lst (hd . tl) k)
362;;            (scan-lit-lst hd (do-tl ! scan-lit-lst tl k)))
363;;          ((scan-lit-lst x (k-head ! . args))
364;;            (k-head x . args))))
365;;      )
366;;       (scan-exp ?body (do-finish !))))
367;;   ((run-test body ...)
368;;    (begin
369;;      (run-test "scan-exp" body) ...))
370;; ))
371
372;========================================================================
373;                               Data Types
374
375; TAG-KIND
376;       a symbol 'START, 'END, 'PI, 'DECL, 'COMMENT, 'CDSECT
377;               or 'ENTITY-REF that identifies a markup token
378
379; UNRES-NAME
380;       a name (called GI in the XML Recommendation) as given in an xml
381;       document for a markup token: start-tag, PI target, attribute name.
382;       If a GI is an NCName, UNRES-NAME is this NCName converted into
383;       a Scheme symbol. If a GI is a QName, UNRES-NAME is a pair of
384;       symbols: (PREFIX . LOCALPART)
385
386; RES-NAME
387;       An expanded name, a resolved version of an UNRES-NAME.
388;       For an element or an attribute name with a non-empty namespace URI,
389;       RES-NAME is a pair of symbols, (URI-SYMB . LOCALPART).
390;       Otherwise, it's a single symbol.
391
392; ELEM-CONTENT-MODEL
393;       A symbol:
394;       ANY       - anything goes, expect an END tag.
395;       EMPTY-TAG - no content, and no END-tag is coming
396;       EMPTY     - no content, expect the END-tag as the next token
397;       PCDATA    - expect character data only, and no children elements
398;       MIXED
399;       ELEM-CONTENT
400
401; URI-SYMB
402;       A symbol representing a namespace URI -- or other symbol chosen
403;       by the user to represent URI. In the former case,
404;       URI-SYMB is created by %-quoting of bad URI characters and
405;       converting the resulting string into a symbol.
406
407; NAMESPACES
408;       A list representing namespaces in effect. An element of the list
409;       has one of the following forms:
410;       (PREFIX URI-SYMB . URI-SYMB) or
411;       (PREFIX USER-PREFIX . URI-SYMB)
412;               USER-PREFIX is a symbol chosen by the user
413;               to represent the URI.
414;       (#f USER-PREFIX . URI-SYMB)
415;               Specification of the user-chosen prefix and a URI-SYMBOL.
416;       (*DEFAULT* USER-PREFIX . URI-SYMB)
417;               Declaration of the default namespace
418;       (*DEFAULT* #f . #f)
419;               Un-declaration of the default namespace. This notation
420;               represents overriding of the previous declaration
421;       A NAMESPACES list may contain several elements for the same PREFIX.
422;       The one closest to the beginning of the list takes effect.
423
424; ATTLIST
425;       An ordered collection of (NAME . VALUE) pairs, where NAME is
426;       a RES-NAME or an UNRES-NAME. The collection is an ADT
427
428; STR-HANDLER
429;       A procedure of three arguments: STRING1 STRING2 SEED
430;       returning a new SEED
431;       The procedure is supposed to handle a chunk of character data
432;       STRING1 followed by a chunk of character data STRING2.
433;       STRING2 is a short string, often "\n" and even ""
434
435; ENTITIES
436;       An assoc list of pairs:
437;          (named-entity-name . named-entity-body)
438;       where named-entity-name is a symbol under which the entity was
439;       declared, named-entity-body is either a string, or
440;       (for an external entity) a thunk that will return an
441;       input port (from which the entity can be read).
442;       named-entity-body may also be #f. This is an indication that a
443;       named-entity-name is currently being expanded. A reference to
444;       this named-entity-name will be an error: violation of the
445;       WFC nonrecursion.
446
447; XML-TOKEN -- a record
448
449; In Gambit, you can use the following declaration:
450; (define-structure xml-token kind head)
451; The following declaration is "standard" as it follows SRFI-9:
452; (define-record-type  xml-token  (make-xml-token kind head)  xml-token?
453;   (kind  xml-token-kind)
454;   (head  xml-token-head) )
455; No field mutators are declared as SSAX is a pure functional parser
456;
457; But to make the code more portable, we define xml-token simply as
458; a pair. It suffices for us. Furthermore, xml-token-kind and xml-token-head
459; can be defined as simple procedures. However, they are declared as
460; macros below for efficiency.
461
462; procedure: make-xml-token KIND HEAD
463; This creates an XML token.
464(define (make-xml-token kind head) (cons kind head))
465; procedure: xml-token? THING
466(define xml-token? pair?)
467; syntax: xml-token-kind XML-TOKEN
468(define-syntax xml-token-kind 
469  (syntax-rules () ((xml-token-kind token) (car token))))
470; syntax: xml-token-head XML-TOKEN
471(define-syntax xml-token-head 
472  (syntax-rules () ((xml-token-head token) (cdr token))))
473
474; (define-macro xml-token-kind (lambda (token) `(car ,token)))
475; (define-macro xml-token-head (lambda (token) `(cdr ,token)))
476
477; This record represents a markup, which is, according to the XML
478; Recommendation, "takes the form of start-tags, end-tags, empty-element tags,
479; entity references, character references, comments, CDATA section delimiters,
480; document type declarations, and processing instructions."
481;
482;       kind -- a TAG-KIND
483;       head -- an UNRES-NAME. For xml-tokens of kinds 'COMMENT and
484;               'CDSECT, the head is #f
485;
486; For example,
487;       <P>  => kind='START, head='P
488;       </P> => kind='END, head='P
489;       <BR/> => kind='EMPTY-EL, head='BR
490;       <!DOCTYPE OMF ...> => kind='DECL, head='DOCTYPE
491;       <?xml version="1.0"?> => kind='PI, head='xml
492;       &my-ent; => kind = 'ENTITY-REF, head='my-ent
493;
494; Character references are not represented by xml-tokens as these references
495; are transparently resolved into the corresponding characters.
496;
497
498
499
500; XML-DECL -- a record
501
502; The following is Gambit-specific, see below for a portable declaration
503;(define-structure xml-decl elems entities notations)
504
505; The record represents a datatype of an XML document: the list of
506; declared elements and their attributes, declared notations, list of
507; replacement strings or loading procedures for parsed general
508; entities, etc. Normally an xml-decl record is created from a DTD or
509; an XML Schema, although it can be created and filled in in many other
510; ways (e.g., loaded from a file).
511;
512; elems: an (assoc) list of decl-elem or #f. The latter instructs
513;       the parser to do no validation of elements and attributes.
514;
515; decl-elem: declaration of one element:
516;       (elem-name elem-content decl-attrs)
517;       elem-name is an UNRES-NAME for the element.
518;       elem-content is an ELEM-CONTENT-MODEL.
519;       decl-attrs is an ATTLIST, of (ATTR-NAME . VALUE) associations
520; !!!This element can declare a user procedure to handle parsing of an
521; element (e.g., to do a custom validation, or to build a hash of
522; IDs as they're encountered).
523;
524; decl-attr: an element of an ATTLIST, declaration of one attribute
525;       (attr-name content-type use-type default-value)
526;       attr-name is an UNRES-NAME for the declared attribute
527;       content-type is a symbol: CDATA, NMTOKEN, NMTOKENS, ...
528;               or a list of strings for the enumerated type.
529;       use-type is a symbol: REQUIRED, IMPLIED, FIXED
530;       default-value is a string for the default value, or #f if not given.
531;
532;
533
534; see a function make-empty-xml-decl to make a XML declaration entry
535; suitable for a non-validating parsing.
536
537
538;-------------------------
539; Utilities
540
541; The following is a function that is often used in validation tests,
542; to make sure that the computed result matches the expected one.
543; This function is a standard equal? predicate with one exception.
544; On Scheme systems where (string->symbol "A") and a symbol A
545; are the same, equal_? is precisely equal?
546; On other Scheme systems, we compare symbols disregarding their case.
547; Since this function is used only in tests, we don't have to
548; strive to make it efficient.
549(run-test
550 (define (equal_? e1 e2)
551   (if (eq? 'A (string->symbol "A")) (equal? e1 e2)
552       (cond
553        ((symbol? e1)
554         (and (symbol? e2) 
555              (string-ci=? (symbol->string e1) (symbol->string e2))))
556        ((pair? e1)
557         (and (pair? e2)
558              (equal_? (car e1) (car e2)) (equal_? (cdr e1) (cdr e2))))
559        ((vector? e1)
560         (and (vector? e2) (equal_? (vector->list e1) (vector->list e2))))
561        (else
562         (equal? e1 e2)))))
563)
564
565; The following function, which is often used in validation tests,
566; lets us conveniently enter newline, CR and tab characters in a character
567; string.
568;       unesc-string: ESC-STRING -> STRING
569; where ESC-STRING is a character string that may contain
570;    %n  -- for #\newline
571;    %r  -- for #\return
572;    %t  -- for #\tab
573;    %%  -- for #\%
574;
575; The result of unesc-string is a character string with all %-combinations
576; above replaced with their character equivalents
577
578(run-test
579 (define (unesc-string str)
580   (call-with-input-string str
581     (lambda (port)
582       (let loop ((frags '()))
583         (let* ((token (next-token '() '(#\% *eof*) "unesc-string" port))
584                (cterm (read-char port))
585                (frags (cons token frags)))
586           (if (eof-object? cterm) (string-concatenate-reverse/shared frags)
587             (let ((cchar (read-char port)))  ; char after #\%
588               (if (eof-object? cchar)
589                 (error "unexpected EOF after reading % in unesc-string:" str)
590                 (loop
591                   (cons
592                     (case cchar
593                       ((#\n) (string #\newline))
594                       ((#\r) (string char-return))
595                       ((#\t) (string char-tab))
596                       ((#\%) "%")
597                       (else (error "bad %-char in unesc-string:" cchar)))
598                     frags))))))))))
599)
600             
601
602; Test if a string is made of only whitespace
603; An empty string is considered made of whitespace as well
604(define (string-whitespace? str)
605  (let ((len (string-length str)))
606    (cond
607     ((zero? len) #t)
608     ((= 1 len) (char-whitespace? (string-ref str 0)))
609     ((= 2 len) (and (char-whitespace? (string-ref str 0))
610                     (char-whitespace? (string-ref str 1))))
611     (else
612      (let loop ((i 0))
613        (or (>= i len)
614            (and (char-whitespace? (string-ref str i))
615                 (loop (inc i)))))))))
616
617; Find val in alist
618; Return (values found-el remaining-alist) or
619;        (values #f alist)
620
621(define (assq-values val alist)
622  (let loop ((alist alist) (scanned '()))
623    (cond
624     ((null? alist) (values #f scanned))
625     ((equal? val (caar alist))
626      (values (car alist) (append scanned (cdr alist))))
627     (else
628      (loop (cdr alist) (cons (car alist) scanned))))))
629
630; From SRFI-1
631(define (fold-right kons knil lis1)
632    (let recur ((lis lis1))
633       (if (null? lis) knil
634            (let ((head (car lis)))
635              (kons head (recur (cdr lis)))))))
636
637; Left fold combinator for a single list
638(define (fold kons knil lis1)
639  (let lp ((lis lis1) (ans knil))
640    (if (null? lis) ans
641      (lp (cdr lis) (kons (car lis) ans)))))
642
643
644
645;========================================================================
646;               Lower-level parsers and scanners
647;
648; They deal with primitive lexical units (Names, whitespaces, tags)
649; and with pieces of more generic productions. Most of these parsers
650; must be called in appropriate context. For example, ssax:complete-start-tag
651; must be called only when the start-tag has been detected and its GI
652; has been read.
653
654;------------------------------------------------------------------------
655;                       Low-level parsing code
656
657; Skip the S (whitespace) production as defined by
658; [3] S ::= (#x20 | #x9 | #xD | #xA)
659; The procedure returns the first not-whitespace character it
660; encounters while scanning the PORT. This character is left
661; on the input stream.
662
663(define ssax:S-chars (map ascii->char '(32 10 9 13)))
664
665(define (ssax:skip-S port)
666  (skip-while ssax:S-chars port))
667
668
669; Read a Name lexem and return it as string
670; [4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':'
671;                  | CombiningChar | Extender
672; [5] Name ::= (Letter | '_' | ':') (NameChar)*
673;
674; This code supports the XML Namespace Recommendation REC-xml-names,
675; which modifies the above productions as follows:
676;
677; [4] NCNameChar ::= Letter | Digit | '.' | '-' | '_'
678;                       | CombiningChar | Extender
679; [5] NCName ::= (Letter | '_') (NCNameChar)*
680; As the Rec-xml-names says,
681; "An XML document conforms to this specification if all other tokens
682; [other than element types and attribute names] in the document which
683; are required, for XML conformance, to match the XML production for
684; Name, match this specification's production for NCName."
685; Element types and attribute names must match the production QName,
686; defined below.
687
688; Check to see if a-char may start a NCName
689(define (ssax:ncname-starting-char? a-char)
690  (and (char? a-char)
691    (or
692      (char-alphabetic? a-char)
693      (char=? #\_ a-char))))
694
695
696; Read a NCName starting from the current position in the PORT and
697; return it as a symbol.
698(define (ssax:read-NCName port)
699  (let ((first-char (peek-char port)))
700    (or (ssax:ncname-starting-char? first-char)
701      (parser-error port "XMLNS [4] for '" first-char "'")))
702  (string->symbol
703    (next-token-of
704      (lambda (c)
705        (cond
706          ((eof-object? c) #f)
707          ((char-alphabetic? c) c)
708          ((string-index "0123456789.-_" c) c)
709          (else #f)))
710      port)))
711
712; Read a (namespace-) Qualified Name, QName, from the current
713; position in the PORT.
714; From REC-xml-names:
715;       [6] QName ::= (Prefix ':')? LocalPart
716;       [7] Prefix ::= NCName
717;       [8] LocalPart ::= NCName
718; Return: an UNRES-NAME
719(define (ssax:read-QName port)
720  (let ((prefix-or-localpart (ssax:read-NCName port)))
721    (case (peek-char port)
722      ((#\:)                    ; prefix was given after all
723       (read-char port)         ; consume the colon
724       (cons prefix-or-localpart (ssax:read-NCName port)))
725      (else prefix-or-localpart) ; Prefix was omitted
726      )))
727
728; The prefix of the pre-defined XML namespace
729(define ssax:Prefix-XML (string->symbol "xml"))
730
731(run-test
732 (assert (eq? '_
733                 (call-with-input-string "_" ssax:read-NCName)))
734 (assert (eq? '_
735                 (call-with-input-string "_" ssax:read-QName)))
736 (assert (eq? (string->symbol "_abc_")
737              (call-with-input-string "_abc_;" ssax:read-NCName)))
738 (assert (eq? (string->symbol "_abc_")
739              (call-with-input-string "_abc_;" ssax:read-QName)))
740 (assert (eq? (string->symbol "_a.b")
741              (call-with-input-string "_a.b " ssax:read-QName)))
742 (assert (equal? (cons (string->symbol "_a.b") (string->symbol "d.1-ef-"))
743              (call-with-input-string "_a.b:d.1-ef-;" ssax:read-QName)))
744 (assert (equal? (cons (string->symbol "a") (string->symbol "b"))
745              (call-with-input-string "a:b:c" ssax:read-QName)))
746
747 (assert (failed? (call-with-input-string ":abc" ssax:read-NCName)))
748 (assert (failed? (call-with-input-string "1:bc" ssax:read-NCName)))
749)
750
751; Compare one RES-NAME or an UNRES-NAME with the other.
752; Return a symbol '<, '>, or '= depending on the result of
753; the comparison.
754; Names without PREFIX are always smaller than those with the PREFIX.
755(define name-compare
756  (letrec ((symbol-compare
757            (lambda (symb1 symb2)
758              (cond
759               ((eq? symb1 symb2) '=)
760               ((string<? (symbol->string symb1) (symbol->string symb2))
761                '<)
762               (else '>)))))
763    (lambda (name1 name2)
764      (cond
765       ((symbol? name1) (if (symbol? name2) (symbol-compare name1 name2)
766                            '<))
767       ((symbol? name2) '>)
768       ((eq? name2 ssax:largest-unres-name) '<)
769       ((eq? name1 ssax:largest-unres-name) '>)
770       ((eq? (car name1) (car name2))   ; prefixes the same
771        (symbol-compare (cdr name1) (cdr name2)))
772       (else (symbol-compare (car name1) (car name2)))))))
773
774; An UNRES-NAME that is postulated to be larger than anything that can occur in
775; a well-formed XML document.
776; name-compare enforces this postulate.
777(define ssax:largest-unres-name (cons
778                                  (string->symbol "#LARGEST-SYMBOL")
779                                  (string->symbol "#LARGEST-SYMBOL")))
780
781(run-test
782 (assert (eq? '= (name-compare 'ABC 'ABC)))
783 (assert (eq? '< (name-compare 'ABC 'ABCD)))
784 (assert (eq? '> (name-compare 'XB 'ABCD)))
785 (assert (eq? '> (name-compare '(HTML . PRE) 'PRE)))
786 (assert (eq? '< (name-compare 'HTML '(HTML . PRE))))
787 (assert (eq? '= (name-compare '(HTML . PRE) '(HTML . PRE))))
788 (assert (eq? '< (name-compare '(HTML . PRE) '(XML . PRE))))
789 (assert (eq? '> (name-compare '(HTML . PRE) '(HTML . P))))
790 (assert (eq? '< (name-compare '(HTML . PRE) ssax:largest-unres-name)))
791 (assert (eq? '< (name-compare '(ZZZZ . ZZZ) ssax:largest-unres-name)))
792 (assert (eq? '> (name-compare ssax:largest-unres-name '(ZZZZ . ZZZ) )))
793)
794
795
796
797; procedure: ssax:read-markup-token PORT
798; This procedure starts parsing of a markup token. The current position
799; in the stream must be #\<. This procedure scans enough of the input stream
800; to figure out what kind of a markup token it is seeing. The procedure returns
801; an xml-token structure describing the token. Note, generally reading
802; of the current markup is not finished! In particular, no attributes of
803; the start-tag token are scanned.
804;
805; Here's a detailed break out of the return values and the position in the PORT
806; when that particular value is returned:
807;       PI-token:       only PI-target is read.
808;                       To finish the Processing Instruction and disregard it,
809;                       call ssax:skip-pi. ssax:read-attributes may be useful
810;                       as well (for PIs whose content is attribute-value
811;                       pairs)
812;       END-token:      The end tag is read completely; the current position
813;                       is right after the terminating #\> character.   
814;       COMMENT         is read and skipped completely. The current position
815;                       is right after "-->" that terminates the comment.
816;       CDSECT          The current position is right after "<!CDATA["
817;                       Use ssax:read-cdata-body to read the rest.
818;       DECL            We have read the keyword (the one that follows "<!")
819;                       identifying this declaration markup. The current
820;                       position is after the keyword (usually a
821;                       whitespace character)
822;
823;       START-token     We have read the keyword (GI) of this start tag.
824;                       No attributes are scanned yet. We don't know if this
825;                       tag has an empty content either.
826;                       Use ssax:complete-start-tag to finish parsing of
827;                       the token.
828
829(define ssax:read-markup-token ; procedure ssax:read-markup-token port
830 (let ()
831                ; we have read "<!-". Skip through the rest of the comment
832                ; Return the 'COMMENT token as an indication we saw a comment
833                ; and skipped it.
834  (define (skip-comment port)
835    (assert-curr-char '(#\-) "XML [15], second dash" port)
836    (if (not (find-string-from-port? "-->" port))
837      (parser-error port "XML [15], no -->"))
838    (make-xml-token 'COMMENT #f))
839
840                ; we have read "<![" that must begin a CDATA section
841  (define (read-cdata port)
842    (assert (string=? "CDATA[" (read-string 6 port)))
843    (make-xml-token 'CDSECT #f))
844
845  (lambda (port)
846    (assert-curr-char '(#\<) "start of the token" port)
847    (case (peek-char port)
848      ((#\/) (read-char port)
849       (begin0 (make-xml-token 'END (ssax:read-QName port))
850               (ssax:skip-S port)
851               (assert-curr-char '(#\>) "XML [42]" port)))
852      ((#\?) (read-char port) (make-xml-token 'PI (ssax:read-NCName port)))
853      ((#\!)
854       (case (peek-next-char port)
855         ((#\-) (read-char port) (skip-comment port))
856         ((#\[) (read-char port) (read-cdata port))
857         (else (make-xml-token 'DECL (ssax:read-NCName port)))))
858      (else (make-xml-token 'START (ssax:read-QName port)))))
859))
860
861
862; The current position is inside a PI. Skip till the rest of the PI
863(define (ssax:skip-pi port)     
864  (if (not (find-string-from-port? "?>" port))
865    (parser-error port "Failed to find ?> terminating the PI")))
866
867
868; procedure: ssax:read-pi-body-as-string PORT
869; The current position is right after reading the PITarget. We read the
870; body of PI and return is as a string. The port will point to the
871; character right after '?>' combination that terminates PI.
872; [16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
873
874(define (ssax:read-pi-body-as-string port)
875  (ssax:skip-S port)            ; skip WS after the PI target name
876  (string-concatenate/shared
877    (let loop ()
878      (let ((pi-fragment
879             (next-token '() '(#\?) "reading PI content" port)))
880        (if (eqv? #\> (peek-next-char port))
881            (begin
882              (read-char port)
883              (cons pi-fragment '()))
884            (cons* pi-fragment "?" (loop)))))))
885
886(run-test
887 (assert (equal? "p1 content "
888    (call-with-input-string "<?pi1  p1 content ?>"
889      (lambda (port)
890        (ssax:read-markup-token port)
891        (ssax:read-pi-body-as-string port)))))
892 (assert (equal? "pi2? content? ?"
893    (call-with-input-string "<?pi2 pi2? content? ??>"
894      (lambda (port)
895        (ssax:read-markup-token port)
896        (ssax:read-pi-body-as-string port)))))
897)
898
899;(define (ssax:read-pi-body-as-name-values port)
900
901; procedure: ssax:skip-internal-dtd PORT
902; The current pos in the port is inside an internal DTD subset
903; (e.g., after reading #\[ that begins an internal DTD subset)
904; Skip until the "]>" combination that terminates this DTD
905(define (ssax:skip-internal-dtd port)     
906  (if (not (find-string-from-port? "]>" port))
907    (parser-error port
908                  "Failed to find ]> terminating the internal DTD subset")))
909
910
911; procedure+: ssax:read-cdata-body PORT STR-HANDLER SEED
912;
913; This procedure must be called after we have read a string "<![CDATA["
914; that begins a CDATA section. The current position must be the first
915; position of the CDATA body. This function reads _lines_ of the CDATA
916; body and passes them to a STR-HANDLER, a character data consumer.
917;
918; The str-handler is a STR-HANDLER, a procedure STRING1 STRING2 SEED.
919; The first STRING1 argument to STR-HANDLER never contains a newline.
920; The second STRING2 argument often will. On the first invocation of
921; the STR-HANDLER, the seed is the one passed to ssax:read-cdata-body
922; as the third argument. The result of this first invocation will be
923; passed as the seed argument to the second invocation of the line
924; consumer, and so on. The result of the last invocation of the
925; STR-HANDLER is returned by the ssax:read-cdata-body.  Note a
926; similarity to the fundamental 'fold' iterator.
927;
928; Within a CDATA section all characters are taken at their face value,
929; with only three exceptions:
930;       CR, LF, and CRLF are treated as line delimiters, and passed
931;       as a single #\newline to the STR-HANDLER
932;       "]]>" combination is the end of the CDATA section.
933;       &gt; is treated as an embedded #\> character
934; Note, &lt; and &amp; are not specially recognized (and are not expanded)!
935
936(define ssax:read-cdata-body 
937  (let ((cdata-delimiters (list char-return #\newline #\] #\&)))
938
939    (lambda (port str-handler seed)
940      (let loop ((seed seed))
941        (let ((fragment (next-token '() cdata-delimiters
942                                    "reading CDATA" port)))
943                        ; that is, we're reading the char after the 'fragment'
944     (case (read-char port)     
945       ((#\newline) (loop (str-handler fragment nl seed)))
946       ((#\])
947        (if (not (eqv? (peek-char port) #\]))
948            (loop (str-handler fragment "]" seed))
949            (let check-after-second-braket
950                ((seed (if (string-null? fragment) seed
951                           (str-handler fragment "" seed))))
952              (case (peek-next-char port)       ; after the second bracket
953                ((#\>) (read-char port) seed)   ; we have read "]]>"
954                ((#\]) (check-after-second-braket
955                        (str-handler "]" "" seed)))
956                (else (loop (str-handler "]]" "" seed)))))))
957       ((#\&)           ; Note that #\& within CDATA may stand for itself
958        (let ((ent-ref  ; it does not have to start an entity ref
959               (next-token-of (lambda (c) 
960                 (and (not (eof-object? c)) (char-alphabetic? c) c)) port)))
961          (cond         ; "&gt;" is to be replaced with #\>
962           ((and (string=? "gt" ent-ref) (eqv? (peek-char port) #\;))
963            (read-char port)
964            (loop (str-handler fragment ">" seed)))
965           (else
966            (loop 
967             (str-handler ent-ref ""
968                          (str-handler fragment "&" seed)))))))
969       (else            ; Must be CR: if the next char is #\newline, skip it
970         (if (eqv? (peek-char port) #\newline) (read-char port))
971         (loop (str-handler fragment nl seed)))
972       ))))))
973
974; a few lines of validation code
975(run-test (letrec
976  ((consumer (lambda (fragment foll-fragment seed)
977     (cons* (if (equal? foll-fragment (string #\newline))
978                " NL" foll-fragment) fragment seed)))
979   (test (lambda (str expected-result)
980           (newline) (display "body: ") (write str)
981           (newline) (display "Result: ")
982           (let ((result
983                   (reverse
984                     (call-with-input-string (unesc-string str)
985                       (lambda (port) (ssax:read-cdata-body port consumer '()))
986                       ))))
987             (write result)
988             (assert (equal? result expected-result)))))
989   )
990  (test "]]>" '())
991  (test "abcd]]>" '("abcd" ""))
992  (test "abcd]]]>" '("abcd" "" "]" ""))
993  (test "abcd]]]]>" '("abcd" "" "]" "" "]" ""))
994  (test "abcd]]]]]>" '("abcd" "" "]" "" "]" "" "]" ""))
995  (test "abcd]]]a]]>" '("abcd" "" "]" "" "]]" "" "a" ""))
996  (test "abc%r%ndef%n]]>" '("abc" " NL" "def" " NL"))
997  (test "%r%n%r%n]]>" '("" " NL" "" " NL"))
998  (test "%r%n%r%na]]>" '("" " NL" "" " NL" "a" ""))
999  (test "%r%r%r%na]]>" '("" " NL" "" " NL" "" " NL" "a" ""))
1000  (test "abc&!!!]]>" '("abc" "&" "" "" "!!!" ""))
1001  (test "abc]]&gt;&gt&amp;]]]&gt;and]]>"
1002    '("abc" "" "]]" "" "" ">" "" "&" "gt" "" "" "&" "amp" "" ";" "" "]" ""
1003      "]]" "" "" ">" "and" ""))
1004))
1005
1006           
1007; procedure+: ssax:read-char-ref PORT
1008;
1009; [66]  CharRef ::=  '&#' [0-9]+ ';'
1010;                  | '&#x' [0-9a-fA-F]+ ';'
1011;
1012; This procedure must be called after we we have read "&#"
1013; that introduces a char reference.
1014; The procedure reads this reference and returns the corresponding char
1015; The current position in PORT will be after ";" that terminates
1016; the char reference
1017; Faults detected:
1018;       WFC: XML-Spec.html#wf-Legalchar
1019;
1020; According to Section "4.1 Character and Entity References"
1021; of the XML Recommendation:
1022;  "[Definition: A character reference refers to a specific character
1023;   in the ISO/IEC 10646 character set, for example one not directly
1024;   accessible from available input devices.]"
1025; Therefore, we use a ucscode->char function to convert a character
1026; code into the character -- *regardless* of the current character
1027; encoding of the input stream.
1028
1029(define (ssax:read-char-ref port)
1030  (let* ((base
1031           (cond ((eqv? (peek-char port) #\x) (read-char port) 16)
1032                 (else 10)))
1033         (name (next-token '() '(#\;) "XML [66]" port))
1034         (char-code (string->number name base)))
1035    (read-char port)    ; read the terminating #\; char
1036    (if (integer? char-code) (ucscode->char char-code)
1037      (parser-error port "[wf-Legalchar] broken for '" name "'"))))
1038
1039
1040; procedure+: ssax:handle-parsed-entity PORT NAME ENTITIES
1041;               CONTENT-HANDLER STR-HANDLER SEED
1042;
1043; Expand and handle a parsed-entity reference
1044; port - a PORT
1045; name - the name of the parsed entity to expand, a symbol
1046; entities - see ENTITIES
1047; content-handler -- procedure PORT ENTITIES SEED
1048;       that is supposed to return a SEED
1049; str-handler - a STR-HANDLER. It is called if the entity in question
1050; turns out to be a pre-declared entity
1051;
1052; The result is the one returned by CONTENT-HANDLER or STR-HANDLER
1053; Faults detected:
1054;       WFC: XML-Spec.html#wf-entdeclared
1055;       WFC: XML-Spec.html#norecursion
1056
1057(define ssax:predefined-parsed-entities
1058  `(
1059    (,(string->symbol "amp") . "&")
1060    (,(string->symbol "lt") . "<")
1061    (,(string->symbol "gt") . ">")
1062    (,(string->symbol "apos") . "'")
1063    (,(string->symbol "quot") . "\"")))
1064
1065(define (ssax:handle-parsed-entity port name entities
1066                                   content-handler str-handler seed)
1067  (cond   ; First we check the list of the declared entities
1068   ((assq name entities) =>
1069    (lambda (decl-entity)
1070      (let ((ent-body (cdr decl-entity)) ; mark the list to prevent recursion
1071            (new-entities (cons (cons name #f) entities)))
1072        (cond
1073         ((string? ent-body)
1074          (call-with-input-string ent-body
1075             (lambda (port) (content-handler port new-entities seed))))
1076         ((procedure? ent-body)
1077          (let ((port (ent-body)))
1078            (begin0
1079             (content-handler port new-entities seed)
1080             (close-input-port port))))
1081         (else
1082          (parser-error port "[norecursion] broken for " name))))))
1083    ((assq name ssax:predefined-parsed-entities)
1084     => (lambda (decl-entity)
1085          (str-handler (cdr decl-entity) "" seed)))
1086    (else (parser-error port "[wf-entdeclared] broken for " name))))
1087
1088
1089
1090; procedure: make-empty-attlist
1091; The ATTLIST Abstract Data Type
1092; Currently is implemented as an assoc list sorted in the ascending
1093; order of NAMES.
1094
1095(define (make-empty-attlist) '())
1096
1097; procedure: attlist-add ATTLIST NAME-VALUE-PAIR
1098; Add a name-value pair to the existing attlist preserving the order
1099; Return the new list, in the sorted ascending order.
1100; Return #f if a pair with the same name already exists in the attlist
1101
1102(define (attlist-add attlist name-value)
1103  (if (null? attlist) (cons name-value attlist)
1104      (case (name-compare (car name-value) (caar attlist))
1105        ((=) #f)
1106        ((<) (cons name-value attlist))
1107        (else (cons (car attlist) (attlist-add (cdr attlist) name-value)))
1108        )))
1109
1110; procedure: attlist-null? ATTLIST
1111(define attlist-null? null?)
1112
1113; procedure: attlist-remove-top ATTLIST
1114; Given an non-null attlist, return a pair of values: the top and the rest
1115(define (attlist-remove-top attlist)
1116  (values (car attlist) (cdr attlist)))
1117
1118; procedure: attliast->alist
1119(define (attlist->alist attlist) attlist)
1120; procedure: attlist-fold
1121(define attlist-fold fold)
1122
1123; procedure+:   ssax:read-attributes PORT ENTITIES
1124;
1125; This procedure reads and parses a production Attribute*
1126; [41] Attribute ::= Name Eq AttValue
1127; [10] AttValue ::=  '"' ([^<&"] | Reference)* '"'
1128;                 | "'" ([^<&'] | Reference)* "'"
1129; [25] Eq ::= S? '=' S?
1130;
1131;
1132; The procedure returns an ATTLIST, of Name (as UNRES-NAME), Value (as string)
1133; pairs. The current character on the PORT is a non-whitespace character
1134; that is not an ncname-starting character.
1135;
1136; Note the following rules to keep in mind when reading an 'AttValue'
1137; "Before the value of an attribute is passed to the application
1138; or checked for validity, the XML processor must normalize it as follows:
1139; - a character reference is processed by appending the referenced
1140;   character to the attribute value
1141; - an entity reference is processed by recursively processing the
1142;   replacement text of the entity [see ENTITIES]
1143;   [named entities amp lt gt quot apos are assumed pre-declared]
1144; - a whitespace character (#x20, #xD, #xA, #x9) is processed by appending #x20
1145;   to the normalized value, except that only a single #x20 is appended for a
1146;   "#xD#xA" sequence that is part of an external parsed entity or the
1147;   literal entity value of an internal parsed entity
1148; - other characters are processed by appending them to the normalized value "
1149;
1150;
1151; Faults detected:
1152;       WFC: XML-Spec.html#CleanAttrVals
1153;       WFC: XML-Spec.html#uniqattspec
1154
1155(define ssax:read-attributes  ; ssax:read-attributes port entities
1156 (let ((value-delimeters (append ssax:S-chars '(#\< #\&))))
1157                ; Read the AttValue from the PORT up to the delimiter
1158                ; (which can be a single or double-quote character,
1159                ; or even a symbol *eof*)
1160                ; 'prev-fragments' is the list of string fragments, accumulated
1161                ; so far, in reverse order.
1162                ; Return the list of fragments with newly read fragments
1163                ; prepended.
1164  (define (read-attrib-value delimiter port entities prev-fragments)
1165    (let* ((new-fragments
1166            (cons (next-token '() (cons delimiter value-delimeters)
1167                              "XML [10]" port)
1168             prev-fragments))
1169           (cterm (read-char port)))
1170      (cond
1171        ((or (eof-object? cterm) (eqv? cterm delimiter))
1172          new-fragments)
1173        ((eqv? cterm char-return)       ; treat a CR and CRLF as a LF
1174          (if (eqv? (peek-char port) #\newline) (read-char port))
1175          (read-attrib-value delimiter port entities
1176                             (cons " " new-fragments)))
1177        ((memv cterm ssax:S-chars)
1178          (read-attrib-value delimiter port entities
1179                             (cons " " new-fragments)))
1180        ((eqv? cterm #\&)
1181          (cond
1182            ((eqv? (peek-char port) #\#)
1183              (read-char port)
1184              (read-attrib-value delimiter port entities
1185                (cons (string (ssax:read-char-ref port)) new-fragments)))
1186            (else
1187              (read-attrib-value delimiter port entities
1188                (read-named-entity port entities new-fragments)))))
1189        (else (parser-error port "[CleanAttrVals] broken")))))
1190
1191                ; we have read "&" that introduces a named entity reference.
1192                ; read this reference and return the result of
1193                ; normalizing of the corresponding string
1194                ; (that is, read-attrib-value is applied to the replacement
1195                ; text of the entity)
1196                ; The current position will be after ";" that terminates
1197                ; the entity reference
1198  (define (read-named-entity port entities fragments)
1199    (let ((name (ssax:read-NCName port)))
1200      (assert-curr-char '(#\;) "XML [68]" port)
1201      (ssax:handle-parsed-entity port name entities
1202        (lambda (port entities fragments)
1203          (read-attrib-value '*eof* port entities fragments))
1204        (lambda (str1 str2 fragments)
1205          (if (equal? "" str2) (cons str1 fragments)
1206              (cons* str2 str1 fragments)))
1207        fragments)))
1208
1209  (lambda (port entities)
1210    (let loop ((attr-list (make-empty-attlist)))
1211      (if (not (ssax:ncname-starting-char? (ssax:skip-S port))) attr-list
1212          (let ((name (ssax:read-QName port)))
1213            (ssax:skip-S port)
1214            (assert-curr-char '(#\=) "XML [25]" port)
1215            (ssax:skip-S port)
1216            (let ((delimiter 
1217                   (assert-curr-char '(#\' #\" ) "XML [10]" port)))
1218              (loop 
1219               (or (attlist-add attr-list 
1220                     (cons name 
1221                           (string-concatenate-reverse/shared
1222                             (read-attrib-value delimiter port entities
1223                                                      '()))))
1224                   (parser-error port "[uniqattspec] broken for " name))))))))
1225))
1226
1227; a few lines of validation code
1228(run-test (letrec
1229    ((test (lambda (str decl-entities expected-res)
1230             (newline) (display "input: ") (write str)
1231             (newline) (display "Result: ")
1232             (let ((result
1233                     (call-with-input-string (unesc-string str)
1234                       (lambda (port)
1235                         (ssax:read-attributes port decl-entities)))))
1236               (write result) (newline)
1237               (assert (equal? result expected-res))))))
1238    (test "" '() '())
1239    (test "href='http://a%tb%r%n%r%n%nc'" '()
1240          `((,(string->symbol "href") . "http://a b   c")))
1241    (test "href='http://a%tb%r%r%n%rc'" '()
1242          `((,(string->symbol "href") . "http://a b   c")))
1243    (test "_1 ='12&amp;' _2= \"%r%n%t12&#10;3\">" '()
1244          `((_1 . "12&") (_2 . ,(unesc-string "  12%n3"))))
1245    (test "%tAbc='&lt;&amp;&gt;&#x0A;'%nNext='12&ent;34' />" 
1246          '((ent . "&lt;xx&gt;"))
1247          `((,(string->symbol "Abc") . ,(unesc-string "<&>%n"))
1248            (,(string->symbol "Next") . "12<xx>34")))
1249    (test "%tAbc='&lt;&amp;&gt;&#x0d;'%nNext='12&ent;34' />" 
1250          '((ent . "&lt;xx&gt;"))
1251          `((,(string->symbol "Abc") . ,(unesc-string "<&>%r"))
1252            (,(string->symbol "Next") . "12<xx>34")))
1253    (test "%tAbc='&lt;&amp;&gt;&#x0A;'%nNext='12&en;34' />" 
1254          `((en . ,(lambda () (open-input-string "&quot;xx&apos;"))))
1255          `((,(string->symbol "Abc") . ,(unesc-string "<&>%n"))
1256            (,(string->symbol "Next") . "12\"xx'34")))
1257    (test "%tAbc='&lt;&amp;&gt;&#x0A;'%nNext='12&ent;34' />" 
1258          '((ent . "&lt;&ent1;T;&gt;") (ent1 . "&amp;"))
1259          `((,(string->symbol "Abc") . ,(unesc-string "<&>%n"))
1260            (,(string->symbol "Next") . "12<&T;>34")))
1261    (assert (failed?
1262        (test "%tAbc='&lt;&amp;&gt;&#x0A;'%nNext='12&ent;34' />" 
1263          '((ent . "<&ent1;T;&gt;") (ent1 . "&amp;")) '())))
1264    (assert (failed?
1265        (test "%tAbc='&lt;&amp;&gt;&#x0A;'%nNext='12&ent;34' />" 
1266          '((ent . "&lt;&ent;T;&gt;") (ent1 . "&amp;")) '())))
1267    (assert (failed?
1268        (test "%tAbc='&lt;&amp;&gt;&#x0A;'%nNext='12&ent;34' />" 
1269          '((ent . "&lt;&ent1;T;&gt;") (ent1 . "&ent;")) '())))
1270    (test "html:href='http://a%tb%r%n%r%n%nc'" '()
1271          `(((,(string->symbol "html") . ,(string->symbol "href"))
1272             . "http://a b   c")))
1273    (test "html:href='ref1' html:src='ref2'" '()
1274          `(((,(string->symbol "html") . ,(string->symbol "href"))
1275             . "ref1")
1276            ((,(string->symbol "html") . ,(string->symbol "src"))
1277             . "ref2")))
1278    (test "html:href='ref1' xml:html='ref2'" '()
1279          `(((,(string->symbol "html") . ,(string->symbol "href"))
1280             . "ref1")
1281            ((,ssax:Prefix-XML . ,(string->symbol "html"))
1282             . "ref2")))
1283    (assert (failed? (test "html:href='ref1' html:href='ref2'" '() '())))
1284    (assert (failed? (test "html:href='<' html:href='ref2'" '() '())))
1285    (assert (failed? (test "html:href='ref1' html:href='&ref2;'" '() '())))
1286))
1287
1288; ssax:resolve-name PORT UNRES-NAME NAMESPACES apply-default-ns?
1289;
1290; Convert an UNRES-NAME to a RES-NAME given the appropriate NAMESPACES
1291; declarations.
1292; the last parameter apply-default-ns? determines if the default
1293; namespace applies (for instance, it does not for attribute names)
1294;
1295; Per REC-xml-names/#nsc-NSDeclared, "xml" prefix is considered pre-declared
1296; and bound to the namespace name "http://www.w3.org/XML/1998/namespace".
1297;
1298; This procedure tests for the namespace constraints:
1299; http://www.w3.org/TR/REC-xml-names/#nsc-NSDeclared
1300
1301(define (ssax:resolve-name port unres-name namespaces apply-default-ns?)
1302  (cond
1303   ((pair? unres-name)          ; it's a QNAME
1304    (cons
1305     (cond
1306     ((assq (car unres-name) namespaces) => cadr)
1307     ((eq? (car unres-name) ssax:Prefix-XML) ssax:Prefix-XML)
1308     (else
1309      (parser-error port "[nsc-NSDeclared] broken; prefix " (car unres-name))))
1310     (cdr unres-name)))
1311   (apply-default-ns?           ; Do apply the default namespace, if any
1312    (let ((default-ns (assq '*DEFAULT* namespaces)))
1313      (if (and default-ns (cadr default-ns))
1314          (cons (cadr default-ns) unres-name)
1315          unres-name)))         ; no default namespace declared
1316   (else unres-name)))          ; no prefix, don't apply the default-ns
1317           
1318         
1319(run-test
1320 (let* ((namespaces
1321        '((HTML UHTML . URN-HTML)
1322          (HTML UHTML-1 . URN-HTML)
1323          (A    UHTML . URN-HTML)))
1324        (namespaces-def
1325         (cons
1326          '(*DEFAULT* DEF . URN-DEF) namespaces))
1327        (namespaces-undef
1328         (cons
1329          '(*DEFAULT* #f . #f) namespaces-def))
1330        (port (current-input-port)))
1331
1332   (assert (equal? 'ABC 
1333                   (ssax:resolve-name port 'ABC namespaces #t)))
1334   (assert (equal? '(DEF . ABC)
1335                   (ssax:resolve-name port 'ABC namespaces-def #t)))
1336   (assert (equal? 'ABC
1337                   (ssax:resolve-name port 'ABC namespaces-def #f)))
1338   (assert (equal? 'ABC
1339                   (ssax:resolve-name port 'ABC namespaces-undef #t)))
1340   (assert (equal? '(UHTML . ABC)
1341                   (ssax:resolve-name port '(HTML . ABC) namespaces-def #t)))
1342   (assert (equal? '(UHTML . ABC)
1343                   (ssax:resolve-name port '(HTML . ABC) namespaces-def #f)))
1344   (assert (equal? `(,ssax:Prefix-XML . space)
1345                   (ssax:resolve-name port 
1346                       `(,(string->symbol "xml") . space) namespaces-def #f)))
1347   (assert (failed?
1348                   (ssax:resolve-name port '(XXX . ABC) namespaces-def #f)))
1349))
1350
1351
1352; procedure+: ssax:uri-string->symbol URI-STR
1353; Convert a URI-STR to an appropriate symbol
1354(define (ssax:uri-string->symbol uri-str)
1355  (string->symbol uri-str))
1356
1357; procedure+: ssax:complete-start-tag TAG PORT ELEMS ENTITIES NAMESPACES
1358;
1359; This procedure is to complete parsing of a start-tag markup. The
1360; procedure must be called after the start tag token has been
1361; read. TAG is an UNRES-NAME. ELEMS is an instance of xml-decl::elems;
1362; it can be #f to tell the function to do _no_ validation of elements
1363; and their attributes.
1364;
1365; This procedure returns several values:
1366;  ELEM-GI: a RES-NAME.
1367;  ATTRIBUTES: element's attributes, an ATTLIST of (RES-NAME . STRING)
1368;       pairs. The list does NOT include xmlns attributes.
1369;  NAMESPACES: the input list of namespaces amended with namespace
1370;       (re-)declarations contained within the start-tag under parsing
1371;  ELEM-CONTENT-MODEL
1372
1373; On exit, the current position in PORT will be the first character after
1374; #\> that terminates the start-tag markup.
1375;
1376; Faults detected:
1377;       VC: XML-Spec.html#enum
1378;       VC: XML-Spec.html#RequiredAttr
1379;       VC: XML-Spec.html#FixedAttr
1380;       VC: XML-Spec.html#ValueType
1381;       WFC: XML-Spec.html#uniqattspec (after namespaces prefixes are resolved)
1382;       VC: XML-Spec.html#elementvalid
1383;       WFC: REC-xml-names/#dt-NSName
1384
1385; Note, although XML Recommendation does not explicitly say it,
1386; xmlns and xmlns: attributes don't have to be declared (although they
1387; can be declared, to specify their default value)
1388
1389; Procedure:  ssax:complete-start-tag tag-head port elems entities namespaces
1390(define ssax:complete-start-tag
1391
1392 (let ((xmlns (string->symbol "xmlns"))
1393       (largest-dummy-decl-attr (list ssax:largest-unres-name #f #f #f)))
1394
1395  ; Scan through the attlist and validate it, against decl-attrs
1396  ; Return an assoc list with added fixed or implied attrs.
1397  ; Note that both attlist and decl-attrs are ATTLISTs, and therefore,
1398  ; sorted
1399  (define (validate-attrs port attlist decl-attrs)
1400
1401    ; Check to see decl-attr is not of use type REQUIRED. Add
1402    ; the association with the default value, if any declared
1403    (define (add-default-decl decl-attr result)
1404      (let*-values
1405         (((attr-name content-type use-type default-value)
1406           (apply values decl-attr)))
1407         (and (eq? use-type 'REQUIRED)
1408              (parser-error port "[RequiredAttr] broken for" attr-name))
1409         (if default-value
1410             (cons (cons attr-name default-value) result)
1411             result)))
1412
1413    (let loop ((attlist attlist) (decl-attrs decl-attrs) (result '()))
1414      (if (attlist-null? attlist)
1415          (attlist-fold add-default-decl result decl-attrs)
1416          (let*-values
1417           (((attr attr-others)
1418             (attlist-remove-top attlist))
1419            ((decl-attr other-decls)
1420             (if (attlist-null? decl-attrs)
1421                 (values largest-dummy-decl-attr decl-attrs)
1422                 (attlist-remove-top decl-attrs)))
1423            )
1424           (case (name-compare (car attr) (car decl-attr))
1425             ((<) 
1426              (if (or (eq? xmlns (car attr))
1427                      (and (pair? (car attr)) (eq? xmlns (caar attr))))
1428                  (loop attr-others decl-attrs (cons attr result))
1429                  (parser-error port "[ValueType] broken for " attr)))
1430             ((>) 
1431              (loop attlist other-decls 
1432                    (add-default-decl decl-attr result)))
1433             (else      ; matched occurrence of an attr with its declaration
1434              (let*-values
1435               (((attr-name content-type use-type default-value)
1436                 (apply values decl-attr)))
1437               ; Run some tests on the content of the attribute
1438               (cond
1439                ((eq? use-type 'FIXED)
1440                 (or (equal? (cdr attr) default-value)
1441                     (parser-error port "[FixedAttr] broken for " attr-name)))
1442                ((eq? content-type 'CDATA) #t) ; everything goes
1443                ((pair? content-type)
1444                 (or (member (cdr attr) content-type)
1445                     (parser-error port "[enum] broken for " attr-name "="
1446                            (cdr attr))))
1447                (else
1448                 (ssax:warn port "declared content type " content-type
1449                       " not verified yet")))
1450               (loop attr-others other-decls (cons attr result)))))
1451           ))))
1452           
1453
1454  ; Add a new namespace declaration to namespaces.
1455  ; First we convert the uri-str to a uri-symbol and search namespaces for
1456  ; an association (_ user-prefix . uri-symbol).
1457  ; If found, we return the argument namespaces with an association
1458  ; (prefix user-prefix . uri-symbol) prepended.
1459  ; Otherwise, we prepend (prefix uri-symbol . uri-symbol)
1460  (define (add-ns port prefix uri-str namespaces)
1461    (and (equal? "" uri-str)
1462         (parser-error port "[dt-NSName] broken for " prefix))
1463    (let ((uri-symbol (ssax:uri-string->symbol uri-str)))
1464      (let loop ((nss namespaces))
1465        (cond
1466         ((null? nss)
1467          (cons (cons* prefix uri-symbol uri-symbol) namespaces))
1468         ((eq? uri-symbol (cddar nss))
1469          (cons (cons* prefix (cadar nss) uri-symbol) namespaces))
1470         (else (loop (cdr nss)))))))
1471     
1472  ; partition attrs into proper attrs and new namespace declarations
1473  ; return two values: proper attrs and the updated namespace declarations
1474  (define (adjust-namespace-decl port attrs namespaces)
1475    (let loop ((attrs attrs) (proper-attrs '()) (namespaces namespaces))
1476      (cond
1477       ((null? attrs) (values proper-attrs namespaces))
1478       ((eq? xmlns (caar attrs))        ; re-decl of the default namespace
1479        (loop (cdr attrs) proper-attrs 
1480              (if (equal? "" (cdar attrs))      ; un-decl of the default ns
1481                  (cons (cons* '*DEFAULT* #f #f) namespaces)
1482                  (add-ns port '*DEFAULT* (cdar attrs) namespaces))))
1483       ((and (pair? (caar attrs)) (eq? xmlns (caaar attrs)))
1484        (loop (cdr attrs) proper-attrs
1485              (add-ns port (cdaar attrs) (cdar attrs) namespaces)))
1486       (else
1487        (loop (cdr attrs) (cons (car attrs) proper-attrs) namespaces)))))
1488
1489    ; The body of the function
1490 (lambda (tag-head port elems entities namespaces)
1491  (let*-values
1492   (((attlist) (ssax:read-attributes port entities))
1493    ((empty-el-tag?)
1494     (begin
1495       (ssax:skip-S port)
1496       (and
1497        (eqv? #\/ 
1498              (assert-curr-char '(#\> #\/) "XML [40], XML [44], no '>'" port))
1499        (assert-curr-char '(#\>) "XML [44], no '>'" port))))
1500    ((elem-content decl-attrs)  ; see xml-decl for their type
1501     (if elems                  ; elements declared: validate!
1502         (cond
1503          ((assoc tag-head elems) =>
1504           (lambda (decl-elem)          ; of type xml-decl::decl-elem
1505             (values
1506              (if empty-el-tag? 'EMPTY-TAG (cadr decl-elem))
1507              (caddr decl-elem))))
1508          (else
1509           (parser-error port "[elementvalid] broken, no decl for " tag-head)))
1510         (values                ; non-validating parsing
1511          (if empty-el-tag? 'EMPTY-TAG 'ANY)
1512          #f)                   ; no attributes declared
1513         ))
1514    ((merged-attrs) (if decl-attrs (validate-attrs port attlist decl-attrs)
1515                      (attlist->alist attlist)))
1516    ((proper-attrs namespaces)
1517     (adjust-namespace-decl port merged-attrs namespaces))
1518    )
1519   ;(cerr "proper attrs: " proper-attrs nl)
1520   ; build the return value
1521   (values
1522    (ssax:resolve-name port tag-head namespaces #t)
1523    (fold-right
1524     (lambda (name-value attlist)
1525       (or
1526        (attlist-add attlist
1527           (cons (ssax:resolve-name port (car name-value) namespaces #f)
1528                 (cdr name-value)))
1529        (parser-error port "[uniqattspec] after NS expansion broken for " 
1530               name-value)))
1531     (make-empty-attlist)
1532     proper-attrs)
1533    namespaces
1534    elem-content)))))
1535
1536(run-test
1537 (let* ((urn-a (string->symbol "urn:a"))
1538        (urn-b (string->symbol "urn:b"))
1539        (urn-html (string->symbol "http://w3c.org/html"))
1540        (namespaces
1541         `((#f '"UHTML" . ,urn-html)
1542           ('"A"  '"UA" . ,urn-a)))
1543          (test
1544           (lambda (tag-head-name elems str)
1545             (call-with-input-string str
1546                (lambda (port)
1547                  (call-with-values
1548                      (lambda ()
1549                              (ssax:complete-start-tag
1550                               (call-with-input-string tag-head-name
1551                                      (lambda (port) (ssax:read-QName port)))
1552                               port
1553                               elems '() namespaces))
1554                    list))))))
1555
1556   ; First test with no validation of elements
1557   ;(test "TAG1" #f "")
1558   (assert (equal? `('"TAG1" () ,namespaces ANY)
1559                   (test "TAG1" #f ">")))
1560   (assert (equal? `('"TAG1" () ,namespaces EMPTY-TAG)
1561                   (test "TAG1" #f "/>")))
1562   (assert (equal? `('"TAG1" (('"HREF" . "a")) ,namespaces EMPTY-TAG)
1563                   (test "TAG1" #f "HREF='a'/>")))
1564   (assert (equal? `(('"UA" . '"TAG1") (('"HREF" . "a"))
1565                     ,(cons `(*DEFAULT* '"UA" . ,urn-a) namespaces) ANY)
1566                   (test "TAG1" #f "HREF='a' xmlns='urn:a'>")))
1567   (assert (equal? `('"TAG1" (('"HREF" . "a"))
1568                     ,(cons '(*DEFAULT* #f . #f) namespaces) ANY)
1569                   (test "TAG1" #f "HREF='a' xmlns=''>")))
1570   (assert (failed? (test "UA:TAG1" #f "HREF='a' xmlns=''/>")))
1571   (assert (equal? `(('"UA" . '"TAG1") ((('"UA" . '"HREF") . "a"))
1572                     ,(cons '(*DEFAULT* #f . #f) namespaces) ANY)
1573                   (test "A:TAG1" #f "A:HREF='a' xmlns=''>")))
1574   (assert (equal? `(('"UA" . '"TAG1") ((('"UA" . '"HREF") . "a"))
1575                     ,(cons `(*DEFAULT* ,urn-b . ,urn-b) namespaces) ANY)
1576                   (test "A:TAG1" #f "A:HREF='a' xmlns='urn:b'>")))
1577   (assert (failed? (test "B:TAG1" #f "A:HREF='a' xmlns:b=''/>")))
1578   (assert (equal? `((,urn-b . '"TAG1") ((('"UA" . '"HREF") . "a"))
1579                     ,(cons `('"B" ,urn-b . ,urn-b) namespaces) ANY)
1580                   (test "B:TAG1" #f "A:HREF='a' xmlns:B='urn:b'>")))
1581   (assert (equal? `((,urn-b . '"TAG1") ((('"UA" . '"HREF") . "a")
1582                                         ((,urn-b . '"SRC") . "b"))
1583                     ,(cons `('"B" ,urn-b . ,urn-b) namespaces) ANY)
1584                   (test "B:TAG1" #f 
1585                         "B:SRC='b' A:HREF='a' xmlns:B='urn:b'>")))
1586   (assert (equal? `((,urn-b . '"TAG1") ((('"UA" . '"HREF") . "a")
1587                                         ((,urn-b . '"HREF") . "b"))
1588                          ,(cons `('"B" ,urn-b . ,urn-b) namespaces) ANY)
1589                   (test "B:TAG1" #f 
1590                         "B:HREF=\"b\" A:HREF='a' xmlns:B='urn:b'>")))
1591   ; must be an error! Duplicate attr
1592   (assert (failed? (test "B:TAG1" #f
1593                          "HREF=\"b\" HREF='a' xmlns:B='urn:a'/>")))
1594   ; must be an error! Duplicate attr after ns expansion
1595   (assert (failed? (test "B:TAG1" #f 
1596                          "B:HREF=\"b\" A:HREF='a' xmlns:B='urn:a'/>")))
1597   (assert (equal? `(('"UA" . '"TAG1") (('"HREF" . "a")
1598                                        (('"UA" . '"HREF") . "b"))
1599                     ,(cons `(*DEFAULT* '"UA" . ,urn-a) namespaces) ANY)
1600                   (test "TAG1" #f 
1601                         "A:HREF=\"b\" HREF='a' xmlns='urn:a'>")))
1602   (assert (equal? `('"TAG1" ((('"UHTML" . '"HREF") . "a")
1603                              ((,urn-b . '"HREF") . "b"))
1604                     ,(append `(
1605                         ('"HTML" '"UHTML" . ,urn-html)
1606                         ('"B" ,urn-b . ,urn-b))
1607                              namespaces) ANY)
1608                   (test "TAG1" #f 
1609                         "B:HREF=\"b\" xmlns:B='urn:b' xmlns:HTML='http://w3c.org/html' HTML:HREF='a' >")))
1610
1611   ; Now test the validating parsing
1612   ; No decl for tag1
1613   (assert (failed? (test "TAG1" '((TAG2 ANY ()))
1614                          "B:HREF='b' xmlns:B='urn:b'>")))
1615   ; No decl for HREF elem
1616;;   (cond-expand
1617;;    ((not (or scm mit-scheme))        ; Regretfully, SCM treats '() as #f
1618;;     (assert (failed?
1619;;            (test "TAG1" '(('"TAG1" ANY ()))
1620;;                  "B:HREF='b' xmlns:B='urn:b'>"))))
1621;;    (else #t))
1622   ; No decl for HREF elem
1623   (assert (failed?
1624            (test "TAG1" '(('"TAG1" ANY (('"HREF1" CDATA IMPLIED #f))))
1625            "B:HREF='b' xmlns:B='urn:b'>")))
1626   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces EMPTY-TAG)
1627       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA REQUIRED #f))))
1628             "HREF='b'/>")))
1629   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces PCDATA)
1630       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA REQUIRED #f))))
1631             "HREF='b'>")))
1632   ; Req'd attribute not given error
1633   (assert (failed? 
1634            (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA REQUIRED #f))))
1635                  ">")))
1636   ; Wrong content-type of the attribute
1637   (assert (failed? 
1638       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" ("c") REQUIRED #f))))
1639             "HREF='b'>")))
1640   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces PCDATA)
1641       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" ("c" "b") IMPLIED #f))))
1642             "HREF='b'>")))
1643   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces PCDATA)
1644       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA IMPLIED "c"))))
1645             "HREF='b'>")))
1646   ; Bad fixed attribute
1647   (assert (failed? 
1648         (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA FIXED "c"))))
1649               "HREF='b'>")))
1650   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces PCDATA)
1651       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA FIXED "b"))))
1652             "HREF='b'>")))
1653   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces PCDATA)
1654       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA FIXED "b")))) ">")))
1655   (assert (equal? `('"TAG1" (('"HREF" . "b")) ,namespaces PCDATA)
1656       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA IMPLIED "b")))) ">")))
1657   (assert (equal? `('"TAG1" () ,namespaces PCDATA)
1658       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA IMPLIED #f)))) ">")))
1659   ; Undeclared attr
1660   (assert (failed? 
1661        (test "TAG1"
1662              '(('"TAG1" PCDATA ((('"A" . '"HREF") CDATA IMPLIED "c"))))
1663              "HREF='b'>")))
1664   (assert (equal? `('"TAG1" (('"HREF" . "b") (('"UA" . '"HREF") . "c"))
1665                          ,namespaces PCDATA)
1666       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA REQUIRED #f)
1667                                       (('"A" . '"HREF") CDATA IMPLIED "c"))))
1668             "HREF='b'>")))
1669   (assert (equal? `(('"UA" . '"TAG1")
1670                     (('"HREF" . "b") (('"UA" . '"HREF") . "c"))
1671                     ,namespaces PCDATA)
1672       (test "A:TAG1" '((('"A" . '"TAG1") PCDATA
1673                         (('"HREF" NMTOKEN REQUIRED #f)
1674                          (('"A" . '"HREF") CDATA IMPLIED "c"))))
1675             "HREF='b'>")))
1676   (assert (equal? `((,urn-b . '"TAG1") (('"HREF" . "b"))
1677                     ,(cons `('"B" ,urn-b . ,urn-b) namespaces) PCDATA)
1678       (test "B:TAG1" '((('"B" . '"TAG1") PCDATA (('"HREF" CDATA REQUIRED #f)
1679                           (('"xmlns" . '"B") CDATA IMPLIED "urn:b"))))
1680             "HREF='b'>")))
1681   (assert (equal? `((,urn-b . '"TAG1") (((,urn-b . '"HREF") . "b"))
1682                          ,(cons `('"B" ,urn-b . ,urn-b) namespaces) PCDATA)
1683       (test "B:TAG1" '((('"B" . '"TAG1") PCDATA
1684                         ((('"B" . '"HREF") CDATA REQUIRED #f)
1685                          (('"xmlns" . '"B") CDATA IMPLIED "urn:b"))))
1686             "B:HREF='b'>")))
1687   (assert (equal? `((,urn-b . '"TAG1") (('"HREF" . "b"))
1688                     ,(cons `(*DEFAULT* ,urn-b . ,urn-b) namespaces) PCDATA)
1689       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA REQUIRED #f)
1690                           ('"xmlns" CDATA IMPLIED "urn:b"))))
1691             "HREF='b'>")))
1692   ; xmlns not declared
1693   (assert (equal? `((,urn-b . '"TAG1") (('"HREF" . "b"))
1694                     ,(cons `(*DEFAULT* ,urn-b . ,urn-b) namespaces) PCDATA)
1695       (test "TAG1" '(('"TAG1" PCDATA (('"HREF" CDATA REQUIRED #f)
1696                           )))
1697             "HREF='b' xmlns='urn:b'>")))
1698   ; xmlns:B not declared
1699   (assert (equal? `((,urn-b . '"TAG1") (((,urn-b . '"HREF") . "b"))
1700                     ,(cons `('"B" ,urn-b . ,urn-b) namespaces) PCDATA)
1701       (test "B:TAG1" '((('"B" . '"TAG1") PCDATA
1702                         ((('"B" . '"HREF") CDATA REQUIRED #f)
1703                           )))
1704             "B:HREF='b' xmlns:B='urn:b'>")))
1705))
1706
1707; procedure+: ssax:read-external-id PORT
1708;
1709; This procedure parses an ExternalID production:
1710; [75] ExternalID ::= 'SYSTEM' S SystemLiteral
1711;               | 'PUBLIC' S PubidLiteral S SystemLiteral
1712; [11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
1713; [12] PubidLiteral ::=  '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
1714; [13] PubidChar ::=  #x20 | #xD | #xA | [a-zA-Z0-9]
1715;                         | [-'()+,./:=?;!*#@$_%]
1716;
1717; This procedure is supposed to be called when an ExternalID is expected;
1718; that is, the current character must be either #\S or #\P that start
1719; correspondingly a SYSTEM or PUBLIC token. This procedure returns the
1720; SystemLiteral as a string. A PubidLiteral is disregarded if present.
1721 
1722(define (ssax:read-external-id port)
1723  (let ((discriminator (ssax:read-NCName port)))
1724    (assert-curr-char ssax:S-chars "space after SYSTEM or PUBLIC" port)
1725    (ssax:skip-S port)
1726    (let ((delimiter 
1727          (assert-curr-char '(#\' #\" ) "XML [11], XML [12]" port)))
1728      (cond
1729        ((eq? discriminator (string->symbol "SYSTEM"))
1730          (begin0
1731            (next-token '() (list delimiter) "XML [11]" port)
1732            (read-char port)    ; reading the closing delim
1733            ))
1734         ((eq? discriminator (string->symbol "PUBLIC"))
1735           (skip-until (list delimiter) port)
1736           (assert-curr-char ssax:S-chars "space after PubidLiteral" port)
1737           (ssax:skip-S port)
1738           (let* ((delimiter 
1739                  (assert-curr-char '(#\' #\" ) "XML [11]" port))
1740                  (systemid
1741                    (next-token '() (list delimiter) "XML [11]" port)))
1742                (read-char port)        ; reading the closing delim
1743                systemid))
1744         (else
1745           (parser-error port "XML [75], " discriminator 
1746                  " rather than SYSTEM or PUBLIC"))))))
1747
1748
1749;-----------------------------------------------------------------------------
1750;                       Higher-level parsers and scanners
1751;
1752; They parse productions corresponding to the whole (document) entity
1753; or its higher-level pieces (prolog, root element, etc).
1754
1755
1756; Scan the Misc production in the context
1757; [1]  document ::=  prolog element Misc*
1758; [22] prolog ::= XMLDecl? Misc* (doctypedec l Misc*)?
1759; [27] Misc ::= Comment | PI |  S
1760;
1761; The following function should be called in the prolog or epilog contexts.
1762; In these contexts, whitespaces are completely ignored.
1763; The return value from ssax:scan-Misc is either a PI-token,
1764; a DECL-token, a START token, or EOF.
1765; Comments are ignored and not reported.
1766
1767(define (ssax:scan-Misc port)
1768  (let loop ((c (ssax:skip-S port)))
1769    (cond
1770      ((eof-object? c) c)
1771      ((not (char=? c #\<))
1772        (parser-error port "XML [22], char '" c "' unexpected"))
1773      (else
1774        (let ((token (ssax:read-markup-token port)))
1775          (case (xml-token-kind token)
1776            ((COMMENT) (loop (ssax:skip-S port)))
1777            ((PI DECL START) token)
1778            (else
1779              (parser-error port "XML [22], unexpected token of kind "
1780                     (xml-token-kind token)
1781                     ))))))))
1782
1783; procedure+: ssax:read-char-data PORT EXPECT-EOF? STR-HANDLER SEED
1784;
1785; This procedure is to read the character content of an XML document
1786; or an XML element.
1787; [43] content ::=
1788;       (element | CharData | Reference | CDSect | PI
1789;       | Comment)*
1790; To be more precise, the procedure reads CharData, expands CDSect
1791; and character entities, and skips comments. The procedure stops
1792; at a named reference, EOF, at the beginning of a PI or a start/end tag.
1793;
1794; port
1795;       a PORT to read
1796; expect-eof?
1797;       a boolean indicating if EOF is normal, i.e., the character
1798;       data may be terminated by the EOF. EOF is normal
1799;       while processing a parsed entity.
1800; str-handler
1801;       a STR-HANDLER
1802; seed
1803;       an argument passed to the first invocation of STR-HANDLER.
1804;
1805; The procedure returns two results: SEED and TOKEN.
1806; The SEED is the result of the last invocation of STR-HANDLER, or the
1807; original seed if STR-HANDLER was never called.
1808;
1809; TOKEN can be either an eof-object (this can happen only if
1810; expect-eof? was #t), or:
1811;     - an xml-token describing a START tag or an END-tag;
1812;       For a start token, the caller has to finish reading it.
1813;     - an xml-token describing the beginning of a PI. It's up to an
1814;       application to read or skip through the rest of this PI;
1815;     - an xml-token describing a named entity reference.
1816;
1817; CDATA sections and character references are expanded inline and
1818; never returned. Comments are silently disregarded.
1819;
1820; As the XML Recommendation requires, all whitespace in character data
1821; must be preserved. However, a CR character (#xD) must be disregarded
1822; if it appears before a LF character (#xA), or replaced by a #xA character
1823; otherwise. See Secs. 2.10 and 2.11 of the XML Recommendation. See also
1824; the canonical XML Recommendation.
1825
1826        ; ssax:read-char-data port expect-eof? str-handler seed
1827(define ssax:read-char-data
1828 (let
1829     ((terminators-usual (list #\< #\& char-return))
1830      (terminators-usual-eof (list #\< '*eof* #\& char-return))
1831
1832      (handle-fragment
1833       (lambda (fragment str-handler seed)
1834         (if (string-null? fragment) seed
1835             (str-handler fragment "" seed))))
1836      )
1837
1838   (lambda (port expect-eof? str-handler seed)
1839
1840     ; Very often, the first character we encounter is #\<
1841     ; Therefore, we handle this case in a special, fast path
1842     (if (eqv? #\< (peek-char port))
1843
1844         ; The fast path
1845         (let ((token (ssax:read-markup-token port)))
1846           (case (xml-token-kind token)
1847             ((START END)       ; The most common case
1848              (values seed token))
1849             ((CDSECT)
1850              (let ((seed (ssax:read-cdata-body port str-handler seed)))
1851                (ssax:read-char-data port expect-eof? str-handler seed)))
1852             ((COMMENT) (ssax:read-char-data port expect-eof?
1853                                             str-handler seed))
1854             (else
1855              (values seed token))))
1856
1857
1858         ; The slow path
1859         (let ((char-data-terminators
1860                (if expect-eof? terminators-usual-eof terminators-usual)))
1861
1862           (let loop ((seed seed))
1863             (let* ((fragment
1864                     (next-token '() char-data-terminators 
1865                                 "reading char data" port))
1866                    (term-char (peek-char port)) ; one of char-data-terminators
1867                    )
1868               (if (eof-object? term-char)
1869                   (values
1870                    (handle-fragment fragment str-handler seed)
1871                    term-char)
1872                   (case term-char
1873                     ((#\<)
1874                      (let ((token (ssax:read-markup-token port)))
1875                        (case (xml-token-kind token)
1876                          ((CDSECT)
1877                           (loop
1878                            (ssax:read-cdata-body port str-handler
1879                                (handle-fragment fragment str-handler seed))))
1880                          ((COMMENT)
1881                           (loop (handle-fragment fragment str-handler seed)))
1882                          (else
1883                           (values
1884                            (handle-fragment fragment str-handler seed)
1885                            token)))))
1886                     ((#\&)
1887                      (case (peek-next-char port)
1888                        ((#\#) (read-char port) 
1889                         (loop (str-handler fragment
1890                                       (string (ssax:read-char-ref port))
1891                                       seed)))
1892                        (else
1893                         (let ((name (ssax:read-NCName port)))
1894                           (assert-curr-char '(#\;) "XML [68]" port)
1895                           (values
1896                            (handle-fragment fragment str-handler seed)
1897                            (make-xml-token 'ENTITY-REF name))))))
1898                     (else              ; This must be a CR character
1899                      (if (eqv? (peek-next-char port) #\newline)
1900                          (read-char port))
1901                      (loop (str-handler fragment (string #\newline) seed))))
1902                   ))))))))
1903
1904
1905; a few lines of validation code
1906(run-test (letrec
1907  ((a-tag (make-xml-token 'START (string->symbol "BR")))
1908   (a-ref (make-xml-token 'ENTITY-REF (string->symbol "lt")))
1909   (eof-object (lambda () eof-object)) ; a unique value
1910   (str-handler (lambda (fragment foll-fragment seed)
1911     (if (string-null? foll-fragment) (cons fragment seed)
1912         (cons* foll-fragment fragment seed))))
1913   (test (lambda (str expect-eof? expected-data expected-token)
1914           (newline) (display "body: ") (write str)
1915           (newline) (display "Result: ")
1916          (let*-values
1917           (((seed token)
1918             (call-with-input-string (unesc-string str)
1919                (lambda (port)
1920                 (ssax:read-char-data port expect-eof? str-handler '()))))
1921            ((result) (reverse seed)))
1922           (write result)
1923           (display " ")
1924           (display token)
1925           (assert (equal? result (map unesc-string expected-data))
1926                   (if (eq? expected-token eof-object)
1927                     (eof-object? token)
1928                     (equal? token expected-token))))))
1929   )
1930  (test "" #t '() eof-object)
1931  (assert (failed? (test "" #f '() eof-object)))
1932  (test "  " #t '("  ") eof-object)
1933  (test "<BR/>" #f '() a-tag)
1934  (test " <BR  />" #f '(" ") a-tag)
1935
1936  (test " &lt;" #f '(" ") a-ref)
1937  (test " a&lt;" #f '(" a") a-ref)
1938  (test " a &lt;" #f '(" a ") a-ref)
1939
1940  (test " <!-- comment--> a  a<BR/>" #f '(" " " a  a") a-tag)
1941  (test " <!-- comment-->%ra  a<BR/>" #f '(" " "" "%n" "a  a") a-tag)
1942  (test " <!-- comment-->%r%na  a<BR/>" #f '(" " "" "%n" "a  a") a-tag)
1943  (test " <!-- comment-->%r%na%t%r%r%na<BR/>" #f
1944        '(" " "" "%n" "a%t" "%n" "" "%n" "a") a-tag)
1945  (test "a<!-- comment--> a  a<BR/>" #f '("a" " a  a") a-tag)
1946  (test "&#x21;<BR/>" #f '("" "!") a-tag)
1947  (test "&#x21;%n<BR/>" #f '("" "!" "%n") a-tag)
1948  (test "%t&#x21;%n<BR/>" #f '("%t" "!" "%n") a-tag)
1949  (test "%t&#x21;%na a<BR/>" #f '("%t" "!" "%na a") a-tag)
1950  (test "%t&#x21;%ra a<BR/>" #f '("%t" "!" "" "%n" "a a") a-tag)
1951  (test "%t&#x21;%r%na a<BR/>" #f '("%t" "!" "" "%n" "a a") a-tag)
1952
1953  (test " %ta &#x21;   b <BR/>" #f '(" %ta " "!" "   b ") a-tag)
1954  (test " %ta &#x20;   b <BR/>" #f '(" %ta " " " "   b ") a-tag)
1955
1956  (test "<![CDATA[<]]><BR/>" #f '("<") a-tag)
1957  (test "<![CDATA[]]]><BR/>" #f '("]") a-tag)
1958  (test "%t<![CDATA[<]]><BR/>" #f '("%t" "<") a-tag)
1959  (test "%t<![CDATA[<]]>a b<BR/>" #f '("%t" "<" "a b") a-tag)
1960  (test "%t<![CDATA[<]]>  a b<BR/>" #f '("%t" "<" "  a b") a-tag)
1961
1962  (test "%td <![CDATA[  <%r%r%n]]>  a b<BR/>" #f 
1963        '("%td " "  <" "%n" "" "%n" "  a b") a-tag)
1964))
1965
1966
1967
1968; procedure+: ssax:assert-token TOKEN KIND GI
1969; Make sure that TOKEN is of anticipated KIND and has anticipated GI
1970; Note GI argument may actually be a pair of two symbols, Namespace
1971; URI or the prefix, and of the localname.
1972; If the assertion fails, error-cont is evaluated by passing it
1973; three arguments: token kind gi. The result of error-cont is returned.
1974(define (ssax:assert-token token kind gi error-cont)
1975  (or
1976    (and (xml-token? token)
1977      (eq? kind (xml-token-kind token))
1978      (equal? gi (xml-token-head token)))
1979    (error-cont token kind gi)))
1980
1981;========================================================================
1982;               Highest-level parsers: XML to SXML
1983
1984; These parsers are a set of syntactic forms to instantiate a SSAX parser.
1985; A user can instantiate the parser to do the full validation, or
1986; no validation, or any particular validation. The user specifies
1987; which PI he wants to be notified about. The user tells what to do
1988; with the parsed character and element data. The latter handlers
1989; determine if the parsing follows a SAX or a DOM model.
1990
1991; syntax: ssax:make-pi-parser my-pi-handlers
1992; Create a parser to parse and process one Processing Element (PI).
1993
1994; my-pi-handlers
1995;       An assoc list of pairs (PI-TAG . PI-HANDLER)
1996;       where PI-TAG is an NCName symbol, the PI target, and
1997;       PI-HANDLER is a procedure PORT PI-TAG SEED
1998;       where PORT points to the first symbol after the PI target.
1999;       The handler should read the rest of the PI up to and including
2000;       the combination '?>' that terminates the PI. The handler should
2001;       return a new seed.
2002;       One of the PI-TAGs may be the symbol *DEFAULT*. The corresponding
2003;       handler will handle PIs that no other handler will. If the
2004;       *DEFAULT* PI-TAG is not specified, ssax:make-pi-parser will assume
2005;       the default handler that skips the body of the PI
2006;       
2007; The output of the ssax:make-pi-parser is a procedure
2008;       PORT PI-TAG SEED
2009; that will parse the current PI according to the user-specified handlers.
2010;
2011; The previous version of ssax:make-pi-parser was a low-level macro:
2012; (define-macro ssax:make-pi-parser
2013;   (lambda (my-pi-handlers)
2014;   `(lambda (port target seed)
2015;     (case target
2016;       ; Generate the body of the case statement
2017;       ,@(let loop ((pi-handlers my-pi-handlers) (default #f))
2018;        (cond
2019;         ((null? pi-handlers)
2020;          (if default `((else (,default port target seed)))
2021;              '((else
2022;                 (ssax:warn port "Skipping PI: " target nl)
2023;                 (ssax:skip-pi port)
2024;                 seed))))
2025;         ((eq? '*DEFAULT* (caar pi-handlers))
2026;          (loop (cdr pi-handlers) (cdar pi-handlers)))
2027;         (else
2028;          (cons
2029;           `((,(caar pi-handlers)) (,(cdar pi-handlers) port target seed))
2030;           (loop (cdr pi-handlers) default)))))))))
2031
2032(define-syntax ssax:make-pi-parser
2033  (syntax-rules ()
2034    ((ssax:make-pi-parser orig-handlers)
2035     (letrec-syntax
2036        ; Generate the clauses of the case statement
2037      ((loop
2038         (syntax-rules (*DEFAULT*)
2039           ((loop () #f accum port target seed)         ; no default
2040            (make-case 
2041              ((else
2042                 (ssax:warn port "Skipping PI: " target nl)
2043                 (ssax:skip-pi port)
2044                 seed)
2045                . accum)
2046              () target))
2047           ((loop () default accum port target seed)
2048            (make-case 
2049              ((else (default port target seed)) . accum)
2050              () target))
2051           ((loop ((*DEFAULT* . default) . handlers) old-def accum
2052              port target seed)
2053            (loop handlers default accum port target seed))
2054           ((loop ((tag . handler) . handlers) default accum port target seed)
2055            (loop handlers default
2056              (((tag) (handler port target seed)) . accum)
2057              port target seed))
2058           ))
2059        (make-case                      ; Reverse the clauses, make the 'case'
2060          (syntax-rules ()
2061            ((make-case () clauses target)
2062             (case target . clauses))
2063            ((make-case (clause . clauses) accum target)
2064             (make-case clauses (clause . accum) target)))
2065          ))
2066      (lambda (port target seed)
2067        (loop orig-handlers #f () port target seed))
2068       ))))
2069
2070(run-test
2071 (pp (ssax:make-pi-parser ()))
2072 (pp (ssax:make-pi-parser ((xml . (lambda (port target seed) seed)))))
2073 (pp (ssax:make-pi-parser ((xml . (lambda (port target seed) seed))
2074                           (html . list)
2075                           (*DEFAULT* . ssax:warn))))
2076)
2077
2078; syntax: ssax:make-elem-parser my-new-level-seed my-finish-element
2079;                               my-char-data-handler my-pi-handlers
2080
2081; Create a parser to parse and process one element, including its
2082; character content or children elements. The parser is typically
2083; applied to the root element of a document.
2084
2085; my-new-level-seed
2086;       procedure ELEM-GI ATTRIBUTES NAMESPACES EXPECTED-CONTENT SEED
2087;               where ELEM-GI is a RES-NAME of the element
2088;               about to be processed.
2089;       This procedure is to generate the seed to be passed
2090;       to handlers that process the content of the element.
2091;       This is the function identified as 'fdown' in the denotational
2092;       semantics of the XML parser given in the title comments to this
2093;       file.
2094;
2095; my-finish-element
2096;       procedure ELEM-GI ATTRIBUTES NAMESPACES PARENT-SEED SEED
2097;       This procedure is called when parsing of ELEM-GI is finished.
2098;       The SEED is the result from the last content parser (or
2099;       from my-new-level-seed if the element has the empty content).
2100;       PARENT-SEED is the same seed as was passed to my-new-level-seed.
2101;       The procedure is to generate a seed that will be the result
2102;       of the element parser.
2103;       This is the function identified as 'fup' in the denotational
2104;       semantics of the XML parser given in the title comments to this
2105;       file.
2106;
2107; my-char-data-handler
2108;       A STR-HANDLER
2109;
2110; my-pi-handlers
2111;       See ssax:make-pi-handler above
2112;
2113
2114; The generated parser is a
2115;       procedure START-TAG-HEAD PORT ELEMS ENTITIES
2116;       NAMESPACES PRESERVE-WS? SEED
2117; The procedure must be called after the start tag token has been
2118; read. START-TAG-HEAD is an UNRES-NAME from the start-element tag.
2119; ELEMS is an instance of xml-decl::elems.
2120; See ssax:complete-start-tag::preserve-ws?
2121
2122; Faults detected:
2123;       VC: XML-Spec.html#elementvalid
2124;       WFC: XML-Spec.html#GIMatch
2125
2126
2127(define-syntax ssax:make-elem-parser
2128  (syntax-rules ()
2129    ((ssax:make-elem-parser my-new-level-seed my-finish-element
2130                            my-char-data-handler my-pi-handlers)
2131 
2132   (lambda (start-tag-head port elems entities namespaces
2133                           preserve-ws? seed)
2134
2135     (define xml-space-gi (cons ssax:Prefix-XML
2136                                (string->symbol "space")))
2137
2138     (let handle-start-tag ((start-tag-head start-tag-head)
2139                            (port port) (entities entities)
2140                            (namespaces namespaces)
2141                            (preserve-ws? preserve-ws?) (parent-seed seed))
2142       (let*-values
2143        (((elem-gi attributes namespaces expected-content)
2144          (ssax:complete-start-tag start-tag-head port elems
2145                                   entities namespaces))
2146         ((seed)
2147          (my-new-level-seed elem-gi attributes
2148                              namespaces expected-content parent-seed)))
2149        (case expected-content
2150          ((EMPTY-TAG)
2151           (my-finish-element
2152            elem-gi attributes namespaces parent-seed seed))
2153          ((EMPTY)              ; The end tag must immediately follow
2154           (ssax:assert-token 
2155            (and (eqv? #\< (ssax:skip-S port)) (ssax:read-markup-token port))
2156            'END  start-tag-head
2157            (lambda (token exp-kind exp-head)
2158              (parser-error port "[elementvalid] broken for " token 
2159                     " while expecting "
2160                     exp-kind exp-head)))
2161           (my-finish-element
2162            elem-gi attributes namespaces parent-seed seed))
2163          (else         ; reading the content...
2164           (let ((preserve-ws?  ; inherit or set the preserve-ws? flag
2165                  (cond
2166                   ((assoc xml-space-gi attributes) =>
2167                    (lambda (name-value)
2168                      (equal? "preserve" (cdr name-value))))
2169                   (else preserve-ws?))))
2170             (let loop ((port port) (entities entities)
2171                        (expect-eof? #f) (seed seed))
2172               (let*-values
2173                (((seed term-token)
2174                  (ssax:read-char-data port expect-eof?
2175                                       my-char-data-handler seed)))
2176                (if (eof-object? term-token)
2177                    seed
2178                    (case (xml-token-kind term-token)
2179                      ((END)
2180                       (ssax:assert-token term-token 'END  start-tag-head
2181                          (lambda (token exp-kind exp-head)
2182                            (parser-error port "[GIMatch] broken for "
2183                                   term-token " while expecting "
2184                                   exp-kind exp-head)))
2185                       (my-finish-element
2186                        elem-gi attributes namespaces parent-seed seed))
2187                      ((PI)
2188                       (let ((seed 
2189                          ((ssax:make-pi-parser my-pi-handlers)
2190                           port (xml-token-head term-token) seed)))
2191                         (loop port entities expect-eof? seed)))
2192                      ((ENTITY-REF)
2193                       (let ((seed
2194                              (ssax:handle-parsed-entity
2195                               port (xml-token-head term-token)
2196                               entities
2197                               (lambda (port entities seed)
2198                                 (loop port entities #t seed))
2199                               my-char-data-handler
2200                               seed))) ; keep on reading the content after ent
2201                         (loop port entities expect-eof? seed)))
2202                      ((START)          ; Start of a child element
2203                       (if (eq? expected-content 'PCDATA)
2204                           (parser-error port "[elementvalid] broken for "
2205                                  elem-gi
2206                                  " with char content only; unexpected token "
2207                                  term-token))
2208                           ; Do other validation of the element content
2209                           (let ((seed
2210                                  (handle-start-tag
2211                                     (xml-token-head term-token)
2212                                     port entities namespaces
2213                                     preserve-ws? seed)))
2214                             (loop port entities expect-eof? seed)))
2215                      (else
2216                       (parser-error port "XML [43] broken for "
2217                                     term-token))))))))
2218          )))
2219))))
2220
2221
2222; syntax: ssax:make-parser user-handler-tag user-handler-proc ...
2223;
2224; Create an XML parser, an instance of the XML parsing framework.
2225; This will be a SAX, a DOM, or a specialized parser depending
2226; on the supplied user-handlers.
2227
2228; user-handler-tag is a symbol that identifies a procedural expression
2229; that follows the tag. Given below are tags and signatures of the
2230; corresponding procedures. Not all tags have to be specified. If some
2231; are omitted, reasonable defaults will apply.
2232;
2233
2234; tag: DOCTYPE
2235; handler-procedure: PORT DOCNAME SYSTEMID INTERNAL-SUBSET? SEED
2236; If internal-subset? is #t, the current position in the port
2237; is right after we have read #\[ that begins the internal DTD subset.
2238; We must finish reading of this subset before we return
2239; (or must call skip-internal-subset if we aren't interested in reading it).
2240; The port at exit must be at the first symbol after the whole
2241; DOCTYPE declaration.
2242; The handler-procedure must generate four values:
2243;       ELEMS ENTITIES NAMESPACES SEED
2244; See xml-decl::elems for ELEMS. It may be #f to switch off the validation.
2245; NAMESPACES will typically contain USER-PREFIXes for selected URI-SYMBs.
2246; The default handler-procedure skips the internal subset,
2247; if any, and returns (values #f '() '() seed)
2248
2249; tag: UNDECL-ROOT
2250; handler-procedure: ELEM-GI SEED
2251; where ELEM-GI is an UNRES-NAME of the root element. This procedure
2252; is called when an XML document under parsing contains _no_ DOCTYPE
2253; declaration.
2254; The handler-procedure, as a DOCTYPE handler procedure above,
2255; must generate four values:
2256;       ELEMS ENTITIES NAMESPACES SEED
2257; The default handler-procedure returns (values #f '() '() seed)
2258
2259; tag: DECL-ROOT
2260; handler-procedure: ELEM-GI SEED
2261; where ELEM-GI is an UNRES-NAME of the root element. This procedure
2262; is called when an XML document under parsing does contains the DOCTYPE
2263; declaration.
2264; The handler-procedure must generate a new SEED (and verify
2265; that the name of the root element matches the doctype, if the handler
2266; so wishes).
2267; The default handler-procedure is the identity function.
2268
2269; tag: NEW-LEVEL-SEED
2270; handler-procedure: see ssax:make-elem-parser, my-new-level-seed
2271
2272; tag: FINISH-ELEMENT
2273; handler-procedure: see ssax:make-elem-parser, my-finish-element
2274
2275; tag: CHAR-DATA-HANDLER
2276; handler-procedure: see ssax:make-elem-parser, my-char-data-handler
2277
2278; tag: PI
2279; handler-procedure: see ssax:make-pi-parser
2280; The default value is '()
2281 
2282; The generated parser is a
2283;       procedure PORT SEED
2284
2285; This procedure parses the document prolog and then exits to
2286; an element parser (created by ssax:make-elem-parser) to handle
2287; the rest.
2288;
2289; [1]  document ::=  prolog element Misc*
2290; [22] prolog ::= XMLDecl? Misc* (doctypedec | Misc*)?
2291; [27] Misc ::= Comment | PI |  S
2292;
2293; [28] doctypedecl ::=  '<!DOCTYPE' S Name (S ExternalID)? S?
2294;                       ('[' (markupdecl | PEReference | S)* ']' S?)? '>'
2295; [29] markupdecl ::= elementdecl | AttlistDecl
2296;                      | EntityDecl
2297;                      | NotationDecl | PI
2298;                      | Comment
2299;
2300
2301
2302; This is ssax:make-parser with all the (specialization) handlers given
2303; as positional arguments. It is called by ssax:make-parser, see below
2304(define-syntax ssax:make-parser/positional-args
2305  (syntax-rules ()
2306    ((ssax:make-parser/positional-args
2307       *handler-DOCTYPE
2308       *handler-UNDECL-ROOT
2309       *handler-DECL-ROOT
2310       *handler-NEW-LEVEL-SEED
2311       *handler-FINISH-ELEMENT
2312       *handler-CHAR-DATA-HANDLER
2313       *handler-PI)
2314  (lambda (port seed)
2315
2316     ; We must've just scanned the DOCTYPE token
2317     ; Handle the doctype declaration and exit to
2318     ; scan-for-significant-prolog-token-2, and eventually, to the
2319     ; element parser.
2320     (define (handle-decl port token-head seed)
2321       (or (eq? (string->symbol "DOCTYPE") token-head)
2322           (parser-error port "XML [22], expected DOCTYPE declaration, found "
2323                  token-head))
2324       (assert-curr-char ssax:S-chars "XML [28], space after DOCTYPE" port)
2325       (ssax:skip-S port)
2326       (let*-values
2327        (((docname) (ssax:read-QName port))
2328         ((systemid)
2329          (and (ssax:ncname-starting-char? (ssax:skip-S port))
2330               (ssax:read-external-id port)))
2331         ((internal-subset?)
2332          (begin (ssax:skip-S port)
2333            (eqv? #\[ (assert-curr-char '(#\> #\[)
2334                                        "XML [28], end-of-DOCTYPE" port))))
2335         ((elems entities namespaces seed)
2336          (*handler-DOCTYPE port docname systemid
2337                            internal-subset? seed))
2338         )
2339        (scan-for-significant-prolog-token-2 port elems entities namespaces
2340                                             seed)))
2341
2342
2343     ; Scan the leading PIs until we encounter either a doctype declaration
2344     ; or a start token (of the root element)
2345     ; In the latter two cases, we exit to the appropriate continuation
2346     (define (scan-for-significant-prolog-token-1 port seed)
2347       (let ((token (ssax:scan-Misc port)))
2348         (if (eof-object? token)
2349             (parser-error port "XML [22], unexpected EOF")
2350             (case (xml-token-kind token)
2351               ((PI)
2352                (let ((seed 
2353                       ((ssax:make-pi-parser *handler-PI)
2354                        port (xml-token-head token) seed)))
2355                  (scan-for-significant-prolog-token-1 port seed)))
2356               ((DECL) (handle-decl port (xml-token-head token) seed))
2357               ((START)
2358                (let*-values
2359                 (((elems entities namespaces seed)
2360                   (*handler-UNDECL-ROOT (xml-token-head token) seed)))
2361                 (element-parser (xml-token-head token) port elems
2362                                 entities namespaces #f seed)))
2363               (else (parser-error port "XML [22], unexpected markup "
2364                                   token))))))
2365
2366
2367     ; Scan PIs after the doctype declaration, till we encounter
2368     ; the start tag of the root element. After that we exit
2369     ; to the element parser
2370     (define (scan-for-significant-prolog-token-2 port elems entities
2371                                                  namespaces seed)
2372       (let ((token (ssax:scan-Misc port)))
2373         (if (eof-object? token)
2374             (parser-error port "XML [22], unexpected EOF")
2375             (case (xml-token-kind token)
2376               ((PI)
2377                (let ((seed 
2378                       ((ssax:make-pi-parser *handler-PI)
2379                        port (xml-token-head token) seed)))
2380                  (scan-for-significant-prolog-token-2 port elems entities
2381                                                       namespaces seed)))
2382               ((START)
2383                (element-parser (xml-token-head token) port elems
2384                  entities namespaces #f
2385                  (*handler-DECL-ROOT (xml-token-head token) seed)))
2386               (else (parser-error port "XML [22], unexpected markup "
2387                                   token))))))
2388
2389
2390     ; A procedure start-tag-head port elems entities namespaces
2391     ;           preserve-ws? seed
2392     (define element-parser
2393       (ssax:make-elem-parser *handler-NEW-LEVEL-SEED
2394                              *handler-FINISH-ELEMENT
2395                              *handler-CHAR-DATA-HANDLER
2396                              *handler-PI))
2397
2398     ; Get the ball rolling ...
2399     (scan-for-significant-prolog-token-1 port seed)
2400))))
2401
2402
2403
2404; The following meta-macro turns a regular macro (with positional
2405; arguments) into a form with keyword (labeled) arguments.  We later
2406; use the meta-macro to convert ssax:make-parser/positional-args into
2407; ssax:make-parser. The latter provides a prettier (with labeled
2408; arguments and defaults) interface to
2409; ssax:make-parser/positional-args
2410;
2411; ssax:define-labeled-arg-macro LABELED-ARG-MACRO-NAME
2412;               (POS-MACRO-NAME ARG-DESCRIPTOR ...)
2413; expands into the definition of a macro
2414;       LABELED-ARG-MACRO-NAME KW-NAME KW-VALUE KW-NAME1 KW-VALUE1 ...
2415; which, in turn, expands into
2416;       POS-MACRO-NAME ARG1 ARG2 ...
2417; where each ARG1 etc. comes either from KW-VALUE or from
2418; the deafult part of ARG-DESCRIPTOR. ARG1 corresponds to the first
2419; ARG-DESCRIPTOR, ARG2 corresponds to the second descriptor, etc.
2420; Here ARG-DESCRIPTOR describes one argument of the positional macro.
2421; It has the form
2422;       (ARG-NAME DEFAULT-VALUE)
2423; or
2424;       (ARG-NAME)
2425; In the latter form, the default value is not given, so that the invocation of
2426; LABELED-ARG-MACRO-NAME must mention the corresponding parameter.
2427; ARG-NAME can be anything: an identifier, a string, or even a number.
2428
2429
2430(define-syntax ssax:define-labeled-arg-macro
2431  (syntax-rules ()
2432    ((ssax:define-labeled-arg-macro
2433       labeled-arg-macro-name
2434       (positional-macro-name
2435         (arg-name . arg-def) ...))
2436      (define-syntax labeled-arg-macro-name
2437        (syntax-rules ()
2438          ((labeled-arg-macro-name . kw-val-pairs)
2439            (letrec-syntax
2440              ((find 
2441                 (syntax-rules (arg-name ...)
2442                   ((find k-args (arg-name . default) arg-name
2443                      val . others)        ; found arg-name among kw-val-pairs
2444                    (next val . k-args)) ...
2445                   ((find k-args key arg-no-match-name val . others)
2446                     (find k-args key . others))
2447                   ((find k-args (arg-name default)) ; default must be here
2448                     (next default . k-args)) ...
2449                   ))
2450                (next                   ; pack the continuation to find
2451                  (syntax-rules ()
2452                    ((next val vals key . keys)
2453                      (find ((val . vals) . keys) key . kw-val-pairs))
2454                    ((next val vals)    ; processed all arg-descriptors
2455                      (rev-apply (val) vals))))
2456                (rev-apply
2457                  (syntax-rules ()
2458                    ((rev-apply form (x . xs))
2459                      (rev-apply (x . form) xs))
2460                    ((rev-apply form ()) form))))
2461              (next positional-macro-name () 
2462                (arg-name . arg-def) ...))))))))
2463
2464
2465; The definition of ssax:make-parser
2466(ssax:define-labeled-arg-macro ssax:make-parser
2467  (ssax:make-parser/positional-args
2468    (DOCTYPE
2469      (lambda (port docname systemid internal-subset? seed)
2470        (when internal-subset?
2471          (ssax:warn port "Internal DTD subset is not currently handled ")
2472          (ssax:skip-internal-dtd port))
2473        (ssax:warn port "DOCTYPE DECL " docname " " 
2474          systemid " found and skipped")
2475        (values #f '() '() seed)
2476        ))
2477    (UNDECL-ROOT
2478      (lambda (elem-gi seed) (values #f '() '() seed)))
2479    (DECL-ROOT
2480      (lambda (elem-gi seed) seed))
2481    (NEW-LEVEL-SEED)            ; required
2482    (FINISH-ELEMENT)            ; required
2483    (CHAR-DATA-HANDLER)         ; required
2484    (PI ())
2485    ))
2486
2487(run-test
2488 (letrec ((simple-parser
2489           (lambda (str doctype-fn)
2490             (call-with-input-string str
2491                 (lambda (port)
2492                   ((ssax:make-parser
2493                     NEW-LEVEL-SEED 
2494                     (lambda (elem-gi attributes namespaces
2495                                      expected-content seed)
2496                       '())
2497   
2498                     FINISH-ELEMENT
2499                     (lambda (elem-gi attributes namespaces parent-seed seed)
2500                       (let
2501                           ((seed (if (null? namespaces) (reverse seed)
2502                                      (cons (list '*NAMESPACES* namespaces)
2503                                            (reverse seed)))))
2504                         (let ((seed (if (attlist-null? attributes) seed
2505                                         (cons
2506                                          (cons '@ 
2507                                           (map (lambda (attr)
2508                                              (list (car attr) (cdr attr)))
2509                                                (attlist->alist attributes)))
2510                                          seed))))
2511                           (cons (cons elem-gi seed) parent-seed))))
2512
2513                     CHAR-DATA-HANDLER
2514                     (lambda (string1 string2 seed)
2515                       (if (string-null? string2) (cons string1 seed)
2516                           (cons* string2 string1 seed)))
2517
2518                     DOCTYPE
2519                     (lambda (port docname systemid internal-subset? seed)
2520                       (when internal-subset?
2521                          (ssax:warn port
2522                            "Internal DTD subset is not currently handled ")
2523                          (ssax:skip-internal-dtd port))
2524                       (ssax:warn port "DOCTYPE DECL " docname " "
2525                             systemid " found and skipped")
2526                       (doctype-fn docname seed))
2527
2528                     UNDECL-ROOT
2529                     (lambda (elem-gi seed)
2530                       (doctype-fn elem-gi seed))
2531                     )
2532                    port '())))))
2533
2534          (dummy-doctype-fn (lambda (elem-gi seed) (values #f '() '() seed)))
2535          (test
2536           (lambda (str doctype-fn expected)
2537             (cout nl "Parsing: " str nl)
2538             (let ((result (simple-parser (unesc-string str) doctype-fn)))
2539               (write result)
2540               (assert (equal? result expected)))))
2541          )
2542
2543   (test "<BR/>" dummy-doctype-fn '(('"BR")))
2544   (assert (failed? (test "<BR>" dummy-doctype-fn '())))
2545   (test "<BR></BR>" dummy-doctype-fn '(('"BR")))
2546   (assert (failed? (test "<BR></BB>" dummy-doctype-fn '())))
2547
2548   (test "   <A HREF='URL'> link <I>itlink </I> &amp;amp;</A>"
2549         dummy-doctype-fn 
2550         '(('"A" (@ ('"HREF" "URL")) " link " ('"I" "itlink ")
2551            " " "&" "amp;")))
2552
2553   (test
2554      "   <A HREF='URL' xml:space='preserve'> link <I>itlink </I> &amp;amp;</A>" dummy-doctype-fn 
2555      '(('"A" (@ ('"HREF" "URL") (('"xml" . '"space") "preserve"))
2556           " link " ('"I" "itlink ") " " "&" "amp;")))
2557
2558   (test "   <A HREF='URL' xml:space='preserve'> link <I xml:space='default'>itlink </I> &amp;amp;</A>" dummy-doctype-fn
2559         '(('"A" (@ ('"HREF" "URL") (('"xml" . '"space") "preserve"))
2560              " link "
2561              ('"I" (@ (('"xml" . '"space") "default")) "itlink ")
2562              " " "&" "amp;")))
2563   (test "<itemize><item>This   is item 1 </item>%n<!-- Just:a comment --><item>Item 2</item>%n </itemize>" dummy-doctype-fn 
2564         `(('"itemize" ('"item" "This   is item 1 ")
2565            ,(unesc-string "%n") ('"item" "Item 2") ,(unesc-string "%n "))))
2566  (test " <P><![CDATA[<BR>%n<![CDATA[<BR>]]&gt;]]></P>"
2567        dummy-doctype-fn  `(('"P" "<BR>" ,nl "<![CDATA[<BR>" "]]" "" ">")))
2568
2569  (test " <P><![CDATA[<BR>%r<![CDATA[<BR>]]&gt;]]></P>"
2570        dummy-doctype-fn `(('"P" "<BR>" ,nl "<![CDATA[<BR>" "]]" "" ">")))
2571
2572  (test "<?xml version='1.0'?>%n%n<Reports TStamp='1'></Reports>"
2573        dummy-doctype-fn '(('"Reports" (@ ('"TStamp" "1")))))
2574  (test "%n<?PI xxx?><!-- Comment %n -%r-->%n<?PI1 zzz?><T/>" 
2575        dummy-doctype-fn '(('"T")))
2576  (test "<!DOCTYPE T SYSTEM 'system1' ><!-- comment -->%n<T/>"
2577        (lambda (elem-gi seed) (assert (equal? elem-gi ''"T"))
2578                (values #f '() '() seed))
2579        '(('"T")))
2580  (test "<!DOCTYPE T PUBLIC '//EN/T' \"system1\" [ <!ELEMENT a 'aa'> ]>%n<?pi?><T/>" 
2581        (lambda (elem-gi seed) (assert (equal? elem-gi ''"T"))
2582                (values #f '() '() seed))
2583        '(('"T")))
2584  (test "<BR/>"
2585        (lambda (elem-gi seed)
2586          (values '(('"BR" EMPTY ())) '() '() seed)) '(('"BR")))
2587  (test "<BR></BR>"
2588        (lambda (elem-gi seed)
2589          (values '(('"BR" EMPTY ())) '() '() seed)) '(('"BR")))
2590  (assert (failed? (test "<BR>aa</BR>"
2591        (lambda (elem-gi seed)
2592          (values '(('"BR" EMPTY ())) '() '() seed)) '())))
2593  (test "<BR>aa</BR>"
2594        (lambda (elem-gi seed)
2595          (values '(('"BR" PCDATA ())) '() '() seed)) '(('"BR" "aa")))
2596  (assert (failed? (test "<BR>a<I>a</I></BR>"
2597        (lambda (elem-gi seed)
2598          (values '(('"BR" PCDATA ())) '() '() seed)) '())))
2599  (test "<BR>a<I>a</I></BR>"
2600        (lambda (elem-gi seed)
2601          (values '(('"BR" ANY ()) ('"I" PCDATA ())) '() '() seed))
2602          '(('"BR" "a" ('"I" "a"))))
2603
2604
2605  (test "<DIV>Example: \"&example;\"</DIV>"
2606        (lambda (elem-gi seed)
2607          (values #f '((example . "<P>An    ampersand (&#38;) may   be escaped numerically (&#38;#38;) or with a general entity (&amp;amp;).</P>")) '() seed))
2608        '(('"DIV" "Example: \""
2609           ('"P" "An    ampersand (" "&" ") may   be escaped numerically (" "&" "#38;) or with a general entity (" "&" "amp;).") "\"")))
2610 (test "<DIV>Example: \"&example;\" <P/></DIV>"
2611        (lambda (elem-gi seed)
2612          (values #f '(('"quote" . "<I>example:</I> ex")
2613                       ('"example" . "<Q>&quote;!</Q>?")) '() seed))
2614          '(('"DIV" "Example: \"" ('"Q" ('"I" "example:") " ex" "!") "?"
2615                 "\" "  ('"P"))))
2616 (assert (failed?
2617   (test "<DIV>Example: \"&example;\" <P/></DIV>"
2618        (lambda (elem-gi seed)
2619          (values #f '(('"quote" . "<I>example:")
2620                       ('"example" . "<Q>&quote;</I>!</Q>?")) '() seed))
2621        '())))
2622
2623 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2624        (lambda (elem-gi seed)
2625          (values #f '() '() seed))
2626       '((('"URI1" . '"DIV") (@ ('"B" "B") (('"URI1" . '"B") "A"))
2627          (*NAMESPACES* (('"A" '"URI1" . '"URI1")
2628                         (*DEFAULT* '"URI1" . '"URI1")))
2629          (('"URI1" . '"P")
2630           (*NAMESPACES* ((*DEFAULT* #f . #f) ('"A" '"URI1" . '"URI1")
2631                          (*DEFAULT* '"URI1" . '"URI1")))
2632           ('"BR"
2633            (*NAMESPACES* ((*DEFAULT* #f . #f)
2634                           ('"A" '"URI1" . '"URI1")
2635                           (*DEFAULT* '"URI1" . '"URI1"))))))))
2636 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2637        (lambda (elem-gi seed)
2638          (values #f '() '((#f '"UA" . '"URI1")) seed))
2639       '((('"UA" . '"DIV") (@ ('"B" "B") (('"UA" . '"B") "A"))
2640          (*NAMESPACES* (('"A" '"UA" . '"URI1")
2641                         (*DEFAULT* '"UA" . '"URI1") (#f '"UA" . '"URI1")))
2642          (('"UA" . '"P")
2643           (*NAMESPACES* ((*DEFAULT* #f . #f) ('"A" '"UA" . '"URI1")
2644                          (*DEFAULT* '"UA" . '"URI1") (#f '"UA" . '"URI1")))
2645           ('"BR"
2646            (*NAMESPACES* ((*DEFAULT* #f . #f) ('"A" '"UA" . '"URI1")
2647                           (*DEFAULT* '"UA" . '"URI1")
2648                           (#f '"UA" . '"URI1"))))))))
2649 ; uniqattr should fail
2650 (assert (failed?
2651 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2652        (lambda (elem-gi seed)
2653          (values
2654           `(('"DIV" ANY (('"B" CDATA IMPLIED #f)
2655                       (('"A" . '"B") CDATA IMPLIED #f)
2656                       (('"C" . '"B") CDATA IMPLIED "xx")
2657                       (('"xmlns" . '"C") CDATA IMPLIED "URI1")
2658                       ))
2659             (('"A" . '"P") ANY ()) ('"BR" '"EMPTY" ()))
2660           '() '((#f '"UA" . '"URI1")) seed))
2661        '())))
2662 ; prefix C undeclared
2663 (assert (failed?
2664 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2665        (lambda (elem-gi seed)
2666          (values
2667           '(('"DIV" ANY (('"B" CDATA IMPLIED #f)
2668                       ('"xmlns"  CDATA IMPLIED "URI1")
2669                       (('"A" . '"B") CDATA IMPLIED #f)
2670                       (('"C" . '"B") CDATA IMPLIED "xx")
2671                       ))
2672             (('"A" . '"P") ANY ()) ('"BR" EMPTY ()))
2673           '() '((#f '"UA" . '"URI1")) seed))
2674        '())))
2675
2676 ; contradiction to xmlns declaration
2677 (assert (failed?
2678 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2679        (lambda (elem-gi seed)
2680          (values
2681           '(('"DIV" ANY (('"B" CDATA IMPLIED #f)
2682                       ('"xmlns"  CDATA FIXED "URI2")
2683                       (('"A" . '"B") CDATA IMPLIED #f)
2684                       ))
2685             (('"A" . '"P") ANY ()) ('"BR" EMPTY ()))
2686           '() '((#f '"UA" . '"URI1")) seed))
2687        '())))
2688
2689 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2690        (lambda (elem-gi seed)
2691          (values
2692           '(('"DIV" ANY (('"B" CDATA IMPLIED #f)
2693                       ('"xmlns"  CDATA FIXED "URI1")
2694                       (('"A" . '"B") CDATA IMPLIED #f)
2695                       ))
2696             (('"A" . '"P") ANY ()) ('"BR" EMPTY ()))
2697           '() '((#f '"UA" . '"URI1")) seed))
2698       '((('"UA" . '"DIV") (@ ('"B" "B") (('"UA" . '"B") "A"))
2699          (*NAMESPACES* ((*DEFAULT* '"UA" . '"URI1")
2700                         ('"A" '"UA" . '"URI1") (#f '"UA" . '"URI1")))
2701          (('"UA" . '"P")
2702           (*NAMESPACES* ((*DEFAULT* #f . #f) 
2703                          (*DEFAULT* '"UA" . '"URI1")
2704                          ('"A" '"UA" . '"URI1") (#f '"UA" . '"URI1")))
2705           ('"BR"
2706            (*NAMESPACES* ((*DEFAULT* #f . #f) (*DEFAULT* '"UA" . '"URI1")
2707                           ('"A" '"UA" . '"URI1") (#f '"UA" . '"URI1"))))))))
2708
2709 (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>"
2710        (lambda (elem-gi seed)
2711          (values
2712           '(('"DIV" ANY (('"B" CDATA IMPLIED #f)
2713                          (('"A" . '"B") CDATA IMPLIED #f)
2714                          (('"C" . '"B") CDATA IMPLIED "xx")
2715                          (('"xmlns" . '"C") CDATA IMPLIED "URI2")
2716                       ))
2717             (('"A" . '"P") ANY ()) ('"BR" EMPTY ()))
2718           '() '((#f '"UA" . '"URI1")) seed))
2719        '((('"UA" . '"DIV") (@ ('"B" "B") (('"UA" . '"B") "A")
2720                               (('"URI2" . '"B") "xx"))
2721           (*NAMESPACES* ((*DEFAULT* '"UA" . '"URI1")
2722                          ('"A" '"UA" . '"URI1")
2723                          ('"C" '"URI2" . '"URI2")
2724                          (#f '"UA" . '"URI1")))
2725           (('"UA" . '"P")
2726            (*NAMESPACES* ((*DEFAULT* #f . #f) (*DEFAULT* '"UA" . '"URI1")
2727                           ('"A" '"UA" . '"URI1")
2728                           ('"C" '"URI2" . '"URI2") (#f '"UA" . '"URI1")))
2729            ('"BR" 
2730             (*NAMESPACES* ((*DEFAULT* #f . #f)
2731                            (*DEFAULT* '"UA" . '"URI1")
2732                            ('"A" '"UA" . '"URI1")
2733                            ('"C" '"URI2" . '"URI2")
2734                            (#f '"UA" . '"URI1"))))))))
2735))
2736
2737   
2738
2739;========================================================================
2740;               Highest-level parsers: XML to SXML
2741;
2742
2743; First, a few utility procedures that turned out useful
2744
2745; procedure: ssax:reverse-collect-str LIST-OF-FRAGS -> LIST-OF-FRAGS
2746; given the list of fragments (some of which are text strings)
2747; reverse the list and concatenate adjacent text strings.
2748; We can prove from the general case below that if LIST-OF-FRAGS
2749; has zero or one element, the result of the procedure is equal?
2750; to its argument. This fact justifies the shortcut evaluation below.
2751(define (ssax:reverse-collect-str fragments)
2752  (cond
2753    ((null? fragments) '())     ; a shortcut
2754    ((null? (cdr fragments)) fragments) ; see the comment above
2755    (else
2756      (let loop ((fragments fragments) (result '()) (strs '()))
2757        (cond
2758          ((null? fragments)
2759            (if (null? strs) result
2760              (cons (string-concatenate/shared strs) result)))
2761          ((string? (car fragments))
2762            (loop (cdr fragments) result (cons (car fragments) strs)))
2763          (else
2764            (loop (cdr fragments)
2765              (cons
2766                (car fragments)
2767                (if (null? strs) result
2768                  (cons (string-concatenate/shared strs) result)))
2769              '())))))))
2770
2771
2772;     ssax:reverse-collect-str-drop-ws LIST-OF-FRAGS -> LIST-OF-FRAGS
2773; given the list of fragments (some of which are text strings)
2774; reverse the list and concatenate adjacent text strings.
2775; We also drop "unsignificant" whitespace, that is, whitespace
2776; in front, behind and between elements. The whitespace that
2777; is included in character data is not affected.
2778; We use this procedure to "intelligently" drop "insignificant"
2779; whitespace in the parsed SXML. If the strict compliance with
2780; the XML Recommendation regarding the whitespace is desired, please
2781; use the ssax:reverse-collect-str procedure instead.
2782
2783(define (ssax:reverse-collect-str-drop-ws fragments)
2784  (cond
2785    ((null? fragments) '())             ; a shortcut
2786    ((null? (cdr fragments))            ; another shortcut
2787     (if (and (string? (car fragments)) (string-whitespace? (car fragments)))
2788       '() fragments))                  ; remove trailing ws
2789    (else
2790      (let loop ((fragments fragments) (result '()) (strs '())
2791                  (all-whitespace? #t))
2792        (cond
2793          ((null? fragments)
2794            (if all-whitespace? result  ; remove leading ws
2795              (cons (string-concatenate/shared strs) result)))
2796          ((string? (car fragments))
2797            (loop (cdr fragments) result (cons (car fragments) strs)
2798              (and all-whitespace?
2799                (string-whitespace? (car fragments)))))
2800          (else
2801            (loop (cdr fragments)
2802              (cons
2803                (car fragments)
2804                (if all-whitespace? result
2805                  (cons (string-concatenate/shared strs) result)))
2806              '() #t)))))))
2807
2808
2809; procedure: ssax:xml->sxml PORT NAMESPACE-PREFIX-ASSIG
2810;
2811; This is an instance of a SSAX parser above that returns an SXML
2812; representation of the XML document to be read from PORT.
2813; NAMESPACE-PREFIX-ASSIG is a list of (USER-PREFIX . URI-STRING)
2814; that assigns USER-PREFIXes to certain namespaces identified by
2815; particular URI-STRINGs. It may be an empty list.
2816; The procedure returns an SXML tree. The port points out to the
2817; first character after the root element.
2818
2819(define (ssax:xml->sxml port namespace-prefix-assig)
2820  (letrec
2821      ((namespaces
2822        (map (lambda (el)
2823               (cons* #f (car el) (ssax:uri-string->symbol (cdr el))))
2824             namespace-prefix-assig))
2825
2826       (RES-NAME->SXML
2827        (lambda (res-name)
2828          (string->symbol
2829           (string-append
2830            (symbol->string (car res-name))
2831            ":"
2832            (symbol->string (cdr res-name))))))
2833
2834       )
2835    (let ((result
2836           (reverse
2837            ((ssax:make-parser
2838             NEW-LEVEL-SEED 
2839             (lambda (elem-gi attributes namespaces
2840                              expected-content seed)
2841               '())
2842   
2843             FINISH-ELEMENT
2844             (lambda (elem-gi attributes namespaces parent-seed seed)
2845               (let ((seed (ssax:reverse-collect-str-drop-ws seed))
2846                     (attrs
2847                      (attlist-fold
2848                       (lambda (attr accum)
2849                         (cons (list
2850                                (if (symbol? (car attr)) (car attr)
2851                                    (RES-NAME->SXML (car attr)))
2852                                (cdr attr)) accum))
2853                       '() attributes)))
2854                 (cons
2855                  (cons
2856                   (if (symbol? elem-gi) elem-gi
2857                       (RES-NAME->SXML elem-gi))
2858                   (if (null? attrs) seed
2859                       (cons (cons '@ attrs) seed)))
2860                  parent-seed)))
2861
2862             CHAR-DATA-HANDLER
2863             (lambda (string1 string2 seed)
2864               (if (string-null? string2) (cons string1 seed)
2865                   (cons* string2 string1 seed)))
2866
2867             DOCTYPE
2868             (lambda (port docname systemid internal-subset? seed)
2869               (when internal-subset?
2870                     (ssax:warn port
2871                           "Internal DTD subset is not currently handled ")
2872                     (ssax:skip-internal-dtd port))
2873               (ssax:warn port "DOCTYPE DECL " docname " "
2874                     systemid " found and skipped")
2875               (values #f '() namespaces seed))
2876
2877             UNDECL-ROOT
2878             (lambda (elem-gi seed)
2879               (values #f '() namespaces seed))
2880
2881             PI
2882             ((*DEFAULT* .
2883                (lambda (port pi-tag seed)
2884                  (cons
2885                   (list '*PI* pi-tag (ssax:read-pi-body-as-string port))
2886                   seed))))
2887             )
2888            port '()))))
2889      (cons '*TOP*
2890            (if (null? namespace-prefix-assig) result
2891                (cons
2892                 (list '@ (cons '*NAMESPACES* 
2893                                 (map (lambda (ns) (list (car ns) (cdr ns)))
2894                                      namespace-prefix-assig)))
2895                      result)))
2896)))
2897
2898; a few lines of validation code
2899(run-test (letrec
2900    ((test (lambda (str namespace-assig expected-res)
2901          (newline) (display "input: ")
2902          (write (unesc-string str)) (newline) (display "Result: ")
2903          (let ((result
2904                 (call-with-input-string (unesc-string str)
2905                     (lambda (port)
2906                       (ssax:xml->sxml port namespace-assig)))))
2907            (pp result)
2908            (assert (equal_? result expected-res))))))
2909
2910    (test " <BR/>" '() '(*TOP* (BR)))
2911    (test "<BR></BR>" '() '(*TOP* (BR)))
2912    (test " <BR CLEAR='ALL'%nCLASS='Class1'/>" '()
2913          '(*TOP* (BR (@ (CLEAR "ALL") (CLASS "Class1")))))
2914    (test "   <A HREF='URL'>  link <I>itlink </I> &amp;amp;</A>" '()
2915          '(*TOP* (A (@ (HREF "URL")) "  link " (I "itlink ") " &amp;")))
2916    (test "   <A HREF='URL' xml:space='preserve'>  link <I>itlink </I> &amp;amp;</A>" '()
2917          '(*TOP* (A (@ (xml:space "preserve") (HREF "URL"))
2918                     "  link " (I "itlink ") " &amp;")))
2919    (test "   <A HREF='URL' xml:space='preserve'>  link <I xml:space='default'>itlink </I> &amp;amp;</A>" '()
2920          '(*TOP* (A (@ (xml:space "preserve") (HREF "URL"))
2921                     "  link " (I (@ (xml:space "default"))
2922                                  "itlink ") " &amp;")))
2923    (test " <P><?pi1  p1 content ?>?<?pi2 pi2? content? ??></P>" '()
2924          '(*TOP* (P (*PI* pi1 "p1 content ") "?"
2925                     (*PI* pi2 "pi2? content? ?"))))
2926    (test " <P>some text <![CDATA[<]]>1%n&quot;<B>strong</B>&quot;%r</P>"
2927          '()
2928          `(*TOP* (P ,(unesc-string "some text <1%n\"")
2929                      (B "strong") ,(unesc-string "\"%n"))))
2930    (test " <P><![CDATA[<BR>%n<![CDATA[<BR>]]&gt;]]></P>" '()
2931          `(*TOP* (P ,(unesc-string "<BR>%n<![CDATA[<BR>]]>"))))
2932;    (test "<T1><T2>it&apos;s%r%nand   that%n</T2>%r%n%r%n%n</T1>" '()
2933;         '(*TOP* (T1 (T2 "it's%nand   that%n") "%n%n%n")))
2934    (test "<T1><T2>it&apos;s%r%nand   that%n</T2>%r%n%r%n%n</T1>" '()
2935          `(*TOP* (T1 (T2 ,(unesc-string "it's%nand   that%n")))))
2936    (test "<T1><T2>it&apos;s%rand   that%n</T2>%r%n%r%n%n</T1>" '()
2937          `(*TOP* (T1 (T2 ,(unesc-string "it's%nand   that%n")))))
2938    (test "<!DOCTYPE T SYSTEM 'system1' ><!-- comment -->%n<T/>" '()
2939          '(*TOP* (T)))
2940    (test "<?xml version='1.0'?>%n<WEIGHT unit=\"pound\">%n<NET certified='certified'> 67 </NET>%n<GROSS> 95 </GROSS>%n</WEIGHT>" '()
2941          '(*TOP* (*PI* xml "version='1.0'") (WEIGHT (@ (unit "pound"))
2942                (NET (@ (certified "certified")) " 67 ")
2943                (GROSS " 95 "))
2944                  ))
2945;     (test "<?xml version='1.0'?>%n<WEIGHT unit=\"pound\">%n<NET certified='certified'> 67 </NET>%n<GROSS> 95 </GROSS>%n</WEIGHT>" '()
2946;         '(*TOP* (*PI* xml "version='1.0'") (WEIGHT (@ (unit "pound"))
2947;                "%n" (NET (@ (certified "certified")) " 67 ")
2948;                "%n" (GROSS " 95 ") "%n")
2949;                 ))
2950    (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>" '()
2951          '(*TOP* (URI1:DIV (@ (URI1:B "A") (B "B")) (URI1:P (BR)))))
2952    (test "<DIV A:B='A' B='B' xmlns:A='URI1' xmlns='URI1'><A:P xmlns=''><BR/></A:P></DIV>" '((UA . "URI1"))
2953          '(*TOP* (@ (*NAMESPACES* (UA "URI1")))
2954                  (UA:DIV (@ (UA:B "A") (B "B")) (UA:P (BR)))))
2955
2956    ; A few tests from XML Namespaces Recommendation
2957    (test (string-append
2958           "<x xmlns:edi='http://ecommerce.org/schema'>"
2959           "<!-- the 'taxClass' attribute's  ns http://ecommerce.org/schema -->"
2960           "<lineItem edi:taxClass='exempt'>Baby food</lineItem>" nl
2961           "</x>") '()
2962           '(*TOP* 
2963             (x (lineItem
2964                 (@ (http://ecommerce.org/schema:taxClass "exempt"))
2965            "Baby food"))))
2966    (test (string-append
2967           "<x xmlns:edi='http://ecommerce.org/schema'>"
2968           "<!-- the 'taxClass' attribute's  ns http://ecommerce.org/schema -->"
2969           "<lineItem edi:taxClass='exempt'>Baby food</lineItem>"
2970           "</x>") '((EDI . "http://ecommerce.org/schema"))
2971           '(*TOP*
2972             (@ (*NAMESPACES* (EDI "http://ecommerce.org/schema")))
2973             (x (lineItem
2974                 (@ (EDI:taxClass "exempt"))
2975            "Baby food"))))
2976
2977    (test (string-append
2978           "<bk:book xmlns:bk='urn:loc.gov:books' "
2979                     "xmlns:isbn='urn:ISBN:0-395-36341-6'>"
2980           "<bk:title>Cheaper by the Dozen</bk:title>"
2981           "<isbn:number>1568491379</isbn:number></bk:book>")
2982          '()
2983          '(*TOP* (urn:loc.gov:books:book
2984                   (urn:loc.gov:books:title "Cheaper by the Dozen")
2985                   (urn:ISBN:0-395-36341-6:number "1568491379"))))
2986
2987    (test (string-append
2988           "<!-- initially, the default namespace is 'books' -->"
2989           "<book xmlns='urn:loc.gov:books' "
2990           "xmlns:isbn='urn:ISBN:0-395-36341-6'>"
2991           "<title>Cheaper by the Dozen</title>"
2992           "<isbn:number>1568491379</isbn:number>"
2993           "<notes>"
2994           "<!-- make HTML the default namespace for some commentary -->"
2995           "<p xmlns='urn:w3-org-ns:HTML'>"
2996           "This is a <i>funny</i> book!"
2997            "</p>"
2998            "</notes>"
2999            "</book>") '()
3000            '(*TOP* (urn:loc.gov:books:book
3001                   (urn:loc.gov:books:title "Cheaper by the Dozen")
3002                   (urn:ISBN:0-395-36341-6:number "1568491379")
3003                   (urn:loc.gov:books:notes
3004                    (urn:w3-org-ns:HTML:p 
3005                     "This is a " (urn:w3-org-ns:HTML:i "funny")
3006                     " book!")))))
3007
3008    (test (string-append
3009           "<Beers>"
3010           "<!-- the default namespace is now that of HTML -->"
3011           "<table xmlns='http://www.w3.org/TR/REC-html40'>"
3012           "<th><td>Name</td><td>Origin</td><td>Description</td></th>"
3013           "<tr>"
3014           "<!-- no default namespace inside table cells -->"
3015           "<td><brandName xmlns=\"\">Huntsman</brandName></td>"
3016           "<td><origin xmlns=''>Bath, UK</origin></td>"
3017           "<td>"
3018              "<details xmlns=''><class>Bitter</class><hop>Fuggles</hop>"
3019              "<pro>Wonderful hop, light alcohol, good summer beer</pro>"
3020              "<con>Fragile; excessive variance pub to pub</con>"
3021              "</details>"
3022           "</td>"
3023           "</tr>"
3024           "</table>"
3025           "</Beers>")
3026              '((html . "http://www.w3.org/TR/REC-html40"))
3027              '(*TOP*
3028                (@ (*NAMESPACES* (html "http://www.w3.org/TR/REC-html40")))
3029                (Beers (html:table
3030                (html:th (html:td "Name")
3031                         (html:td "Origin")
3032                         (html:td "Description"))
3033                (html:tr (html:td (brandName "Huntsman"))
3034                         (html:td (origin "Bath, UK"))
3035                         (html:td 
3036                          (details 
3037                           (class "Bitter")
3038                        (hop "Fuggles")
3039                        (pro "Wonderful hop, light alcohol, good summer beer")
3040                        (con "Fragile; excessive variance pub to pub"))))))))
3041
3042    (test (string-append
3043       "<!-- 1 --><RESERVATION xmlns:HTML='http://www.w3.org/TR/REC-html40'>"
3044       "<!-- 2 --><NAME HTML:CLASS=\"largeSansSerif\">Layman, A</NAME>"
3045       "<!-- 3 --><SEAT CLASS='Y' HTML:CLASS=\"largeMonotype\">33B</SEAT>"
3046       "<!-- 4 --><HTML:A HREF='/cgi-bin/ResStatus'>Check Status</HTML:A>"
3047       "<!-- 5 --><DEPARTURE>1997-05-24T07:55:00+1</DEPARTURE></RESERVATION>")
3048          '((HTML . "http://www.w3.org/TR/REC-html40"))
3049          '(*TOP*
3050            (@ (*NAMESPACES* (HTML "http://www.w3.org/TR/REC-html40")))
3051             (RESERVATION
3052              (NAME (@ (HTML:CLASS "largeSansSerif")) "Layman, A")
3053              (SEAT (@ (HTML:CLASS "largeMonotype") (CLASS "Y")) "33B")
3054              (HTML:A (@ (HREF "/cgi-bin/ResStatus")) "Check Status")
3055              (DEPARTURE "1997-05-24T07:55:00+1"))))
3056    ; Part of RDF from the XML Infoset
3057        (test (string-concatenate/shared (list-intersperse '(
3058   "<?xml version='1.0' encoding='utf-8' standalone='yes'?>"
3059   "<!-- this can be decoded as US-ASCII or iso-8859-1 as well,"
3060   "  since it contains no characters outside the US-ASCII repertoire -->"
3061   "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
3062   "         xmlns:rdfs='http://www.w3.org/2000/01/rdf-schema#'"
3063   "          xmlns='http://www.w3.org/2001/02/infoset#'>"
3064   "<rdfs:Class ID='Boolean'/>"
3065   "<Boolean ID='Boolean.true'/>"
3066   "<Boolean ID='Boolean.false'/>"
3067   "<!--Info item classes-->"
3068   "<rdfs:Class ID='InfoItem'/>"
3069   "<rdfs:Class ID='Document' rdfs:subClassOf='#InfoItem'/>"
3070   "<rdfs:Class ID='Element' rdfs:subClassOf='#InfoItem'/>"
3071   "<rdfs:Class ID='Attribute' rdfs:subClassOf='#InfoItem'/>"
3072   "<rdfs:Class ID='InfoItemSet'
3073      rdfs:subClassOf='http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag'/>"
3074   "<rdfs:Class ID='AttributeSet' rdfs:subClassOf='#InfoItemSet'/>"
3075   "<!--Info item properties-->"
3076   "<rdfs:Property ID='allDeclarationsProcessed'>"
3077   "<rdfs:domain resource='#Document'/>"
3078   "<rdfs:range resource='#Boolean'/></rdfs:Property>"
3079   "<rdfs:Property ID='attributes'>"
3080   "<rdfs:domain resource='#Element'/>"
3081   "<rdfs:range resource='#AttributeSet'/>"
3082   "</rdfs:Property>"
3083   "</rdf:RDF>")
3084   (string #\newline)))
3085   '((RDF . "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
3086     (RDFS . "http://www.w3.org/2000/01/rdf-schema#")
3087     (ISET . "http://www.w3.org/2001/02/infoset#"))
3088   '(*TOP* (@ (*NAMESPACES*
3089         (RDF "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
3090         (RDFS "http://www.w3.org/2000/01/rdf-schema#")
3091         (ISET "http://www.w3.org/2001/02/infoset#")))
3092       (*PI* xml "version='1.0' encoding='utf-8' standalone='yes'")
3093       (RDF:RDF
3094        (RDFS:Class (@ (ID "Boolean")))
3095        (ISET:Boolean (@ (ID "Boolean.true")))
3096        (ISET:Boolean (@ (ID "Boolean.false")))
3097        (RDFS:Class (@ (ID "InfoItem")))
3098        (RDFS:Class (@ (RDFS:subClassOf "#InfoItem") (ID "Document")))
3099        (RDFS:Class (@ (RDFS:subClassOf "#InfoItem") (ID "Element")))
3100        (RDFS:Class (@ (RDFS:subClassOf "#InfoItem") (ID "Attribute")))
3101        (RDFS:Class
3102         (@ (RDFS:subClassOf
3103             "http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag")
3104            (ID "InfoItemSet")))
3105        (RDFS:Class
3106         (@ (RDFS:subClassOf "#InfoItemSet") (ID "AttributeSet")))
3107        (RDFS:Property
3108         (@ (ID "allDeclarationsProcessed"))
3109         (RDFS:domain (@ (resource "#Document")))
3110         (RDFS:range (@ (resource "#Boolean"))))
3111        (RDFS:Property
3112         (@ (ID "attributes"))
3113         (RDFS:domain (@ (resource "#Element")))
3114         (RDFS:range (@ (resource "#AttributeSet")))))))
3115         
3116    ; Part of RDF from RSS of the Daemon News Mall
3117        (test (string-concatenate/shared (list-intersperse '(
3118  "<?xml version='1.0'?><rdf:RDF "
3119    "xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' "
3120     "xmlns='http://my.netscape.com/rdf/simple/0.9/'>"
3121     "<channel>"
3122     "<title>Daemon News Mall</title>"
3123     "<link>http://mall.daemonnews.org/</link>"
3124     "<description>Central source for all your BSD needs</description>"
3125     "</channel>"
3126     "<item>"
3127     "<title>Daemon News Jan/Feb Issue NOW Available! Subscribe $24.95</title>"
3128     "<link>http://mall.daemonnews.org/?page=shop/flypage&amp;product_id=880</link>"
3129     "</item>"
3130     "<item>"
3131     "<title>The Design and Implementation of the 4.4BSD Operating System $54.95</title>"
3132     "<link>http://mall.daemonnews.org/?page=shop/flypage&amp;product_id=912&amp;category_id=1761</link>"
3133     "</item>"
3134     "</rdf:RDF>")
3135   (string #\newline)
3136   ))
3137   '((RDF . "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
3138     (RSS . "http://my.netscape.com/rdf/simple/0.9/")
3139     (ISET . "http://www.w3.org/2001/02/infoset#"))
3140   '(*TOP* (@ (*NAMESPACES*
3141         (RDF "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
3142         (RSS "http://my.netscape.com/rdf/simple/0.9/")
3143         (ISET "http://www.w3.org/2001/02/infoset#")))
3144       (*PI* xml "version='1.0'")
3145       (RDF:RDF (RSS:channel
3146                  (RSS:title "Daemon News Mall")
3147                  (RSS:link "http://mall.daemonnews.org/")
3148                  (RSS:description "Central source for all your BSD needs"))
3149                (RSS:item
3150                  (RSS:title
3151                    "Daemon News Jan/Feb Issue NOW Available! Subscribe $24.95")
3152                  (RSS:link
3153                    "http://mall.daemonnews.org/?page=shop/flypage&product_id=880"))
3154                (RSS:item
3155                  (RSS:title
3156                    "The Design and Implementation of the 4.4BSD Operating System $54.95")
3157                  (RSS:link
3158                    "http://mall.daemonnews.org/?page=shop/flypage&product_id=912&category_id=1761")))))
3159
3160    (test (string-concatenate/shared (list-intersperse 
3161       '("<Forecasts TStamp='958082142'>"
3162         "<TAF TStamp='958066200' LatLon='36.583, -121.850' BId='724915'"
3163         "  SName='KMRY, MONTEREY PENINSULA'>"
3164         "<VALID TRange='958068000, 958154400'>111730Z 111818</VALID>"
3165         "<PERIOD TRange='958068000, 958078800'>"
3166         "<PREVAILING>31010KT P6SM FEW030</PREVAILING>"
3167         "</PERIOD>"
3168         "<PERIOD TRange='958078800, 958104000' Title='FM2100'>"
3169         "<PREVAILING>29016KT P6SM FEW040</PREVAILING>"
3170         "</PERIOD>"
3171         "<PERIOD TRange='958104000, 958154400' Title='FM0400'>"
3172         "<PREVAILING>29010KT P6SM SCT200</PREVAILING>"
3173         "<VAR Title='BECMG 0708' TRange='958114800, 958118400'>VRB05KT</VAR>"
3174         "</PERIOD></TAF>"
3175         "</Forecasts>")
3176       (string #\newline)
3177       ))
3178          '()
3179          '(*TOP* (Forecasts
3180                   (@ (TStamp "958082142"))
3181                   (TAF (@ (TStamp "958066200")
3182                           (SName "KMRY, MONTEREY PENINSULA")
3183                           (LatLon "36.583, -121.850")
3184                           (BId "724915"))
3185              (VALID (@ (TRange "958068000, 958154400")) "111730Z 111818")
3186              (PERIOD (@ (TRange "958068000, 958078800"))
3187                      (PREVAILING "31010KT P6SM FEW030"))
3188              (PERIOD (@ (Title "FM2100") (TRange "958078800, 958104000"))
3189                      (PREVAILING "29016KT P6SM FEW040"))
3190              (PERIOD (@ (Title "FM0400") (TRange "958104000, 958154400"))
3191                      (PREVAILING "29010KT P6SM SCT200")
3192                      (VAR (@ (Title "BECMG 0708")
3193                              (TRange "958114800, 958118400"))
3194                           "VRB05KT"))))))
3195))
3196
3197(run-test
3198 (newline)
3199 (display "All tests passed")
3200 (newline)
3201)
Note: See TracBrowser for help on using the repository browser.