XML - Notation
by satheesh, 2010-01-28 10:32:16
The formal grammar of XML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form
symbol ::= expression
Symbols are written with an initial capital letter if they are the start symbol of a regular language, otherwise with an initial lowercase letter. Literal strings are quoted.
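As a rough illustration of the rule form, the following Python sketch splits a rule at its "::=" separator and applies the capitalization convention described above. The function names split_rule and starts_regular_language are hypothetical helpers introduced here, not part of any specification or library.

```python
def split_rule(rule):
    """Split 'symbol ::= expression' into a (symbol, expression) pair."""
    symbol, separator, expression = rule.partition("::=")
    if not separator:
        raise ValueError("not a grammar rule: missing '::='")
    return symbol.strip(), expression.strip()


def starts_regular_language(symbol):
    """Apply the naming convention: an initial capital letter marks
    the start symbol of a regular language."""
    return symbol[:1].isupper()


# Example rule defining white space, written in the notation of this section.
symbol, expression = split_rule("S ::= (#x20 | #x9 | #xD | #xA)+")
print(symbol, "|", expression, "|", starts_regular_language(symbol))
```

The sketch only separates the two sides of a rule; it does not parse the expression itself.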
Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:
#xN
where N is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is N. The number of leading zeros in the #xN form is insignificant.
[a-zA-Z], [#xN-#xN]
matches any Char with a value in the range(s) indicated (inclusive).
[abc], [#xN#xN#xN]
matches any Char with a value among the characters enumerated. Enumerations and ranges can be mixed in one set of brackets.
[^a-z], [^#xN-#xN]
matches any Char with a value outside the range indicated.
[^abc], [^#xN#xN#xN]
matches any Char with a value not among the characters given. Enumerations and ranges of forbidden values can be mixed in one set of brackets.
"string"
matches a literal string matching that given inside the double quotes.
'string'
matches a literal string matching that given inside the single quotes.
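The basic expressions above map fairly directly onto regular-expression syntax. The Python sketch below is one possible translation, under stated assumptions: the helper names (char_from_ref, class_to_regex, literal_to_regex) are hypothetical, sets are assumed not to contain literal characters that need escaping inside a Python character class, and a Python class matches any code point rather than only those allowed by Char.

```python
import re

# A '#xN' reference: '#x' followed by hexadecimal digits.
HEX_REF = re.compile(r"#x([0-9A-Fa-f]+)")


def char_from_ref(ref):
    """Return the character whose code point is named by a '#xN' reference."""
    match = HEX_REF.fullmatch(ref)
    if match is None:
        raise ValueError("not a #xN reference: %r" % ref)
    return chr(int(match.group(1), 16))


def class_to_regex(char_class):
    """Translate a set such as '[^#x30-#x39]' into a Python regex class
    by replacing each '#xN' with the (escaped) character it names."""
    def expand(match):
        return re.escape(chr(int(match.group(1), 16)))
    return HEX_REF.sub(expand, char_class)


def literal_to_regex(literal):
    """Translate a quoted literal ('...' or "...") into a regex matching it exactly."""
    if len(literal) < 2 or literal[0] != literal[-1] or literal[0] not in "\"'":
        raise ValueError("not a quoted literal: %r" % literal)
    return re.escape(literal[1:-1])


# Leading zeros are insignificant: both references name the same character.
print(char_from_ref("#x41") == char_from_ref("#x0041"))
print(class_to_regex("[#x30-#x39]"))
```

Ranges and enumerations written with plain characters, such as [a-zA-Z] or [^abc], pass through class_to_regex unchanged because they already use the same bracket syntax.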
These symbols may be combined to match more complex patterns as follows, where A and B represent simple expressions:
(expression)
expression is treated as a unit and may be combined as described in this list.
A?
matches A or nothing; optional A.
A B
matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).
A | B
matches A or B.
A - B
matches any string that matches A but does not match B.
A+
matches one or more occurrences of A. Concatenation has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).
A*
matches zero or more occurrences of A. Concatenation has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).
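Most of these operators have direct counterparts in common regular-expression syntax (?, +, *, |, grouping, and juxtaposition for concatenation), but the difference operator A - B does not. The Python sketch below shows one way to emulate it, assuming A and B have already been translated into regex patterns; matches_difference is a hypothetical helper, and the letter-only pattern is a simplified stand-in, not a real name grammar.

```python
import re


def matches_difference(a, b, text):
    """Emulate 'A - B': text must match pattern a in full
    and must not match pattern b in full."""
    return re.fullmatch(a, text) is not None and re.fullmatch(b, text) is None


# Simplified stand-ins: "a run of letters" minus "xml in any letter case".
LETTERS = r"[A-Za-z]+"
RESERVED = r"[Xx][Mm][Ll]"

print(matches_difference(LETTERS, RESERVED, "php"))      # matches A, not B
print(matches_difference(LETTERS, RESERVED, "XML"))      # excluded by B
print(matches_difference(LETTERS, RESERVED, "xmlfoo"))   # B must match the whole string

# Precedence behaves the same way in Python regexes:
# 'ab|cd' groups as (ab)|(cd), so 'abd' is not matched.
print(re.fullmatch(r"ab|cd", "abd") is None)
```

Testing the two patterns separately keeps the exclusion tied to the whole string, which is what "matches A but does not match B" describes.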
Other notations used in the productions are:
/* ... */
comment.
[ wfc: ... ]
well-formedness constraint; this identifies by name a constraint on well-formed documents associated with a production.
[ vc: ... ]
validity constraint; this identifies by name a constraint on valid documents associated with a production.
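A tool that reads productions may want to separate these annotations from the grammar itself. The Python sketch below is a minimal illustration under several assumptions: split_annotations is a hypothetical helper, the sample production is invented for the example, and quoted literals are assumed not to contain "/*" or text that looks like a constraint marker.

```python
import re

# '/* ... */' comments, possibly spanning several lines.
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
# '[ wfc: Name ]' or '[ vc: Name ]' markers, in any letter case.
CONSTRAINT = re.compile(r"\[\s*(wfc|vc)\s*:\s*([^\]]*?)\s*\]", re.IGNORECASE)


def split_annotations(production):
    """Return (bare_production, constraints), where constraints is a list
    of (kind, name) pairs and kind is 'wfc' or 'vc'."""
    without_comments = COMMENT.sub("", production)
    constraints = [(kind.lower(), name)
                   for kind, name in CONSTRAINT.findall(without_comments)]
    bare = CONSTRAINT.sub("", without_comments)
    # Collapse the white space left behind by the removed annotations.
    return " ".join(bare.split()), constraints


# Invented production, used only to exercise the helper.
rule = "greeting ::= 'hello' S name /* informal */ [ wfc: Known Name ]"
print(split_annotations(rule))
```

Comments are removed first so that a constraint-like marker inside a comment is not reported as a constraint.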