XPA was invented out of the feeling there really is something missing in processing and transforming XML data compared to traditional parsing technique. If you see it that way, state of the art processing of XML can be compared to state of the art parsing technique of the early 1960's. Although this comparison is not quite fair, processing and transforming XML could be much more comfortable. XPA tries to close this gap.
From a certain point of view building a parser is a way to access information that is stored in a special pattern and syntax. To do so you normally start processing your data with a lexer that on one hand checks your data for correct syntax and on the other hand filters out redundant data and bundles the real information into small chunks of data called tokens. These tokens will than be fed into the token parser (footnote: One confusing thing in parsing technique terminology is the word "parser". It sometimes refers to the whole program consisting of the lexer, token parser and tree parser and sometimes only the token parser is meant. I will use parser for the whole thing and token parser for the entity parsing tokens coming from the lexer.) which again checks for correct syntax - only on a higher level.
Up to here processing XML with a standard SAX or DOM parser (footnote: If you need
more explanations about SAX and DOM take this page as a
starting point. If you need more information about XML you can find a
starting point here) isn't
all that different. Leaving aside the DTD or Schema the XML
parser checks
if the document is well formed and makes chunks out of the XML
data. Like "start of element", "end of element", "text data", etc.
This is all like the lexers job in traditional parsing technique.
If there is something like a DTD, the XML parser also checks if these
chunks of information occur in correct sequence. This can be compared
to the actual job of the token parser. Now for the difference: When you
write a token or tree parser using grammars you not only write the
symbols of your
grammar, but also some semantic action taken upon occurrence of a
special token in a special context. These actions say what to do with
your information after its structure is revealed.
Have a look at the following part of an ANTLR grammar:
rule :
A
| B
;
The left side of such a rule is a symbol for what is to be parsed in that
rule. In this case 'rule', a non-terminal. On the right side, separated
by the colon, you find of what this non-terminal 'rule' is made of. In
this case it is made up of a terminal 'A' or 'B'. Some people may be
more familiar with "name of a rule" for non-terminal and "token" for a
terminal. Usually, more information is associated to a token than its
name, such as a text or other annotations. You can now add actions to
the grammar that are executed when token 'A' or 'B' is parsed by this
rule. To access additional information attached to tokens, you can
give them names that precede the token followed by a colon:
rule :
a:A
{ System.out.println("A: "+a.getText()); }
| b:B
{ System.out.println("B: "+b.getText()); }
;
This example prints out the text of token 'A' or 'B' depending on what
token we find.
Coming back to XML and DTDs, we can say there is no equivalence to this in standard XML parsing. The DTD (and also Schema) is just a grammar - no annotations, no actions. That is why you have to deal with SAX or DOM to get to your information.
To fix this gap, there are two possible approaches, both of them supported by XPA:
<component> <subcomponent1>Text inside subcomponent1</subcomponent1> <subcomponent2>Text inside subcomponent2</subcomponent2> </component>Our task might be to access the data inside of element subcomponent1 and subcomponent2. Using XPA as XML interface to ANTLR you can write the following grammar to do this:
component : "<component>" ( "<subcomponent1>" t1:PCDATA "</subcomponent1>" { System.out.println("Subcomponent1: "+t1.getText()); } | "<subcomponent2>" t2:PCDATA "</subcomponent2>" { System.out.println("Subcomponent2: "+t2.getText()); } )* "</component>" ;Notice that "<component>" and PCDATA are tokens just like 'A' and 'B' in our previous example. By definition PCDATA is the token that represents text when interfaced by XPA. Also by definition tokens for start and end tags have the format <tag-name> resp. </tag-name>. To make it very clear: for every different type of start and end tag in your XML document, there is a token in your ANTLR grammar.
class SAXHandler extends DefaultHandler { Stack treeStack = new Stack(); public void characters(char ch[], int start, int length) { String text = new String(ch, start, length); if (treeStack.size() >= 2) { String top = (String)treeStack.pop(); String top2 = (String)treeStack.pop(); treeStack.push(top2); treeStack.push(top); if (top2.equals("component")) { if (top.equals("subcomponent1")) { System.out.println("Subcomponent1: "+text); } else if (top.equals("subcomponent1")) { System.out.println("Subcomponent1: "+text); } } } } public void endElement(String uri, String localName, String qName) { treeStack.pop(); } public void startElement(String uri, String localName, String qName, Attributes attributes) { treeStack.push(qName); } }
Element componentElement = (Element)document.getElementsByTagName("component").item(0); NodeList subcomponentList1 = componentElement.getElementsByTagName("subcomponent1"); for ( int i = 0; i < subcomponentList1.getLength(); i++ ) { Element subcomponentElement = (Element)subcomponentList1.item(i); String text = subcomponentElement.getFirstChild().getNodeValue(); System.out.println("Subcomponent1: "+text); } NodeList subcomponentList2 = componentElement.getElementsByTagName("subcomponent2"); for ( int i = 0; i < subcomponentList2.getLength(); i++ ) { Element subcomponentElement = (Element)subcomponentList2.item(i); String text = subcomponentElement.getFirstChild().getNodeValue(); System.out.println("Subcomponent2: "+text); }It is not only that the DOM is a bit cumbersome to use, it also uses quite a lot of memory as the whole XML document has to be kept in memory. This is no problem for our small file, but some files really get large. Also, there is a problem when your file is transmitted over a slow connection. When you start the parse, the next time you get control over what is happening is when the whole tree is constructed. This may take quite some time and blocks processing until the tree is complete.
component : #("<component>" ( #("<subcomponent1>" t1:PCDATA { System.out.println("Subcomponent1: "+t1.getText()); } ) | #("<subcomponent2>" t2:PCDATA { System.out.println("Subcomponent2: "+t2.getText()); } ) )* ;Now there are special facilities in ANTLR that make very powerful transformations on this tree possible. There is much left to be said about ANTLR tree transformers. I will add a discussion of them later. For now simply download XPA and have a look at the examples.
XPA: Copyright (c) 2003, Oliver Zeigermann
February 2003