XPA: XML Processing for ANTLR

XPA was invented out of the feeling there really is something missing in processing and transforming XML data compared to traditional parsing technique. If you see it that way, state of the art processing of XML can be compared to state of the art parsing technique of the early 1960's. Although this comparison is not quite fair, processing and transforming XML could be much more comfortable. XPA tries to close this gap.


From a certain point of view building a parser is a way to access information that is stored in a special pattern and syntax. To do so you normally start processing your data with a lexer that on one hand checks your data for correct syntax and on the other hand filters out redundant data and bundles the real information into small chunks of data called tokens. These tokens will than be fed into the token parser (footnote: One confusing thing in parsing technique terminology is the word "parser". It sometimes refers to the whole program consisting of the lexer, token parser and tree parser and sometimes only the token parser is meant. I will use parser for the whole thing and token parser for the entity parsing tokens coming from the lexer.) which again checks for correct syntax - only on a higher level.

Up to here processing XML with a standard SAX or DOM parser (footnote: If you need more explanations about SAX and DOM take this page as a starting point. If you need more information about XML you can find a starting point here) isn't all that different. Leaving aside the DTD or Schema the XML parser checks if the document is well formed and makes chunks out of the XML data. Like "start of element", "end of element", "text data", etc. This is all like the lexers job in traditional parsing technique. If there is something like a DTD, the XML parser also checks if these chunks of information occur in correct sequence. This can be compared to the actual job of the token parser. Now for the difference: When you write a token or tree parser using grammars you not only write the symbols of your grammar, but also some semantic action taken upon occurrence of a special token in a special context. These actions say what to do with your information after its structure is revealed.

Have a look at the following part of an ANTLR grammar:

rule :
    A
  | B
  ;

The left side of such a rule is a symbol for what is to be parsed in that rule. In this case 'rule', a non-terminal. On the right side, separated by the colon, you find of what this non-terminal 'rule' is made of. In this case it is made up of a terminal 'A' or 'B'. Some people may be more familiar with "name of a rule" for non-terminal and "token" for a terminal. Usually, more information is associated to a token than its name, such as a text or other annotations. You can now add actions to the grammar that are executed when token 'A' or 'B' is parsed by this rule. To access additional information attached to tokens, you can give them names that precede the token followed by a colon:

rule :
    a:A
    { System.out.println("A: "+a.getText()); }
  | b:B
    { System.out.println("B: "+b.getText()); }
  ;

This example prints out the text of token 'A' or 'B' depending on what token we find.

Coming back to XML and DTDs, we can say there is no equivalence to this in standard XML parsing. The DTD (and also Schema) is just a grammar - no annotations, no actions. That is why you have to deal with SAX or DOM to get to your information.

To fix this gap, there are two possible approaches, both of them supported by XPA:

  1. Make XML parsers produce tokens that can be parsed by ANTLR generated token parsers: This effectively doubles the job of the validating XML parser augmented by semantic actions. It is well suited for a processing job.
  2. Make XML parsers generate trees that can be parsed by ANTLR generated tree parsers: This carries on the job of the XML parser and is comparable to the special case in traditional parsing technique where the token parser does nothing, but generate a tree for a tree parser to process. This approach is especially suited for a transformation job.
To make this more understandable, suppose we want to process the following XML fragment.
<component>
  <subcomponent1>Text inside subcomponent1</subcomponent1>
  <subcomponent2>Text inside subcomponent2</subcomponent2>
</component>
Our task might be to access the data inside of element subcomponent1 and subcomponent2. Using XPA as XML interface to ANTLR you can write the following grammar to do this:
component :
    "<component>"
    ( "<subcomponent1>" t1:PCDATA "</subcomponent1>" { System.out.println("Subcomponent1: "+t1.getText()); }
    | "<subcomponent2>" t2:PCDATA "</subcomponent2>" { System.out.println("Subcomponent2: "+t2.getText()); }
    )*
    "</component>"
  ;
Notice that "<component>" and PCDATA are tokens just like 'A' and 'B' in our previous example. By definition PCDATA is the token that represents text when interfaced by XPA. Also by definition tokens for start and end tags have the format <tag-name> resp. </tag-name>. To make it very clear: for every different type of start and end tag in your XML document, there is a token in your ANTLR grammar.
As this form of processing is in competition let us compare this to a straight forward solution with SAX and DOM. We start with SAX and write a handler to process events thrown by the SAX parser. Remember, we need to keep track of the element context ourselves:
    class SAXHandler extends DefaultHandler {
        
        Stack treeStack = new Stack();

	public void characters(char ch[], int start, int length) {
	    String text = new String(ch, start, length);
	    if (treeStack.size() >= 2) {
	        String top = (String)treeStack.pop();
	        String top2 = (String)treeStack.pop();
		treeStack.push(top2);
		treeStack.push(top);
		if (top2.equals("component")) {
		    if (top.equals("subcomponent1")) {
		        System.out.println("Subcomponent1: "+text);
		    } else if (top.equals("subcomponent1")) {
		        System.out.println("Subcomponent1: "+text);
		    }
		}
	    }
	}
   
	public void endElement(String uri, String localName, String qName) {
	    treeStack.pop();
	}
   
	public void startElement(String uri, String localName, String qName, 
				 Attributes attributes) {
            treeStack.push(qName);
	}
    }

Well, this is not what I would call convenient and concise, even though this example is really simple. This shall not be an accusation, as SAX is really meant to be low level. I just wanted to make clear, using SAX directly is not the best choice for accessing your data. Let's now have a look at a more high level solution using the DOM:
Element componentElement = (Element)document.getElementsByTagName("component").item(0);
NodeList subcomponentList1 = componentElement.getElementsByTagName("subcomponent1");
for ( int i = 0; i < subcomponentList1.getLength(); i++ ) {
    Element subcomponentElement = (Element)subcomponentList1.item(i);
    String text = subcomponentElement.getFirstChild().getNodeValue();
    System.out.println("Subcomponent1: "+text);    
}
NodeList subcomponentList2 = componentElement.getElementsByTagName("subcomponent2");
for ( int i = 0; i < subcomponentList2.getLength(); i++ ) {
    Element subcomponentElement = (Element)subcomponentList2.item(i);
    String text = subcomponentElement.getFirstChild().getNodeValue();
    System.out.println("Subcomponent2: "+text);    
}

It is not only that the DOM is a bit cumbersome to use, it also uses quite a lot of memory as the whole XML document has to be kept in memory. This is no problem for our small file, but some files really get large. Also, there is a problem when your file is transmitted over a slow connection. When you start the parse, the next time you get control over what is happening is when the whole tree is constructed. This may take quite some time and blocks processing until the tree is complete.


If memory is not our problem, you can use tree processing which generally more convenient and expressive. Using DOM directly is may not be the best solution, as we have seen in the previous section. XSLT can be an option when your aim is transformation. Now XSLT might be too slow or you do not like the XML syntax or whatever. Consider ANTLR tree parsers that process ASTs created by XPA as an alternative. This is an ANTLR tree grammar that parses our example XML fragment:
component :
    #("<component>"
    ( #("<subcomponent1>" t1:PCDATA { System.out.println("Subcomponent1: "+t1.getText()); } )
    | #("<subcomponent2>" t2:PCDATA { System.out.println("Subcomponent2: "+t2.getText()); } )
    )*
  ;
Now there are special facilities in ANTLR that make very powerful transformations on this tree possible. There is much left to be said about ANTLR tree transformers. I will add a discussion of them later. For now simply download XPA and have a look at the examples.

Back to XPA


Copyright

XPA: Copyright (c) 2003, Oliver Zeigermann

February 2003

www.zeigermann.de