Handling XML in Java

WHAT IS XML?

XML - the Extensible Markup Language - is a "metalanguage" that lets you create and format your own document markup. In other words, it defines the structure of the markup (so that any markup can be processed by a wide variety of software), but it doesn't actually define the element names or their uses. It's a W3C (World Wide Web Consortium) standard.

An XML document contains information held in elements.

An element comprises two tags, a start tag and an end tag, perhaps with further information, such as other elements or character data, between them. An alternative form, used for elements with no content, combines the start and end tags into one. Let's see an example of an element:

        <Language Availability="Open Source">
                <Author name="Rasmus Lerdorf"/>
                A Powerful Scripting Language used to embed programmatic elements
                within an HTML Web page.
        </Language>

Note:

   * Tags are embedded in < and > characters
   * Tags are case sensitive
   * Every open must be matched with a close
   * A close is either a tag with an extra / on the start of the tag name or
     a trailing / on the end of the open tag
   * Elements may have attributes, specified as name=value in the open tag
   * Attribute values must be quoted with single or double quotes
   * Elements may be nested (i.e. one can appear completely within another)
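
These rules can be checked mechanically. What follows is a minimal sketch, assuming the SAX parser that ships with the JDK (reached through JAXP, both described later in this module); the class name WellFormedCheck and its method are our own invention for illustration, not part of any library.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration only.
class WellFormedCheck {

    // Returns true if the text parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;                        // one of the rules above was broken
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);  // setup or I/O failure, not an XML fault
        }
    }

    public static void main(String[] args) {
        // Matched tags, quoted attribute, properly nested
        System.out.println(isWellFormed(
            "<Language Availability=\"Open Source\"><Author name=\"Rasmus Lerdorf\"/></Language>"));
        // Close tag differs in case from the open tag
        System.out.println(isWellFormed("<Language></language>"));
        // Attribute value is not quoted
        System.out.println(isWellFormed("<Language Availability=Open></Language>"));
    }
}
```

The first call should report a well-formed document and the other two should not, since each breaks one of the rules in the list.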

In the examples throughout this section, we're going to use data from the Open Directory Project (www.dmoz.org), which is supplied in XML format and is freely distributable. We have a sample data file of around 20k for our early examples, and a complete database of 750 Mbytes for the later examples.

Here's a snippet of XML from the ODP:

<ExternalPage about="http://www.wellho.co.uk">
  <d:Title>Well House Consultants Ltd.</d:Title>
  <d:Description>Tcl, PHP, Perl and Java training; regular scheduled
courses in Wiltshire, Southern England, or we come on site
anywhere in the UK or Ireland.
  </d:Description>
</ExternalPage>

Notice the "d:" in front of many of the tags. Tags can be divided into groups using a different prefix for each group; these groupings are known as namespaces.
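
To see how a parser treats such prefixes, here is a small sketch using the JDK's bundled SAX parser with namespace awareness switched on. It's an illustration under assumptions: the class PrefixDemo is ours, and the URI http://example.org/d is a placeholder standing in for whatever URI the real data file binds the "d" prefix to.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the namespace URI used in main is a placeholder.
class PrefixDemo {

    // Lists each element as "name as written -> {namespace URI}local name".
    static List<String> describeElements(String xml) {
        List<String> found = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);   // ask the parser to resolve prefixes
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        found.add(qName + " -> {" + uri + "}" + localName);
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<ExternalPage xmlns:d=\"http://example.org/d\""
                   + " about=\"http://www.wellho.co.uk\">"
                   + "<d:Title>Well House Consultants Ltd.</d:Title>"
                   + "</ExternalPage>";
        for (String line : describeElements(xml)) {
            System.out.println(line);
        }
    }
}
```

The prefix is only shorthand; it's the URI it is bound to that identifies the group, which is why the handler is given both.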

DEFINING A DOCUMENT TYPE

XML is extremely flexible, to the extent that what's valid in one document may be invalid in another, and elements may have differing meanings and valid attributes between documents. In order for "validating parsers" to flag errors, you may use Document Type Definitions (known as DTDs) to place constraints on how your document looks. Note that a DTD isn't a complete definition; for example, it offers only coarse control over how many times an element may occur (optional, exactly one, or repeating), and it can't constrain data types such as numbers or dates.
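
As an illustration of a validating parser at work, the sketch below embeds a tiny DTD in the document itself and collects the validity errors that the JDK's bundled parser reports. The DTD, the class name DtdValidationDemo and the method names are all made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the DTD below is invented for illustration.
class DtdValidationDemo {

    // A Language element must hold exactly one Author, which must have a name.
    static final String DTD =
          "<!DOCTYPE Language [\n"
        + "  <!ELEMENT Language (Author)>\n"
        + "  <!ELEMENT Author EMPTY>\n"
        + "  <!ATTLIST Author name CDATA #REQUIRED>\n"
        + "]>\n";

    // Returns one message per validity error; an empty list means "valid".
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);       // turn on DTD validation
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        problems.add("line " + e.getLineNumber() + ": " + e.getMessage());
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(validate(DTD + "<Language><Author name=\"Rasmus Lerdorf\"/></Language>"));
        System.out.println(validate(DTD + "<Language><Author/><Author/></Language>"));
    }
}
```

Both documents are well formed; only the second breaks the DTD's rules, so only the second should produce messages.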

TOOLS FOR FORMATTING DATA HELD IN XML DOCUMENTS

Frequently, you'll want to format an XML document for presentation to a reader. It might be in HTML, in plain text, or in some other format. The Extensible Stylesheet Language (XSL) lets you define ways of formatting your XML information for presentation, through XSL Transformations (XSLT) or XSL Formatting Objects (XSL-FO). An XSLT stylesheet is written to define your translation. It should come as no surprise that such a stylesheet is itself an XML document, although one with specific rules (i.e. with a DTD) about what may and may not be included.
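
By way of illustration, here's a sketch of a very small XSLT stylesheet applied from Java through the javax.xml.transform classes supplied with the JDK. The stylesheet, the class name XsltDemo and the input document are assumptions made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; the stylesheet is invented for illustration.
class XsltDemo {

    // Turns a Language element into one line of plain text.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/Language\">"
        + "Author: <xsl:value-of select=\"Author/@name\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the given stylesheet to the given XML and returns the result.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Language><Author name=\"Rasmus Lerdorf\"/></Language>";
        System.out.println(transform(xml, STYLESHEET));
    }
}
```

Note that the stylesheet is ordinary XML: its own elements live in the "xsl" namespace, and anything else in a template is copied to the output.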

XML PARSERS

Whatever language you're using to handle XML documents, and whether or not you're using DTD and XSLT, it's highly unlikely that you'll be writing low-level calls to actually translate the text. Someone else has already written that code, and you'll simply call it in through function or subroutine or method calls within your program. This doesn't remove the need for you to understand the concepts of XML, nor for you to be able to read and correct XML for software and data testing and maintenance purposes.

Usually, the code that translates XML is structured in the form of a PARSER - an object which passes through an incoming XML stream and runs callback methods when it encounters certain types of tag (or data between tags), or translates from XML into an object structure such as DOM.

There are a number of parsers available, including:

Apache Xerces	http://xml.apache.org
Expat (James Clark)	http://www.jclark.com/xml/xp
Sun Crimson	http://xml.apache.org/crimson
IBM XML.4J	http://alphaworks.ibm.com/tech/xml4j
Microsoft MSXML	http://msdn.microsoft.com/xml/default.asp

and you'll need one of these in order to have the actual parsing work done for you. Which to choose? Xerces comes highly recommended and is Open Source, and the author of this module has made heavy use of Expat in "another life" and found it good. Sun's product is said to conform well to the standards, but the Microsoft parser has a reputation for not conforming well to the XML standard, although numerous issues have been addressed in the latest release.

There are three widely used parser APIs for handling XML in Java:

		DOM 		SAX		 JDOM

There's also an API known as JAXP for handling XML. Please be aware that JAXP is NOT an XML parser itself - it requires one of the other parsers to be present to do the low-level work.

DOM and SAX classes should be included with any parser you download; JAXP will be there with most of them. JDOM tends not to be there, but is available from http://www.jdom.org

Head reeling? It is for a lot of people. Chances are that if you're interested in XML from Java, you're writing server side applications that will serve HTML output from an XML resource - in other words, XSLT, Servlets and JSP will also be required. Those technologies involve further software / classes, so you'll be looking at Tomcat (Apache) or WebSphere (IBM), and more.

Sun are now offering a "Java Server Web Package" for download from http://www.javasoft.com, designed to include all the elements that you'll need on top of the main Java release - they're packaging not only their own software, but also appropriate Open Source elements such as Tomcat. At the time of writing, the bundle is in "Beta Release" format, and requires the JSDK 1.3.1 or later to be present.

Having got past the various component elements, we can move on to have a look at the individual parser classes.

THE SAX PARSER

Version 2, from summer 2000, has now largely superseded version 1 and the examples here are using that version.

THE BASIC STRUCTURE

You'll start by creating your Parser object (known as an XMLReader), and specifying the underlying parser that's to be used:

String vendorParserClass = "org.apache.xerces.parsers.SAXParser";
XMLReader reader =
    XMLReaderFactory.createXMLReader(vendorParserClass);

You'll then do the actual parsing:

InputSource source = new InputSource(args[0]);
reader.parse(source);

We've simply taken the first command line parameter and thrown it into SAX here; you can specify any URI, so a local file of XML or a remote object via HTTP or FTP are all equally available to you.

The parse method actually does the hard work of parsing ... but as we haven't put any callbacks in place, the hard work in this very first example is all for nought.

Two possible exceptions can be thrown:

IOException	if there's trouble reading the document file (java.io.IOException)
SAXException	if problems occur during parsing (org.xml.sax.SAXException)
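
Putting those pieces together, the sketch below shows one way of catching the two exceptions separately. It's written under an assumption: rather than naming the Xerces class as above, it asks JAXP (described later) for whatever parser the JDK supplies, so that it runs without extra downloads. The class name BasicSaxParse and the status strings are our own.

```java
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical demo class for illustration only.
class BasicSaxParse {

    // Parses the source and reports which (if either) kind of failure occurred.
    static String tryParse(InputSource source) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.parse(source);   // no callbacks registered, so nothing is kept
            return "parsed";
        } catch (IOException e) {
            return "I/O problem: " + e.getMessage();   // could not read the document
        } catch (SAXException e) {
            return "XML problem: " + e.getMessage();   // could not parse the document
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);        // no usable parser available
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("Usage: java BasicSaxParse <file-or-URI>");
            return;
        }
        System.out.println(tryParse(new InputSource(args[0])));
    }
}
```

With no error handler registered, the bundled parser may also print its own message to the error stream before the SAXException reaches your code.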

THE CONTENTHANDLER INTERFACE

In order for us to be able to DO something with our parsed XML, we need to define a content handler - a class which implements the ContentHandler interface. To our main parser class, we'll add:

reader.setContentHandler(new DemoContentHandler());

and we'll define our class:

class DemoContentHandler implements ContentHandler {

public DemoContentHandler () {

}

[etc]

}

Do note that the constructor is not defined within the interface, so that you can roll your own with a number of parameters; this is how you can pass a structure (probably a collection object) into the ContentHandler so that it can save away the parser's results into something usable from your code that's later in the application.
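
Here's a sketch of that idea: a handler whose constructor takes a List, into which it saves the name of every element it meets. To keep it short, it assumes you're happy to extend org.xml.sax.helpers.DefaultHandler (which implements ContentHandler with "do nothing" methods) rather than implement all the methods yourself; CollectingHandler and CollectingDemo are names invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: saves element names into a list supplied by the caller.
class CollectingHandler extends DefaultHandler {
    private final List<String> names;

    // The constructor is ours to define, so we use it to pass in the collection.
    CollectingHandler(List<String> names) {
        this.names = names;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        names.add(qName);
    }
}

// Hypothetical driver class showing the handler in use.
class CollectingDemo {

    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new CollectingHandler(names));
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;   // filled in by the handler during the parse
    }

    public static void main(String[] args) {
        System.out.println(elementNames(
            "<Language><Author name=\"Rasmus Lerdorf\"/></Language>"));
    }
}
```

After the parse has finished, the calling code still holds the list, so the results outlive the handler.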

There are 11 methods to be implemented for the ContentHandler interface, all of them except setDocumentLocator being declared to throw a SAXException. All are public and return void. They are:

setDocumentLocator(Locator locator)
startDocument()
endDocument()
startPrefixMapping(String prefix, String uri)
endPrefixMapping(String prefix)
startElement(String namespaceURI, String localName, String qName, Attributes atts)
endElement(String namespaceURI, String localName, String qName)
characters(char ch[], int start, int length)
ignorableWhitespace(char ch[], int start, int length)
processingInstruction(String target, String data)
skippedEntity(String name)

The DocumentLocator is useful for debugging - it lets you find out later, from within your content handler, which line of your XML you're parsing (in the case of errors, for example).

private Locator locator;
public void setDocumentLocator(Locator locator) {
this.locator=locator;
}
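
As a sketch of the saved locator being put to use, the handler below notes the line on which each element starts. It again assumes DefaultHandler as a base class and the JDK's bundled parser; LineReportingHandler and its locate method are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records where in the document each element was found.
class LineReportingHandler extends DefaultHandler {
    private Locator locator;
    private final List<String> report = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;   // saved for use in the later callbacks
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // A parser isn't obliged to supply a locator, so allow for its absence
        int line = (locator == null) ? -1 : locator.getLineNumber();
        report.add(qName + " at line " + line);
    }

    // Convenience method for the demo: parse a string and return the report.
    static List<String> locate(String xml) {
        LineReportingHandler handler = new LineReportingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.report;
    }

    public static void main(String[] args) {
        for (String entry : locate("<Language>\n  <Author name=\"x\"/>\n</Language>")) {
            System.out.println(entry);
        }
    }
}
```

The same locator can be consulted from any of the other callbacks, which is handy when you need to tell a user where suspect data came from.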

To start with, you'll probably want to learn by "dummying out" most of the other methods by providing "do nothing" methods, and then implement them by adding code to them one at a time. A couple are worthy of further comment: startPrefixMapping and endPrefixMapping relate to XML namespaces, and are called BEFORE the start of the element that declares a prefix and AFTER the end of that element. processingInstruction reports processing instructions (of the form <?target data?>), and skippedEntity reports entity references that the parser has chosen not to expand; both are beyond the scope of our first example.

And you're left with
- start and end document handlers
- start and end element handlers
- character handler
- ignorable white space handler
to write.
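
Here's a sketch of those handlers working together to pull the text out of each d:Title element in the ODP snippet shown earlier. It assumes DefaultHandler as the base class (so the remaining methods are already dummied out) and a parser that isn't namespace aware, so that "d:Title" arrives as the qName; TitleCollector is a made-up name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers the text content of every d:Title element.
class TitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startDocument() {
        titles.clear();                   // fresh start for each document
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        text.setLength(0);                // forget text seen before this tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text, so accumulate
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("d:Title".equals(qName)) {
            titles.add(text.toString().trim());
        }
    }

    // Convenience method for the demo: parse a string and return the titles.
    static List<String> titlesIn(String xml) {
        TitleCollector handler = new TitleCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.titles;
    }

    public static void main(String[] args) {
        System.out.println(titlesIn(
            "<ExternalPage about=\"http://www.wellho.co.uk\">"
          + "<d:Title>Well House Consultants Ltd.</d:Title>"
          + "</ExternalPage>"));
    }
}
```

The point to take away is that characters() only hands you a slice of a buffer; it's up to the start and end element handlers to decide which element the text belongs to.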

The Structure of an XML tree is very similar to the structure of a JTree widget (Java Swing classes), and if you're familiar with the JTree, there's an excellent example in "Java and XML" by Brett McLaughlin, published by O'Reilly (chapter 3).

OTHER INTERFACES WITHIN SAX

As well as a ContentHandler, you should provide an ErrorHandler for use in practical applications (you do need to check your XML, don't you?).

Other interfaces also available are:

DTDHandler	To be told about notations and unparsed entities declared in a DTD
EntityResolver	To handle external entities referred to from within the XML
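
As a sketch of an ErrorHandler, the class below implements the interface's three methods (warning, error and fatalError) and keeps a note of everything reported. The names CollectingErrorHandler and ErrorHandlerDemo are ours, and the JDK's bundled parser is assumed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical handler: remembers each problem the parser reports.
class CollectingErrorHandler implements ErrorHandler {
    final List<String> messages = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        messages.add("warning, line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        // Recoverable, e.g. a validity error; parsing carries on
        messages.add("error, line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        // Not well formed; record it, then stop the parse
        messages.add("fatal, line " + e.getLineNumber() + ": " + e.getMessage());
        throw e;
    }
}

// Hypothetical driver class showing the handler in use.
class ErrorHandlerDemo {

    static List<String> check(String xml) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Already recorded by the handler's fatalError method
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.messages;
    }

    public static void main(String[] args) {
        System.out.println(check("<Language><Author></Language>"));
    }
}
```

Whether to rethrow from error() as well is a design choice; here we let the parse continue so that as many problems as possible are gathered in one pass.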

THE DOM PARSER

The Document Object Model (DOM) is a definition of an Object or Structure to hold data. It's arranged into a hierarchy of nodes, which can be navigated (when used in Java) with methods such as getChildNodes() and getParentNode().

Great - fabulous idea - we can take more or less any data tree we like and hold it in a model in our program, and write code to navigate through our model. We can make changes to our model too as we run our application.

PARSING WITH DOM

Parsing with DOM is really simple:

import org.w3c.dom.Document;
import org.apache.xerces.parsers.DOMParser;

DOMParser parser = new DOMParser();
parser.parse("testfile.xml");
Document doc = parser.getDocument();

Yes, that's it! You may ask "Why not always use DOM in preference to SAX? Parsing is so much simpler!" Alas, the real work of coding with DOM comes when you use the Object Model - coding there can be substantial. You also have to consider efficiency, as there's an extra level of conversion involved. The size of the model is an important factor too: DOM holds the whole document in memory, so with a very large file of XML (and remember that the DMOZ is an 800 Mbyte XML document, for example), the model may simply not fit into your computer.
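
The snippet above depends on the Xerces DOMParser class being installed. As an alternative sketch, assuming only the JDK is to hand, the same job can be done through JAXP's DocumentBuilder; the class name DomParseDemo is invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
class DomParseDemo {

    // Parses XML text into a DOM Document held entirely in memory.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<Language Availability=\"Open Source\"><Author name=\"Rasmus Lerdorf\"/></Language>");
        // The document's single top-level element
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

To parse a file rather than a string, DocumentBuilder also has a parse method that takes a java.io.File.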

DOM IN JAVA - A PRIMER

This module really isn't about the DOM ... it's about XML. But here are some basics from a Java perspective:

DOM is a tree of nodes, the top node being the document itself. Thus (following on from our code above) we can write:

myNodeHandler(doc,0);

and

void myNodeHandler(Node current,int level) {
int currentlevel = level + 1;

And then we can branch based on the type of node that we have (yes, I know it's a document node at present, but this code will become general!)

switch (current.getNodeType()) {

case Node.DOCUMENT_NODE:
case Node.ELEMENT_NODE:
case Node.TEXT_NODE:
case Node.CDATA_SECTION_NODE:
case Node.COMMENT_NODE:
case Node.PROCESSING_INSTRUCTION_NODE:
case Node.ENTITY_REFERENCE_NODE:
case Node.DOCUMENT_TYPE_NODE:

}

If we now look at an element node, we can look into what THAT contains:

  NodeList children = current.getChildNodes();
  for (int i=0; i<children.getLength(); i++) {
   myNodeHandler(children.item(i),currentlevel);
   }

and thus traverse our whole tree. Walking every node of a document in this way is the basis of SERIALISATION, in which each node is written out as it's visited. Note that it's using recursion in Java - myNodeHandler calls itself; we've used the currentlevel variable to tell us how many levels deep we are (in the skeleton above, we really don't need that variable, but we will require it later if, for example, we want to draw out the DOM as an indented tree structure).
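
The fragments above can be assembled into something runnable. The sketch below is one possible arrangement, not the only one: it walks the tree recursively and uses the level to draw an indented outline. DomTreeWalker, walk and outline are hypothetical names, and the document is parsed with the JDK's own DocumentBuilder.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
class DomTreeWalker {

    // Appends one line for this node, then recurses into its children.
    static void walk(Node current, int level, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < level; i++) {
            indent.append("  ");
        }

        switch (current.getNodeType()) {
            case Node.DOCUMENT_NODE:
                out.append(indent).append("#document\n");
                break;
            case Node.ELEMENT_NODE:
                out.append(indent).append(current.getNodeName()).append('\n');
                break;
            case Node.TEXT_NODE:
                String text = current.getNodeValue().trim();
                if (!text.isEmpty()) {           // skip whitespace-only text
                    out.append(indent).append('"').append(text).append("\"\n");
                }
                break;
            default:
                break;                           // other node types ignored here
        }

        NodeList children = current.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), level + 1, out);
        }
    }

    // Parses the text and returns the indented outline.
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline("<Language><Author>Rasmus Lerdorf</Author></Language>"));
    }
}
```

Each level of nesting adds two spaces of indent, so the shape of the output mirrors the shape of the document.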

Element Nodes can also include data in the form of attributes, so in addition to the recursion handler you'll want to deal with them:

NamedNodeMap attribs = current.getAttributes();
for (int i=0; i<attribs.getLength();i++){
  Node now = attribs.item(i);
  doSomethingWith(now.getNodeName(),
    now.getNodeValue());
  }
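
A runnable version of that loop might look like the sketch below, which gathers an element's attributes into a sorted Map. One thing to be aware of is that getAttributes() returns null for nodes that aren't elements, so the sketch takes an Element rather than a Node. AttributeDemo and its methods are names made up for the example.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
class AttributeDemo {

    // Copies an element's attributes into a map sorted by attribute name.
    static Map<String, String> attributesOf(Element element) {
        Map<String, String> result = new TreeMap<>();
        NamedNodeMap attribs = element.getAttributes();
        for (int i = 0; i < attribs.getLength(); i++) {
            Node now = attribs.item(i);          // each attribute is itself a Node
            result.put(now.getNodeName(), now.getNodeValue());
        }
        return result;
    }

    // Parses the text and returns its top-level element.
    static Element rootOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attributesOf(
            rootOf("<Language Availability=\"Open Source\" id=\"1\"/>")));
    }
}
```

A NamedNodeMap makes no promise about the order of its items, which is why the sketch sorts them before use.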

Handling other nodes is straightforward by comparison to the ELEMENT node, for example:

case Node.TEXT_NODE:
reportText(current.getNodeValue());
break;

Note - doSomethingWith and reportText are methods you'll need to write yourself, depending on what you actually wish to do with the result of parsing the DOM tree.

As you get more advanced, you'll find there are methods that let you modify the DOM tree, and even create your own. With careful serialisation using the techniques outlined above, you'll then be able to write your own modified XML back out from your Java application, thus completing the cycle.
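
As a sketch of that full cycle, the example below parses a document, adds an element to the tree, and writes the result back out as XML. Please note the swap of technique: rather than a hand-written serialiser of the kind outlined above, it uses the "identity" Transformer from javax.xml.transform, which the JDK supplies. DomRoundTrip and its method are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
class DomRoundTrip {

    // Parses the XML, appends an Author element to the root, and re-serialises.
    static String addAuthor(String xml, String authorName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            // Modify the tree in memory
            Element author = doc.createElement("Author");
            author.setAttribute("name", authorName);
            doc.getDocumentElement().appendChild(author);

            // A Transformer created without a stylesheet copies input to output
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(addAuthor("<Language/>", "Rasmus Lerdorf"));
    }
}
```

Writing your own serialiser gives you full control over layout; the identity transform is simply the shortest route when the default layout will do.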

THE JDOM PARSER

JDOM was built specifically for Java. It isn't an implementation of the W3C DOM, but it builds a very similar tree of objects, so most of what you've just learnt about DOM carries over.

Instead of using NodeList and NamedNodeMap classes (as done in DOM), you'll find that JDOM uses Java collections such as java.util.List and java.util.Map.

THE JAXP API

JAXP (Java API for XML Parsing) provides a layer over the existing APIs to allow you to write vendor-neutral parsing code. This was a thin layer at JAXP 1.0; it was thickened somewhat at JAXP 1.1.
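
To show what "vendor-neutral" means in practice, here's a sketch in which no parser class is named anywhere in the code: SAXParserFactory finds whichever implementation is installed. The class name JaxpDemo and the element-counting task are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class for illustration only.
class JaxpDemo {

    // Counts the elements in a document using whatever parser JAXP locates.
    static int countElements(String xml) {
        final int[] count = {0};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        count[0]++;
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Shows which vendor's parser JAXP found on this system
        System.out.println(
            SAXParserFactory.newInstance().newSAXParser().getClass().getName());
        System.out.println(countElements("<a><b/><b/></a>"));
    }
}
```

Swapping in a different parser then becomes a matter of configuration rather than of changing and recompiling your code.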

See also Deploying Java web services (including XML)