george.gu

XML Parser:DOM + XPath


There are many kinds of XML Parsers in Java:

  1. DOM (the DOM implementation embedded in the JDK)
  2. SAX
  3. JDOM (an alternative to DOM and SAX)
  4. Digester (Jakarta Commons Digester)
  5. JAXB (OXM; JDK 1.6 embeds a JAXB 2.0 implementation)
  6. dom4j
  7. Xerces
  8. KXML
  9. ...
In fact, you can find more if you google "xmlparser java". However, I only list the ones I know or have used in previous projects. I will talk about them one by one; this post covers only DOM + XPath.

As usual, why do I want to talk about such an "OLD" question? The fact is that I am refactoring a platform that parses the OMADM DDF DTD with DOM, and I could not answer this question myself:
  • What's the difference between a DOM Node and a DOM Element?

Node Types

After studying some documents that I am 100% sure I had read several years ago, I got the old answer again:
  • The Node object represents a single node in the document tree.
  • There are many types of Node, each representing a specific part of an XML document's structure.

NodeType     | Description                                                                          | Children Nodes
-------------+--------------------------------------------------------------------------------------+------------------------------------------------------------------------------
Element      | Represents an element                                                                | Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference
Attr         | Represents an attribute                                                              | Text, EntityReference
Text         | Represents textual content in an element or attribute                                | None
CDATASection | Represents a CDATA section in a document (text that will NOT be parsed by a parser)  | None
Document     | Represents the entire document (the root-node of the DOM tree)                       | Element (max. one), ProcessingInstruction, Comment, DocumentType

Node Types table (description and relationship with each other)

NodeType     | Named Constant     | NodeType Constant | getNodeName() return | getNodeValue() return
-------------+--------------------+-------------------+----------------------+----------------------
Element      | ELEMENT_NODE       | 1                 | Element name/tagName | null
Attr         | ATTRIBUTE_NODE     | 2                 | Attribute name       | Attribute value
Text         | TEXT_NODE          | 3                 | #text                | content of node
CDATASection | CDATA_SECTION_NODE | 4                 | #cdata-section       | content of node
Document     | DOCUMENT_NODE      | 9                 | #document            | null

Node Types table (basic properties)
  • Element is a kind of Node; from the Java point of view, Element is a sub-interface of Node.
  • If a Node has nodeType == 1 (Node.ELEMENT_NODE), it is an Element.
  • Element.getTagName() returns the same value as Element.getNodeName().
Here I only list the node types commonly used in projects. For details on the other node types, please refer to the W3Schools reference: http://www.w3schools.com/dom/dom_nodetype.asp.
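
The rows of the Node Types table can be checked directly in code. The following is a minimal sketch, not taken from the original project: the class name NodeTypeDemo, the helper methods and the sample XML are all made up for illustration. It switches on getNodeType() and casts to Element only when the type is ELEMENT_NODE.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class; names are illustrative only.
final class NodeTypeDemo {

    // Describes a node using the type constants from the Node Types table.
    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                // Safe cast: nodeType == 1 means the node is an Element.
                Element element = (Element) node;
                return "Element <" + element.getTagName() + ">";
            case Node.ATTRIBUTE_NODE:
                return "Attr " + node.getNodeName() + "=" + node.getNodeValue();
            case Node.TEXT_NODE:
                return "Text '" + node.getNodeValue() + "'";
            case Node.CDATA_SECTION_NODE:
                return "CDATA '" + node.getNodeValue() + "'";
            case Node.DOCUMENT_NODE:
                return "Document";
            default:
                return "Other (type " + node.getNodeType() + ")";
        }
    }

    // Parses an XML string; checked exceptions are wrapped to keep the sketch short.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<book id=\"1\">DOM</book>");
        Element root = doc.getDocumentElement();
        System.out.println(describe(doc));                         // Document
        System.out.println(describe(root));                        // Element <book>
        System.out.println(describe(root.getAttributeNode("id"))); // Attr id=1
        System.out.println(describe(root.getFirstChild()));        // Text 'DOM'
    }
}
```

Note that the attribute is reached through getAttributeNode() rather than getChildNodes(): Attr nodes are not children of their Element.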

JDK embedded DOM Parser

Normally we can get the entire XML Document object with the following Java code:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setNamespaceAware(false);
// factory.setSchema (myschema);
DocumentBuilder parser = factory.newDocumentBuilder();
// parser.setEntityResolver (new MyEntityResolver ());
// parser.setErrorHandler (new MyParseErrorHandler ());
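// parse() accepts an InputStream, a URI String, a File, or an InputSource;
// "input.xml" is only a placeholder file name.
org.w3c.dom.Document document = parser.parse(new java.io.File("input.xml"));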

Then you can call org.w3c.dom.Document.getDocumentElement() to get the root element. Why? From the Document row of the Node Types table, we know that a Document represents the entire document and contains at most one Element node. So Document.getDocumentElement() returns exactly that root element.

Once we have the root element, we can traverse it to parse the XML data.
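
As a small illustration of why getDocumentElement() is preferable to taking the first child of the Document, here is a sketch under assumed names (RootElementDemo, load and the sample XML are hypothetical). The XML starts with a comment, which becomes a child of the Document alongside the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are illustrative only.
final class RootElementDemo {

    // Parses an XML string with the same factory settings as the snippet above.
    static Document load(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            factory.setNamespaceAware(false);
            DocumentBuilder parser = factory.newDocumentBuilder();
            return parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Checked exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException("Cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The comment before <bookstore> is a child of the Document, not the root element.
        Document document = load("<!-- catalog --><bookstore><book/></bookstore>");
        Element root = document.getDocumentElement();
        System.out.println(root.getTagName());                      // bookstore
        System.out.println(document.getFirstChild().getNodeName()); // #comment
        System.out.println(document.getChildNodes().getLength());   // 2
    }
}
```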

Useful interfaces for DOM parser

Only the key, frequently used methods are listed here; for other methods, please refer to the Javadoc.

 

Node.getNodeType(): short

Get the node type (see the Node Types table above). It is very useful for your parser, because each node type exposes its data through different methods.

Node.getChildNodes(): NodeList

Get a NodeList that contains all children of this node.

Node.getAttributes(): NamedNodeMap

Get a NamedNodeMap containing the attributes of this node (if it is an Element) or null otherwise.

Node.getNodeName(): String

See the Node Types table above.

Node.getNodeValue(): String

See the Node Types table above.

Node.getTextContent(): String

Returns the text content of this node and its descendants.
If the current Node is the root element, getTextContent() returns all the text content inside the document.

Element.getElementsByTagName(String name): NodeList

Returns a NodeList of all descendant Elements with a given tag name, in document order.
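
The methods listed above can be tried together on a small document. This is only a sketch: the class DomMethodsDemo, the helper textOf and the sample XML are assumptions made for illustration. The sample XML deliberately contains no whitespace between tags, so that getFirstChild() returns an element rather than a whitespace Text node.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are illustrative only.
final class DomMethodsDemo {

    static final String XML =
            "<bookstore>"
            + "<book lang=\"en\"><title>DOM</title><author>Ann</author></book>"
            + "<book lang=\"fr\"><title>XPath</title><author>Bob</author></book>"
            + "</bookstore>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse XML", e);
        }
    }

    // Collects the text of every descendant element with the given tag name.
    static List<String> textOf(Element scope, String tagName) {
        List<String> result = new ArrayList<>();
        NodeList nodes = scope.getElementsByTagName(tagName);
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) {
        Element root = parse(XML).getDocumentElement();
        System.out.println(textOf(root, "author"));   // [Ann, Bob]
        System.out.println(root.getTextContent());    // DOMAnnXPathBob

        Node firstBook = root.getFirstChild();
        NamedNodeMap attrs = firstBook.getAttributes();
        System.out.println(attrs.getNamedItem("lang").getNodeValue()); // en

        // A Text node is not an Element, so getAttributes() returns null.
        Node titleText = firstBook.getFirstChild().getFirstChild();
        System.out.println(titleText.getAttributes()); // null
    }
}
```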

Introspect XML with DOM defined methods

We can now parse the XML document node by node with the interfaces above:

Step 1: Use DocumentBuilder to load the XML as a Document object.
Step 2: Get the root element and its list of child nodes.
Step 3: Loop over each node in the child list (upper level), check its node type, and parse it with your business logic. If the current node has child nodes (lower level), pause parsing of the current node and loop over the lower-level child nodes until all of them are processed, then resume processing at the upper level.
Step 4: When all child nodes have been processed, you have reached the end of the XML document.

This is just a draft summary and will be updated later.
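
The steps above map naturally onto a recursive method. Here is a minimal sketch, assuming the "business logic" is simply to record the path of every element; the class DomWalker, its methods and the Node/NodeName sample XML are hypothetical and only loosely modelled on a self-referencing structure like Node(Node+).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are illustrative only.
final class DomWalker {

    // Step 1: load the XML as a Document (checked exceptions wrapped for brevity).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse XML", e);
        }
    }

    // Step 3: depth-first walk. Records the element's path, then descends into its children.
    static void walk(Node node, String parentPath, List<String> out) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // this sketch ignores text, comments and other node types
        }
        String path = parentPath + "/" + node.getNodeName();
        out.add(path); // placeholder for real business logic
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            // The upper level pauses here and resumes when the call returns.
            walk(children.item(i), path, out);
        }
    }

    // Step 2 and step 4: start at the root element and return when the walk is complete.
    static List<String> paths(String xml) {
        List<String> out = new ArrayList<>();
        walk(parse(xml).getDocumentElement(), "", out);
        return out;
    }

    public static void main(String[] args) {
        String xml = "<Node><NodeName>A</NodeName><Node><NodeName>B</NodeName></Node></Node>";
        paths(xml).forEach(System.out::println);
        // /Node
        // /Node/NodeName
        // /Node/Node
        // /Node/Node/NodeName
    }
}
```

Recursion keeps the "pause and resume" bookkeeping on the call stack, which is convenient for self-referencing elements, at the cost of stack depth on very deeply nested documents.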

Locate Node with XPath

With the previous design, we have to loop over a lot of nodes just to get the text of a single element, such as:
/bookstore/category/country/book/author
Thanks to XPath, we can locate an element easily by specifying it the way we specify a file path in a file system.
For more details on XPath syntax, please refer to W3Schools: http://www.w3schools.com/xpath/default.asp.

You can create an XPath instance as follows:
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath(); // Create a new XPath instance

Then you can get a NodeList of all elements with the tag "author":
XPathExpression xexpr = xpath.compile("/bookstore/category/country/book/author");
NodeList nodes = (NodeList) xexpr.evaluate(document, XPathConstants.NODESET);

You can also get a single node as follows:
Node node = (Node) xexpr.evaluate(document, XPathConstants.NODE);
XPathConstants.NODE returns only the first matching node in document order, so if you are not sure whether the target node is unique, get a NodeList instead of a single Node.
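
Putting the XPath snippets together, a self-contained sketch could look like the following. The class XPathDemo, its helper methods and the sample bookstore XML are assumptions for illustration; only the XPath expression comes from the text above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are illustrative only.
final class XPathDemo {

    static final String AUTHOR_PATH = "/bookstore/category/country/book/author";

    static final String XML =
            "<bookstore><category><country>"
            + "<book><author>Ann</author></book>"
            + "<book><author>Bob</author></book>"
            + "</country></category></bookstore>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse XML", e);
        }
    }

    // NODESET: returns the text of every matching author element.
    static List<String> authors(Document document) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            XPathExpression xexpr = xpath.compile(AUTHOR_PATH);
            NodeList nodes = (NodeList) xexpr.evaluate(document, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath expression", e);
        }
    }

    // NODE: returns the first match only, or null when nothing matches.
    static String firstAuthor(Document document) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate(AUTHOR_PATH, document, XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath expression", e);
        }
    }

    public static void main(String[] args) {
        Document document = parse(XML);
        System.out.println(authors(document));     // [Ann, Bob]
        System.out.println(firstAuthor(document)); // Ann
    }
}
```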

Advantages and Disadvantages:

As we can see from the DOM parsing methods, DOM loads the whole document into memory so that you can loop over its nodes easily. This can become an issue when you design a system that exchanges big XML document files. In that case, Digester or other SAX-based parsers could be an alternative.

But I still think DOM provides a flexible way to parse XML definitions with many self-referencing elements, such as the OMADM DDF node: Node(Node+). Using DOM, we can write our own recursive parser, as described in the chapter "Introspect XML with DOM defined methods".
Maybe some other parser has a better solution with specific path expressions that I do not know about. We will see.
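
For the big-document case mentioned above, a SAX handler processes elements as they stream past instead of building a tree. This is only a rough sketch of that alternative, not a recommendation for any particular project; the class SaxCountDemo and its count method are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names are illustrative only.
final class SaxCountDemo {

    // Counts elements with the given tag name without keeping the document in memory.
    static int count(String xml, String tagName) {
        final int[] counter = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // Called once per opening tag, in document order.
                if (qName.equals(tagName)) {
                    counter[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            // Checked exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException("Cannot parse XML", e);
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(count("<bookstore><book/><book/></bookstore>", "book")); // 2
    }
}
```

The trade-off is visible in the code: the handler sees one event at a time, so any context (parent element, nesting depth) has to be tracked by hand, which is what DOM provides for free.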