XML Parser: DOM + XPath

george.gu

There are many XML parsers and parsing APIs available in Java:

DOM (the JDK embeds a DOM implementation)
SAX
JDOM (an alternative to DOM and SAX)
Digester (Jakarta Commons Digester)
JAXB (OXM; JDK 1.6 embeds a JAXB 2.0 implementation)
dom4j
Xerces
KXML
...

In fact, you can find more if you google "xml parser java". I list only the ones I know or have used in previous projects. I will talk about them one by one; this post covers only DOM + XPath.

As usual: why talk about such an "old" topic? The fact is that I am refactoring a platform that parses the OMADM DDF DTD with DOM, and I could not answer this question for myself:

What's the difference between DOM Node and DOM Element?

Node Types

After rereading some documents that I am 100% sure I read several years ago, I found the old answer again:

The Node object represents a single node in the document tree.
There are many types of Node, each representing a specific part of the structure of an XML document.

NodeType	Description	Children Nodes
Element	Represents an element	Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference
Attr	Represents an attribute	Text, EntityReference
Text	Represents textual content in an element or attribute	None
CDATASection	Represents a CDATA section in a document (text that will NOT be parsed by a parser)	None
Document	Represents the entire document (the root-node of the DOM tree)	Element (max. one), ProcessingInstruction, Comment, DocumentType

Node Types table (description and relationship with each other)

NodeType	Named Constant	NodeType Constant	getNodeName() return	getNodeValue() return
Element	ELEMENT_NODE	1	Element name (tag name)	null
Attr	ATTRIBUTE_NODE	2	Attribute name	Attribute value
Text	TEXT_NODE	3	#text	content of node
CDATASection	CDATA_SECTION_NODE	4	#cdata-section	content of node
Document	DOCUMENT_NODE	9	#document	null

Node Types table (basic properties)

Element is a kind of Node: from the Java point of view, org.w3c.dom.Element is a sub-interface of org.w3c.dom.Node.
If a Node's getNodeType() returns 1 (Node.ELEMENT_NODE), it is an Element.
For an Element, getTagName() returns the same value as getNodeName().
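
To make the Node versus Element distinction concrete, here is a minimal sketch. The class name, the helper methods and the sample XML are made up for illustration; only the org.w3c.dom and JAXP calls come from the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeVersusElement {

    // True only when the node's type constant says it is an element.
    static boolean isElement(Node node) {
        return node.getNodeType() == Node.ELEMENT_NODE; // ELEMENT_NODE == 1
    }

    // Hypothetical helper: parses a small XML string and returns its root element.
    static Element rootOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Element book = rootOf("<book id=\"b1\">DOM basics</book>");
        Node text = book.getFirstChild();        // the Text node inside <book>
        Node attr = book.getAttributeNode("id"); // the Attr node for id="b1"

        System.out.println(isElement(book)); // true
        System.out.println(isElement(text)); // false
        System.out.println(isElement(attr)); // false
        System.out.println(book.getTagName().equals(book.getNodeName())); // true
    }
}
```

Every Element is a Node, but a Text or Attr node is a Node that is not an Element, so check the type before casting.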

Here I list only the node types commonly used in projects. For details on the other node types, please refer to the W3Schools reference: http://www.w3schools.com/dom/dom_nodetype.asp.

JDK embedded DOM Parser

Normally we can get the entire XML Document object with the following Java code:

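import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);     // do not validate against a DTD
factory.setNamespaceAware(false); // treat prefixed names as plain names
// factory.setSchema(mySchema);
DocumentBuilder parser = factory.newDocumentBuilder();
// parser.setEntityResolver(new MyEntityResolver());
// parser.setErrorHandler(new MyParseErrorHandler());
// parse() accepts an InputStream, a URI String, a File or an InputSource;
// "books.xml" is only a placeholder path.
org.w3c.dom.Document document = parser.parse(new File("books.xml"));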

Then you can call org.w3c.dom.Document.getDocumentElement() to get the root element. Why? From the Document row of the Node Types table, we know that a Document represents the entire document and has at most one Element child. That single Element is the root element, and Document.getDocumentElement() returns it directly.

Once we have the root element, we can walk through it to parse the XML data.
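
As a runnable illustration of loading a document and reaching its root, here is a small sketch that parses XML held in a String. The class name, the load helper and the sample markup are assumptions for the example, not part of any existing project.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class LoadDocument {

    // Parses XML text into a DOM Document with a non-validating parser.
    static Document load(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            factory.setNamespaceAware(false);
            DocumentBuilder parser = factory.newDocumentBuilder();
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return parser.parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document document = load("<bookstore><book>DOM</book></bookstore>");
        System.out.println(document.getNodeName());                     // #document
        System.out.println(document.getDocumentElement().getTagName()); // bookstore
    }
}
```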

Useful interfaces for DOM parser

Only the key, frequently used methods are listed here; for the other methods, please refer to the Javadoc.

Node.getNodeType(): short

Gets the node type (see the Node Types table above). It is very useful in a parser, because each node type exposes its data through different methods.

Node.getChildNodes(): NodeList

Get a NodeList that contains all children of this node.
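
One thing to watch with getChildNodes(): the list contains every child node, including the whitespace Text nodes between tags. The sketch below filters the list down to Elements; the class and helper names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChildElements {

    // Returns only the direct children that are elements, skipping text and comments.
    static List<Element> childElements(Node parent) {
        List<Element> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) child);
            }
        }
        return result;
    }

    // Hypothetical helper: parses a small XML string and returns its root element.
    static Element rootOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Element root = rootOf("<bookstore>\n  <book/>\n  <book/>\n</bookstore>");
        // Raw count includes the whitespace text nodes around the two <book/> tags.
        System.out.println(root.getChildNodes().getLength());
        System.out.println(childElements(root).size()); // 2
    }
}
```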

Node.getAttributes(): NamedNodeMap

Get a NamedNodeMap containing the attributes of this node (if it is an Element) or null otherwise.
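
A short sketch of reading attributes through getAttributes(), including the null case for non-Element nodes. The class name, helpers and sample XML are invented for the example; a sorted map is used because NamedNodeMap does not promise any particular order.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AttributeDump {

    // Copies a node's attributes into a sorted map; empty for non-element nodes.
    static Map<String, String> attributesOf(Node node) {
        Map<String, String> result = new TreeMap<>();
        NamedNodeMap attributes = node.getAttributes();
        if (attributes == null) {
            return result; // only Element nodes carry attributes
        }
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attr = attributes.item(i);
            result.put(attr.getNodeName(), attr.getNodeValue());
        }
        return result;
    }

    // Hypothetical helper: parses a small XML string and returns its root element.
    static Element rootOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Element book = rootOf("<book id=\"b1\" lang=\"en\">DOM basics</book>");
        System.out.println(attributesOf(book));                 // {id=b1, lang=en}
        System.out.println(attributesOf(book.getFirstChild())); // {}
    }
}
```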

Node.getNodeName(): String

See the Node Types table above.

Node.getNodeValue(): String

See the Node Types table above.
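
The name and value columns of the table can be checked with a few lines of code. This is only a sketch; the describe helper and the sample XML are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NameAndValue {

    // Formats a node as "name = value" using the two generic Node accessors.
    static String describe(Node node) {
        return node.getNodeName() + " = " + node.getNodeValue();
    }

    // Hypothetical helper: parses a small XML string and returns its root element.
    static Element rootOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Element book = rootOf("<book id=\"b1\">DOM basics</book>");
        System.out.println(describe(book.getOwnerDocument()));      // #document = null
        System.out.println(describe(book));                         // book = null
        System.out.println(describe(book.getAttributeNode("id")));  // id = b1
        System.out.println(describe(book.getFirstChild()));         // #text = DOM basics
    }
}
```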

Node.getTextContent(): String

Returns the text content of this node and its descendants, concatenated in document order.

If the current node is the root element, getTextContent() returns all the text content inside the document.
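
A quick sketch of that behaviour, with a made-up two-field book record. The class and helper names are assumptions for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class TextContentDemo {

    // Hypothetical helper: parses a small XML string and returns its root element.
    static Element rootOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Element book = rootOf("<book><title>DOM</title><author>George</author></book>");
        // On the root element, the text of all descendants is concatenated.
        System.out.println(book.getTextContent());                 // DOMGeorge
        // On a leaf element, only its own text comes back.
        System.out.println(book.getFirstChild().getTextContent()); // DOM
    }
}
```

Because the descendant texts are simply glued together, getTextContent() is most useful on leaf elements.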

Element.getElementsByTagName(String name): NodeList

Returns a NodeList of all descendant Elements with a given tag name, in document order.
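
The sketch below shows that getElementsByTagName() searches all descendants, not just direct children, and keeps document order. The class name, helper and sample XML are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TagNameSearch {

    // Collects the text of every descendant element with the given tag name.
    static List<String> textsOf(String xml, String tagName) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            NodeList matches = root.getElementsByTagName(tagName);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                texts.add(matches.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<bookstore>"
                + "<book><author>A</author></book>"
                + "<category><book><author>B</author></book></category>"
                + "</bookstore>";
        // The two <author> elements sit at different depths but are both found.
        System.out.println(textsOf(xml, "author")); // [A, B]
    }
}
```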

Introspect XML with DOM defined methods

We can now parse the XML document node by node with the interfaces above:

Step 1: Use DocumentBuilder to load the XML as a Document object.

Step 2: Get the root element and its list of child nodes.

Step 3: Loop over each node in the child list (upper level), check its node type, and handle it with your business logic. If the current node has child nodes (lower level), pause processing of the current node, loop over the lower-level child nodes until all of them are processed, and then resume processing at the upper level.

Step 4: When all child nodes have been processed, you have reached the end of the XML document.

This is just a draft summary; it will be updated later.
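
The steps above can be sketched as a small recursive walk. This is only one possible shape, not the parser from my project: the class name, the outline format and the sample XML are assumptions, and the "business logic" is reduced to printing an indented outline.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomWalker {

    // Step 3: handle the current node by type, then recurse into its children.
    static void walk(Node node, int depth, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append(node.getNodeName()).append('\n');
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) { // skip whitespace-only text between tags
                out.append(indent).append('"').append(text).append('"').append('\n');
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out); // lower level first, then resume
        }
    }

    // Steps 1 and 2: load the document, then start the walk at the root element.
    static String outline(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(document.getDocumentElement(), 0, out);
            return out.toString(); // Step 4: the walk returns once every node is visited
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline("<bookstore><book>DOM</book></bookstore>"));
    }
}
```

The call stack does the "pause and resume" work of step 3, so no explicit bookkeeping of the upper level is needed.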

Locate Node with XPath

With the previous design, we have to loop over a lot of nodes even if we only want the text of a single element, such as:

/bookstore/category/country/book/author

Thanks to XPath, we can locate an element easily by writing its location as a path, much like a file path in a file system.

For more details on XPath syntax, please refer to W3Schools: http://www.w3schools.com/xpath/default.asp.

You can create an XPath instance as follows:

XPathFactory factory = XPathFactory.newInstance();

XPath xpath = factory.newXPath(); // Create a new XPath instance

Then you can get a NodeList of all the elements whose tag is author:

XPathExpression xexpr = xpath.compile("/bookstore/category/country/book/author");

NodeList nodes = (NodeList) xexpr.evaluate(document, XPathConstants.NODESET);

You can also get a single node as follows:

Node node = (Node) xexpr.evaluate(document, XPathConstants.NODE);

But if you are not sure whether the target node is unique, get a NodeList instead of a single Node.
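
Putting the XPath pieces together, here is a self-contained sketch. The class name, the authors helper and the sample bookstore XML are made up for illustration; the javax.xml.xpath calls are the standard JDK ones.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathLookup {

    // Returns the text of every node matched by the given XPath expression.
    static List<String> select(String xml, String expression) {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            XPathExpression xexpr = xpath.compile(expression);
            NodeList nodes = (NodeList) xexpr.evaluate(document, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("XPath lookup failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<bookstore><category><country>"
                + "<book><author>A</author></book>"
                + "<book><author>B</author></book>"
                + "</country></category></bookstore>";
        System.out.println(select(xml, "/bookstore/category/country/book/author")); // [A, B]
    }
}
```

An expression that matches nothing gives an empty NodeList with NODESET, which is easier to handle than the null you get back with NODE.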

Advantages and Disadvantages

As we can see from the DOM parsing methods, DOM loads the whole document into memory so that you can loop over its nodes easily. That can become an issue when you design a system that exchanges big XML document files. In that case another XML parser, such as Digester or another SAX-based parser, could be an alternative.
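
For comparison, here is a minimal SAX sketch that counts elements while streaming, without building a tree in memory. The class name and the count helper are hypothetical, and a real handler would do more than count.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxCounter {

    // Counts start tags with the given name as the parser streams through the input.
    static int count(String xml, final String tagName) {
        final int[] total = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (qName.equals(tagName)) {
                    total[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return total[0];
    }

    public static void main(String[] args) {
        String xml = "<bookstore><book/><book/><magazine/></bookstore>";
        System.out.println(count(xml, "book")); // 2
    }
}
```

The trade-off is that the handler sees each element only once, so any context you need (parent, siblings) has to be tracked by hand.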

But I have always thought that DOM provides a flexible way to parse XML definitions with a lot of self-referencing elements, like the OMADM DDF node: Node(Node+). Using DOM, we can write our own recursive parser, as I described in the chapter "Introspect XML with DOM defined methods".

Maybe some other parser has a better solution, with a specific path expression that I do not know about. We will see.

2011-04-23 06:30