Jericho HTML Parser
Jericho HTML Parser is a simple but powerful Java library for analysing and manipulating parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.
It is an open source library released under both the Eclipse Public License (EPL) and GNU Lesser General Public License (LGPL). You are therefore free to use it in commercial applications subject to the terms detailed in either one of these licence documents.
The javadocs provide comprehensive documentation of the entire API, as well as being a very useful reference on aspects of HTML and XML in general.
Visit the SourceForge.net project page at http://sourceforge.net/projects/jerichohtml/ for downloads and support.
You can also rate the project highly at http://freshmeat.net/projects/jerichohtml/
Release notes for each version can be found in a file called release.txt in the project root directory.
Features
The library distinguishes itself from other HTML parsers with the following major features:
- The presence of badly formatted HTML does not interfere with the parsing of the rest of the document, which makes the library ideal for use with "real-world" HTML that chokes other parsers.
- ASP, JSP, PSP, PHP and Mason server tags are explicitly recognised by the parser. This means that normal HTML tags are still parsed properly even when server tags appear inside them, which is common, for example, when element attributes are set dynamically.
- It is neither an event nor tree based parser, but rather uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the whole source document is first loaded into memory, and then each search operation scans only the relevant segments for the characters it needs.
- Compared to a tree based parser such as DOM, the memory and resource requirements can be far better if only small sections of the document need to be parsed or modified. Incorrect or badly formatted HTML can easily be ignored, unlike tree based parsers which must identify every node in the document from top to bottom.
- Compared to an event based parser such as SAX, the interface is on a much higher level and more intuitive, and a tree representation of the document element hierarchy is easily created if required.
- The begin and end positions in the source document of all parsed segments are accessible, allowing modification of only selected segments of the document without having to reconstruct the entire document from a tree.
- The row and column number of each position in the source document is easily accessible.
- Provides a simple but comprehensive interface for the analysis and manipulation of HTML form controls, including the extraction and population of initial values, and conversion to read-only or data display modes. Analysis of the form controls also allows data received from the form to be stored and presented in an appropriate manner.
- Custom tag types can be easily defined and registered for recognition by the parser.
- Built-in functionality to format HTML source code that indents elements according to their depth in the document element hierarchy.
- Built-in functionality to render HTML markup with simple text formatting.
- Built-in functionality to extract all text from HTML markup, suitable for feeding into a text search engine such as Apache Lucene.
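The position-based approach described in the list above can be sketched in plain Java. This is an illustration only and does not use or reproduce the library's API or implementation: the class `SegmentSketch` and its methods `findStartTags`, `replace`, `row` and `column` are hypothetical names invented for this example. It shows how a plain text search can record the begin and end positions of a tag, how a single span can be replaced while everything else (including badly formatted markup) is left verbatim, and how a row and column can be derived from a character position.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; hypothetical names, not the Jericho API. */
class SegmentSketch {

    /** A half-open character range [begin, end) in the source text. */
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Case-insensitive indexOf that never changes the length of the text. */
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    /** Finds every start tag with the given name by plain text search. */
    static List<Span> findStartTags(String source, String name) {
        List<Span> spans = new ArrayList<>();
        String prefix = "<" + name;
        int from = 0;
        while (true) {
            int begin = indexOfIgnoreCase(source, prefix, from);
            if (begin < 0) break;
            int afterName = begin + prefix.length();
            int close = source.indexOf('>', afterName);
            if (close < 0) break; // unterminated tag: stop and leave the rest untouched
            char c = source.charAt(afterName);
            // Require a delimiter after the name so that "<a" does not match "<abbr".
            if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                spans.add(new Span(begin, close + 1));
            }
            from = afterName;
        }
        return spans;
    }

    /** Replaces one span and copies every other character of the source verbatim. */
    static String replace(String source, Span span, String replacement) {
        return source.substring(0, span.begin) + replacement + source.substring(span.end);
    }

    /** 1-based row of a character position. */
    static int row(String source, int pos) {
        int row = 1;
        for (int i = 0; i < pos; i++) {
            if (source.charAt(i) == '\n') row++;
        }
        return row;
    }

    /** 1-based column of a character position. */
    static int column(String source, int pos) {
        int lineStart = source.lastIndexOf('\n', pos - 1) + 1;
        return pos - lineStart + 1;
    }

    public static void main(String[] args) {
        // The unclosed <p> and <b> elements are irrelevant to the search and are never touched.
        String source = "<p>bad <b>markup\n<a href=x>link</a> <abbr>n</abbr>";
        for (Span s : findStartTags(source, "a")) {
            System.out.println("row " + row(source, s.begin) + ", column " + column(source, s.begin));
            System.out.println(replace(source, s, "<a href=y>"));
        }
    }
}
```

Because only the recorded span is rewritten, the unclosed elements in the sample input survive unchanged, which is the property a tree-based round trip would typically lose.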
Sample Programs
The samples/console directory in the download package contains sample programs that perform common tasks and demonstrate the functionality of the library. The .bat files can be run directly on MS Windows. On a UNIX-based operating system, run the following from the samples/console directory (note the colon as the classpath separator; Windows uses a semicolon instead):
java -classpath classes:../lib/jericho-html-x.x.jar ProgramName
where x.x is the current release number and ProgramName is the name of the sample program to run.
The following sample programs are available:
Sample program | Description |
ConvertStyleSheets.java | Demonstrates how to detect all external style sheets and place them inline into the document. |
DisplayAllElements.java | Demonstrates the behaviour of the library when retrieving all elements from a document containing a mix of normal HTML, different types of server tags, and badly formatted HTML. |
ExtractText.java | Demonstrates the use of the TextExtractor class that extracts all of the text from a document, as well as the title, description, keywords and links. |
FindSpecificTags.java | Demonstrates how to search for tags with a specified name, in a specified namespace, or special tags such as document type declarations, XML declarations, XML processing instructions, common server tags, PHP tags, Mason tags, and HTML comments. |
FormControlDisplayCharacteristics.java | Demonstrates setting the display characteristics of individual form controls. This allows a control to be disabled, removed, or replaced with a plain text representation of its value (display value). The new document is written to a file called NewForm.html |
FormFieldCSVOutput.java | Demonstrates the use of the FormFields.getColumnValues(Map) method to store form data in a .CSV file, automatically creating separate columns for fields that can contain multiple values (such as checkboxes). The output is written to a file called FormData.csv |
FormFieldList.java | Demonstrates the use of the Segment.findFormFields() method to list all form fields and their associated controls in a document. |
FormFieldSetValues.java | Demonstrates setting the values of form controls, which is best done via the FormFields object. The new document is written to a file called NewForm.html |
FormatSource.java | Demonstrates the use of the SourceFormatter class that formats HTML source by laying out each non-inline-level element on a new line with an appropriate indent. Also known as a "source beautifier". |
RenderToText.java | Demonstrates the use of the Renderer class that performs a simple text rendering of HTML markup, similar to the way Mozilla Thunderbird and other email clients provide an automatic conversion of HTML content to text in their alternative MIME encoding of emails. |
Encoding.java | Demonstrates the use of the EncodingDetector class and how to determine the encoding of a source document. |
SplitLongLines.java | Demonstrates how to reformat a document so that lines exceeding a certain number of characters are split into multiple lines. |
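The idea behind FormFieldCSVOutput.java, giving multi-valued fields such as checkboxes their own columns, can be sketched without the library. Everything here is an assumption made for illustration: the class `CsvColumnsSketch`, its methods, the `field.value` column labels and the `true`/`false` cell values are hypothetical and are not claimed to match what `FormFields.getColumnValues(Map)` actually produces.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; hypothetical names and column layout, not the Jericho API. */
class CsvColumnsSketch {

    /**
     * One column per free-text field, and one column per predefined value
     * of a field that can hold several values (assumed label: "field.value").
     */
    static List<String> columnLabels(Map<String, List<String>> predefinedValues) {
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : predefinedValues.entrySet()) {
            if (e.getValue().isEmpty()) {
                labels.add(e.getKey());
            } else {
                for (String value : e.getValue()) labels.add(e.getKey() + "." + value);
            }
        }
        return labels;
    }

    /** Cell values for one submitted form, in the same order as columnLabels. */
    static List<String> columnValues(Map<String, List<String>> predefinedValues,
                                     Map<String, List<String>> submitted) {
        List<String> cells = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : predefinedValues.entrySet()) {
            List<String> given = submitted.getOrDefault(e.getKey(), List.of());
            if (e.getValue().isEmpty()) {
                cells.add(given.isEmpty() ? "" : given.get(0));
            } else {
                for (String value : e.getValue()) {
                    cells.add(given.contains(value) ? "true" : "false");
                }
            }
        }
        return cells;
    }

    /** Joins cells into one CSV row, quoting cells that contain a comma, quote or newline. */
    static String csvRow(List<String> cells) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < cells.size(); i++) {
            if (i > 0) row.append(',');
            String cell = cells.get(i);
            boolean quote = cell.contains(",") || cell.contains("\"") || cell.contains("\n");
            row.append(quote ? "\"" + cell.replace("\"", "\"\"") + "\"" : cell);
        }
        return row.toString();
    }

    public static void main(String[] args) {
        // An empty list marks a free-text field; a non-empty list holds the checkbox values.
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("name", List.of());
        fields.put("colour", List.of("red", "green"));

        Map<String, List<String>> submitted = new LinkedHashMap<>();
        submitted.put("name", List.of("Ann, B"));
        submitted.put("colour", List.of("green"));

        System.out.println(csvRow(columnLabels(fields)));
        System.out.println(csvRow(columnValues(fields, submitted)));
    }
}
```

Giving each predefined value its own column keeps every row the same width no matter how many boxes a user ticked, which is what makes the data easy to load into a spreadsheet.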
Building
The build and sample files are implemented as DOS .bat files only, because I wanted to avoid the need to install Ant for such a simple library. Apologies to all UNIX users for the inconvenience.
On the Drawing Board...
- Ability to generate a JDOM document, making it a JTidy alternative
- Online interactive sample programs - please let me know if you are willing to host the FormatSource.jsp page on your web server
- .NET (DotNet) version if enough interest is shown (register your interest via the forums)
Alternative HTML Parsers
This package was originally written in the latter half of 2002. At that time I evaluated 6 other parsers, none of which were capable of achieving my aims. Most couldn't reproduce a typical HTML document without change, none could reproduce a source document containing badly formatted or non-HTML components without change, and none provided a means to track the positions of nodes in the source text. A list of these parsers and a brief description follows, but please note that I have not revised this analysis since before this package was written. Please let me know if there are any errors.
- JavaCC HTML Parser by Quiotix Corporation (http://www.quiotix.com/downloads/html-parser/)
GNU GPL licence, expensive licence fee to use in commercial application. Does not support document structure (parses into a flat node stream).
- Demonstrational HTML 3.2 parser bundled with JavaCC. Virtually useless.
- JTidy (http://jtidy.sourceforge.net/)
Supports document structure, but by its very nature it "tidies" up anything it doesn't like in the source document. At first glance it looks like the positions of nodes in the source are accessible, at least in protected start and end fields in the Node class, but these are pointers into a different buffer and are of no use.
- javax.swing.text.html.parser.Parser
Comes standard in the JDK. Supports document structure. Does not track the positions of nodes in the source text, but can be easily modified to do so (although I am not sure of the legal implications of such modifications). Requires a DTD to function, but only comes with an HTML 3.2 DTD, which is unsuitable. Even if an HTML 4.01 DTD were found, the parser itself might need tweaking to cater for the new element types. The DTD needs to be in the format of a "bdtd" file, which is a binary format used only by Sun in this parser implementation. I have found many requests for a 4.01 bdtd file in newsgroups etc. on the web, but they all remain unanswered. Building it from scratch is not so easy.
- Kizna HTML Parser v1.1 (http://htmlparser.sourceforge.net/)
GNU LGPL licence. Version 1.1 was very simple without support for document structure. I have since revisited this project at SourceForge (early 2004), where version 1.4 is now available. There are now two separate libraries, one with and one without document structure support. It claims to now also be capable of reproducing source text verbatim.
- CyberNeko HTML Parser (http://www.apache.org/~andyc/neko/doc/html/index.html)
Apache-style licence. Supports document structure. Based on the very popular Xerces XML parser. At the time of evaluation this parser didn't regenerate the source accurately enough.