Extracting Data with Jsoup

fuliang

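Jsoup is an HTML parser for Java. It provides convenient methods for extracting data from HTML documents and manipulating them, and lets you locate nodes and read their contents by combining DOM traversal, CSS selectors, and jQuery-like methods.
Its select API and chained method calls are about as powerful as jQuery's.
As an example, the Scala program below extracts rental listings from 58.com (58同城) to show how to use it: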

package test

import org.jsoup.nodes.Document
import java.util.HashMap
import org.jsoup.Jsoup
/**
 * Author: fuliang
 * http://fuliang.iteye.com
 */
// One rental listing: title, detail-page link, monthly price, layout and post date.
class HouseEntry(val title: String, val link: String, val price: Int, val houseType: String, val date: String){
	override def toString: String =
		s"title: $title\tlink:$link\tprice:$price\thouseType:$houseType\tdate:$date\n"
}

class HouseRentCrawler{
	// Fetches the search-result page and extracts the listings from it.
	def crawl(url: String, keyword: String, lowRange: Int, highRange: Int): List[HouseEntry] = {
		val doc = fetch(url, keyword, lowRange, highRange)
		extract(doc)
	}

	// Sends a GET request with the search parameters as query data.
	private def fetch(url: String, keyword: String, lowRange: Int, highRange: Int): Document = {
		val params = new HashMap[String, String]()
		params.put("final", "1")
		params.put("jump", "2")
		params.put("searchtype", "3")
		params.put("key", keyword)
		// The price range is sent as "low_high", e.g. "2000_3500".
		params.put("MinPrice", s"${lowRange}_$highRange")

		Jsoup.connect(url).data(params)
			.userAgent("Mozilla")
			.timeout(10000) // milliseconds
			.get()
	}
	
	private def extract(doc: Document): List[HouseEntry] = {
		// Each listing is a row directly under #infolist; rows with class "dev" are skipped.
		val elements = doc.select("#infolist > tr:not(.dev)")
		// Map over the row indices so the result keeps the order of the page.
		(0 until elements.size()).map { i =>
			val fields = elements.get(i).select("td")
			val title = fields.get(0).text()
			val link = fields.get(0).select("a[class=t]").attr("href")
			// Assumes the price cell holds digits only; toInt throws NumberFormatException otherwise.
			val price = fields.get(1).text().toInt
			val houseType = fields.get(2).text()
			val date = fields.get(3).text()
			new HouseEntry(title, link, price, houseType, date)
		}.toList
	}
}

object HouseRentCrawler{
	def main(args: Array[String]): Unit = {
		val url = "http://bj.58.com/zufang"
		val crawler = new HouseRentCrawler()
		// Search keyword "智学苑", price range 2000 to 3500.
		val houseEntries = crawler.crawl(url, "智学苑", 2000, 3500)
		for(entry <- houseEntries){
			println(entry)
		}
	}
}
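
The extraction step can be tried without a network connection by parsing an HTML string directly. The following is only a sketch: the markup is invented for illustration (the real page structure may differ), the object name ExtractOffline is hypothetical, and jsoup is assumed to be on the classpath.

```scala
import org.jsoup.Jsoup

object ExtractOffline {
	// Hypothetical markup, invented for this sketch; the real page may differ.
	val html =
		"""<table><tbody id="infolist">
		  |<tr class="dev"><td>ad row</td></tr>
		  |<tr><td><a class="t" href="http://example.com/1">Two-bedroom flat</a></td>
		  |<td>2500</td><td>2 rooms</td><td>03-20</td></tr>
		  |</tbody></table>""".stripMargin

	val doc = Jsoup.parse(html)

	// Same selector as in extract(): rows directly under #infolist, except class "dev".
	val rows = doc.select("#infolist > tr:not(.dev)")
	val fields = rows.get(0).select("td")

	val title = fields.get(0).text()
	val link = fields.get(0).select("a[class=t]").attr("href")
	val price = fields.get(1).text().toInt

	def main(args: Array[String]): Unit =
		println(s"$title\t$link\t$price")
}
```

In this sample the id sits on the tbody element, because an HTML5 parser puts table rows inside a tbody, so a `>` child selector has to start from the row's actual parent.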

Selector overview

    * tagname: find elements by tag, e.g. a
    * ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
    * #id: find elements by ID, e.g. #logo
    * .class: find elements by class name, e.g. .masthead
    * [attribute]: elements with attribute, e.g. [href]
    * [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
    * [attr=value]: elements with attribute value, e.g. [width=500]
    * [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
    * [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
    * *: all elements, e.g. *
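
A few of the basic selectors above, applied to a small document. This is a minimal sketch: the HTML fragment and the names BasicSelectors and count are made up for illustration, and jsoup is assumed to be on the classpath.

```scala
import org.jsoup.Jsoup

object BasicSelectors {
	// Hypothetical HTML fragment, made up for this sketch.
	val html =
		"""<div id="logo"><img src="logo.PNG" width="500"></div>
		  |<div class="masthead"><a href="/path/to/page" data-id="7">Page</a></div>
		  |<p><a name="top">anchor without href</a></p>""".stripMargin

	val doc = Jsoup.parse(html)

	// Number of elements matched by a selector.
	def count(selector: String): Int = doc.select(selector).size()

	def main(args: Array[String]): Unit = {
		val selectors = List(
			"a", "#logo", ".masthead", "[href]", "[^data-]",
			"[width=500]", "[href*=/path/]",
			// Triple quotes keep the backslash in the regex literal.
			"""img[src~=(?i)\.(png|jpe?g)]""")
		for (s <- selectors) println(s"$s -> ${count(s)}")
	}
}
```

Here the tag selector a matches both anchors, while [href] matches only the one that actually carries an href attribute.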

Selector combinations

    * el#id: elements with ID, e.g. div#logo
    * el.class: elements with class, e.g. div.masthead
    * el[attr]: elements with attribute, e.g. a[href]
    * Any combination, e.g. a[href].highlight
    * ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
    * parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
    * siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
    * siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
    * el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
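
The difference between the descendant, child and sibling combinators is easiest to see side by side. The sketch below uses an invented HTML fragment; the names CombinedSelectors and texts are hypothetical, and jsoup is assumed to be on the classpath.

```scala
import org.jsoup.Jsoup

object CombinedSelectors {
	// Hypothetical HTML fragment, made up for this sketch.
	val html =
		"""<div class="content"><p>direct</p><section><p>nested</p></section></div>
		  |<div class="head">Head</div><div>after head</div>
		  |<h1>Title</h1><p>first</p><span>x</span><p>second</p>
		  |<a href="/x" class="highlight">hit</a><a class="highlight">miss</a>""".stripMargin

	val doc = Jsoup.parse(html)

	// Text of all matched elements, joined by single spaces.
	def texts(selector: String): String = doc.select(selector).text()

	def main(args: Array[String]): Unit = {
		println(texts("div.content p"))     // descendants at any depth
		println(texts("div.content > p"))   // direct children only
		println(texts("div.head + div"))    // the sibling right after div.head
		println(texts("h1 ~ p"))            // every later p sibling of the h1
		println(texts("a[href].highlight")) // attribute and class combined
		println(texts("div.head, h1"))      // union of two selectors
	}
}
```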

Pseudo selectors

    * :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
    * :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
    * :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
    * :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
    * :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
    * :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
    * :containsOwn(text): find elements that directly contain the given text
    * :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
    * :matchesOwn(regex): find elements whose own text matches the specified regular expression
    * Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
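
The pseudo selectors can be checked the same way. Again a sketch under assumptions: the HTML fragment and the names PseudoSelectors, text and count are invented for illustration, and jsoup is assumed to be on the classpath.

```scala
import org.jsoup.Jsoup

object PseudoSelectors {
	// Hypothetical HTML fragment, made up for this sketch.
	val html =
		"""<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>
		  |<div class="logo"><p>Jsoup rocks</p></div>
		  |<div><span>Login here</span></div>""".stripMargin

	val doc = Jsoup.parse(html)

	// Text of all matched elements, joined by single spaces.
	def text(selector: String): String = doc.select(selector).text()

	// Number of elements matched by a selector.
	def count(selector: String): Int = doc.select(selector).size()

	def main(args: Array[String]): Unit = {
		println(text("td:lt(3)"))               // cells at index 0, 1 and 2
		println(text("td:gt(2)"))               // cells after index 2
		println(text("td:eq(1)"))               // the second cell
		println(count("div:has(p)"))            // divs that contain a p
		println(text("div:not(.logo)"))         // divs without class "logo"
		println(count("p:contains(jsoup)"))     // case-insensitive text match
		println(count("div:contains(login)"))   // text anywhere below the div
		println(count("div:containsOwn(login)")) // text directly inside the div
		println(count("div:matches((?i)login)")) // regex against the div's text
	}
}
```

The last three lines show why the Own variants exist: the word "Login" sits inside a span, so the div matches :contains and :matches but not :containsOwn.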

For more information, see http://jsoup.org/

Posted: 2011-03-20 19:22