Lucene (Full-Text Retrieval) Beginner's Notes, Part 1: A First Lucene Program

yang7527

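1. Lucene is one of the more popular technologies in full-text search in recent years.

It is an open-source full-text search engine toolkit from the Apache Software Foundation. It began as a subproject of the Jakarta project and later became a top-level Apache project. Calling it a "full-text search engine" is not quite accurate: it is a library for building one.

At the time of writing, the latest version is 3.0.3. Download: http://apache.etoak.com/lucene/java/

The sample code in this article was written against Lucene 3.0.1. I have not found any difference from 3.0.3 that affects it, so feel free to copy it.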

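2. What is full-text search?

Full-text retrieval takes text as the object of retrieval and finds the texts that contain the given words. Completeness, accuracy, and speed are the key measures of a full-text retrieval system.

What we need to know about full-text retrieval: 1. It only handles text. 2. It does not handle semantics. 3. Searches in English are case-insensitive (with the commonly used analyzers, which lowercase terms). 4. The result list is sorted by relevance.

Among information retrieval tools, full-text retrieval is the most general and practical. There are several frameworks in this area; Lucene is one of the open-source ones.

3. Where Lucene is used

Lucene is mainly used for in-site search, that is, searching the resources inside one system: articles in a BBS or blog, products in an online shop, and so on. It is widely used.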

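4. Terminology:

* Index and index store

The collection of resources to be searched is brought local and stored in a specific structure; that structure is called an index.

The collection of these indexes is called the index store.

The index store is a directory containing binary files. It is much like a database, whose data also ends up as files in the file system. Lucene provides a good set of APIs for working with these files.

* Inverted index

An index store must answer queries very efficiently over large amounts of data. Its structure is therefore designed specifically for fast lookup, based on the characteristics of full-text retrieval.

Roughly, it works like this:

The index store keeps a term dictionary that records every term appearing in the store and, through a special mechanism, describes which documents use each term.

For example, it might look like this:

Lucene --> doc 1, doc 3, doc 5

full-text search --> doc 1, doc 6, doc 3

domain --> doc 1, doc 3

When a user searches for the two terms "full-text search" and "domain", this storage structure lets the search quickly locate the three matching documents, "doc 1, doc 6, doc 3". Doc 1 and doc 3 match both keywords, so they score higher and are listed ahead of doc 6.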
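
The term-to-documents table above can be sketched in a few lines of plain Java. This is only a toy model and not Lucene's actual storage format: the class InvertedIndexSketch and its add and search methods are hypothetical names, and the "score" is assumed to be simply the number of query terms a document contains.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy inverted index: term -> sorted set of document ids (hypothetical, not Lucene code). */
class InvertedIndexSketch {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Records that the given document contains each of the given terms. */
    void add(int docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns matching doc ids; documents matching more query terms come first, ties by id. */
    List<Integer> search(String... queryTerms) {
        Map<Integer, Integer> matched = new HashMap<>();
        for (String term : queryTerms) {
            // Look the term up directly instead of scanning every document.
            for (Integer docId : postings.getOrDefault(term, new TreeSet<>())) {
                matched.merge(docId, 1, Integer::sum);
            }
        }
        List<Integer> result = new ArrayList<>(matched.keySet());
        result.sort((a, b) -> {
            int byCount = matched.get(b) - matched.get(a);
            return byCount != 0 ? byCount : a - b;
        });
        return result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "lucene", "full-text search", "domain");
        index.add(3, "lucene", "full-text search", "domain");
        index.add(5, "lucene");
        index.add(6, "full-text search");
        System.out.println(index.search("full-text search", "domain")); // prints [1, 3, 6]
    }
}
```

With this sample data, docs 1 and 3 contain both query terms and doc 6 contains only one, which is why doc 6 comes last.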

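5. HelloWorld -- a first Lucene program

** This program will be as simple as possible, but however simple, it still has two parts: saving and searching. Saving means creating an index entry in the index store. Searching means querying the index store for the data that matches the search condition.

** LuceneHelloWorld.java

// Step 1: add content to the index
  public void createContentIndex() {
      // Create the content object

      // Save it
  }
// Step 2: search
  public void search() {
      // Search condition
      String queryStr = "HelloWorld";
      // Search and get the results
      List list;

      // Display the results
      // syso: print the search results
  }

** That is the overall skeleton. Filling it in requires the Lucene API, so next we add the jars:

Four jars are commonly used:

lucene-core-3.0.1.jar (core)

contrib/analyzers/common/lucene-analyzers-3.0.1.jar (analyzers)

contrib/highlighter/lucene-highlighter-3.0.1.jar (highlighting)

contrib/memory/lucene-memory-3.0.1.jar (in-memory index, used by the highlighter)

** Refining the code: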

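Step 1: add content to the index

public void createContentIndex() {
	// Create the content object, using our own Article class
	Article article = new Article();
	// Call article.setId, setTitle and setContent to fill in the article

	// Save it to the index store

	// How do we save it? We need to build a Document object that carries all the data we want to save.
	// So the next task is to convert our own entity object into a Document object
	   Document doc = new Document();
	// Add data to doc by calling its add method:
	   doc.add(field);  // ??? What is a Field

	// Save the doc object to the index store
	   IndexWriter indexWriter;  // ??? How do we get an IndexWriter object
	   indexWriter.addDocument(doc);
	   indexWriter.close();
}

The Article class:

public class Article {
     private Integer id;
     private String title;
     private String content;
     ...
     // getters and setters ...
}

Step 2: search

public void search() {
     // Search condition
     String queryStr = "HelloWorld";
     // Search and get the results
     List list;
     // Use an IndexSearcher instance to search the index store
     IndexSearcher indexSearcher; // ??? How do we get an IndexSearcher object
     // search takes two arguments: the Query is the search condition, and 100 means only the top 100 matches are returned
     TopDocs td = indexSearcher.search(query, 100);  // ??? How do we get a Query object
     // Total number of documents in the index store that match the query
     int count = td.totalHits;
     // The matched entries
     ScoreDoc[] sds = td.scoreDocs;
     // Display the results
     // syso: print the results
}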

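** Solving the problems

1. ??? What is a Field

A Field object describes one of the elements that make up a Document stored in the index store.

The relationship is similar to a database table that stores many records.

We can think of the index store as the table, of each Document as one record, and of each Field as one column value of that record. The fields together make up the whole Document.

A database table, however, is more than its columns: it also carries metadata that describes the table or its columns. A Field likewise carries options that describe how it is handled.

Field(String name, String value, Store store, Index index):

Store store: whether to store the field's value in the index store.

1. Store.YES: stored. 2. Store.NO: not stored, so the field's value cannot be read back from search results.

Index index: chooses one of three strategies for indexing the field:

1. Index.NO: not indexed, so the field cannot be searched. 2. Index.ANALYZED: tokenized by the analyzer, then indexed. 3. Index.NOT_ANALYZED: indexed as a single term, without tokenizing.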
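
To make the difference between "stored" and "indexed" concrete, here is a small plain-Java model of a single document. It is a sketch under stated assumptions, not Lucene's API: FieldOptionsSketch, addField, matches and get are hypothetical names, the three boolean flags stand in for the Store and Index options, and "analyzing" is assumed to mean lowercasing and splitting on whitespace.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy single-document model of the Store / Index options (hypothetical, not Lucene's API). */
class FieldOptionsSketch {
    private final Map<String, String> stored = new HashMap<>(); // values that can be read back
    private final Set<String> searchable = new HashSet<>();     // "field:term" entries

    void addField(String name, String value, boolean store, boolean index, boolean analyze) {
        if (store) {
            stored.put(name, value);                 // like Store.YES
        }
        if (!index) {
            return;                                  // like Index.NO: nothing to search on
        }
        if (analyze) {                               // like Index.ANALYZED
            for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    searchable.add(name + ":" + token);
                }
            }
        } else {                                     // like Index.NOT_ANALYZED: one exact term
            searchable.add(name + ":" + value);
        }
    }

    boolean matches(String field, String term) {
        return searchable.contains(field + ":" + term);
    }

    String get(String field) {
        return stored.get(field);
    }

    public static void main(String[] args) {
        FieldOptionsSketch doc = new FieldOptionsSketch();
        doc.addField("id", "42", true, true, false);
        doc.addField("content", "Hello Lucene", false, true, true);
        doc.addField("note", "internal", true, false, false);
        System.out.println(doc.matches("content", "lucene")); // true: indexed and analyzed
        System.out.println(doc.get("content"));               // null: not stored
        System.out.println(doc.matches("note", "internal"));  // false: not indexed
        System.out.println(doc.get("note"));                  // internal: stored
    }
}
```

The point of the sketch is that the two options are independent: a field can be searchable without being retrievable, and retrievable without being searchable.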

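2. ??? How do we get an IndexWriter object

An IndexWriter saves a Document object that carries data into the index store.

And what is the index store? On disk, it is simply a directory of files.

So the IndexWriter also needs to know where the index store is.

IndexWriter(Directory d, Analyzer a, MaxFieldLength mfl)

The Directory object describes where the index store is located on disk:

Directory directory = FSDirectory.open(new File("./indexDir/"));

Analyzer analyzer: the analyzer (tokenizer).

This is an important concept. Lucene relies on the Analyzer both to manage its data and to carry out searches.

When a Document is saved, the analyzer breaks its text into "words", and the words are then stored.

A search must specify an analyzer as well. It tokenizes our simple search condition in the same way, and the resulting terms are looked up in the index: find the term, then learn from its entry which documents it appears in.

Tokenizing works differently for each language. For Chinese, the analyzer needs to understand that in "我现在在写文章" ("I am writing an article now"), "我" is one word and "现在" is one word, rather than "现" and "在" separately...

This mechanism tells us that saving (creating the index) and searching should use the same analyzer.

For example:

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

StandardAnalyzer is the standard analyzer that ships with Lucene. It splits English text into words, roughly at whitespace and punctuation, and splits Chinese text into single characters, one character per token.

For Chinese there are usually three approaches to word segmentation: single-character, bigram, and dictionary-based.

Dictionary-based segmentation is generally considered the best of the three. For example, "我们是中国人" ("we are Chinese") is segmented into "我们" and "中国人".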
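
The three segmentation approaches can be compared side by side in plain Java. This is a rough sketch rather than how any real analyzer is implemented: SegmentationSketch and its methods are hypothetical names, the dictionary variant is assumed to use simple forward maximum matching over a tiny hand-made dictionary, and unlike the example above it keeps "是" as a token instead of dropping it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Three toy ways to segment Chinese text (hypothetical sketch, not a real analyzer). */
class SegmentationSketch {
    /** Single-character segmentation: every character becomes a token. */
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(text.substring(i, i + 1));
        }
        return out;
    }

    /** Bigram segmentation: every pair of adjacent characters becomes a token. */
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    /** Dictionary segmentation: take the longest dictionary word at each position,
     *  falling back to a single character when nothing matches. */
    static List<String> dictionary(String text, Set<String> dict, int maxWordLength) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "我们是中国人";
        Set<String> dict = new HashSet<>(Arrays.asList("我们", "中国", "中国人"));
        System.out.println(unigrams(text));            // 6 single-character tokens
        System.out.println(bigrams(text));             // 5 overlapping two-character tokens
        System.out.println(dictionary(text, dict, 3)); // 我们 / 是 / 中国人
    }
}
```

Single-character and bigram segmentation need no dictionary but produce many tokens that are not real words; the dictionary approach produces real words but is only as good as its dictionary.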

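Common analyzers include the JE analyzer (极易分词, MMAnalyzer), the "Paoding" analyzer (PaodingAnalyzer), IKAnalyzer, and others. Of these, MMAnalyzer and PaodingAnalyzer do not support Lucene 3.0 and later.

MaxFieldLength mfl: the maximum field length, that is, the maximum number of terms indexed for one field.

Possible values: new MaxFieldLength(10000); MaxFieldLength.LIMITED, which is 10000; MaxFieldLength.UNLIMITED.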

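3. ??? How do we get an IndexSearcher object

To get an IndexSearcher, tell it where to look, that is, where the index store is.

Directory indexDir = FSDirectory.open(new File("./indexDir/")); // the directory that holds the index store

IndexSearcher indexSearcher = new IndexSearcher(indexDir);

4. ??? How do we get a Query object

A Query object is built with the help of its parser, QueryParser.

QueryParser acts as the intermediary between our business requirements and Lucene's queries. It translates the search condition as described in our application into a query that Lucene understands.

For example, the search condition we defined is:

String queryStr = "HelloWorld"; // we want Lucene to understand this condition and find the resources that contain this word

QueryParser queryParser = new QueryParser(Version.LUCENE_30, "content", analyzer);

Version.LUCENE_30: the Lucene version. Per the Javadoc: "Match settings and bugs in Lucene's 3.0 release."

"content": the field to search. When a document is saved, its data is separated by Field. Here it means the search runs against the "content" field.

analyzer is again an analyzer; we should use the same one that was used when the index was created.

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);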
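
Why the two analyzers should agree can be shown with a small plain-Java experiment. It is only an illustration under simplifying assumptions: AnalyzerMismatchSketch and its methods are hypothetical, an "analyzer" is reduced to a function from text to tokens, and a query is taken to match when all of its tokens were indexed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

/** Shows what happens when index-time and query-time tokenizing differ (hypothetical sketch). */
class AnalyzerMismatchSketch {
    /** Splits on whitespace and lowercases each token. */
    static List<String> lowercasing(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    /** Splits on whitespace but keeps the original case. */
    static List<String> caseKeeping(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** True if every token of the analyzed query is among the analyzed, indexed tokens. */
    static boolean matches(String indexedText, Function<String, List<String>> indexAnalyzer,
                           String query, Function<String, List<String>> queryAnalyzer) {
        Set<String> indexed = new HashSet<>(indexAnalyzer.apply(indexedText));
        return indexed.containsAll(queryAnalyzer.apply(query));
    }

    public static void main(String[] args) {
        String text = "sang a round of HelloWorld";
        // Same analyzer on both sides: "HelloWorld" becomes "helloworld" twice, so it matches.
        System.out.println(matches(text, AnalyzerMismatchSketch::lowercasing,
                "HelloWorld", AnalyzerMismatchSketch::lowercasing)); // true
        // Different analyzers: the index holds "helloworld", the query asks for "HelloWorld".
        System.out.println(matches(text, AnalyzerMismatchSketch::lowercasing,
                "HelloWorld", AnalyzerMismatchSketch::caseKeeping)); // false
    }
}
```

The document contains the word in both runs; only the mismatch between the two tokenizing rules makes the second search come back empty.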

** With these questions answered, we can continue writing the program:

Step 1: add content to the index

public void createContentIndex() throws Exception {
	Article article = new Article();
	... // fill in the article here


	Document doc = new Document();
	doc.add(new Field("id", article.getId()+"", Store.YES, Index.ANALYZED));
	... // add the remaining fields here

	/* Directory that holds the index store */
	Directory indexDir = FSDirectory.open(new File("./indexDir/"));
	/* Use the standard analyzer */
	Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

	IndexWriter indexWriter = new IndexWriter(indexDir, analyzer, MaxFieldLength.LIMITED);
	indexWriter.addDocument(doc);
	indexWriter.close();
}

Step 2: search

public void search() throws Exception {
	String queryStr = "HelloWorld";

	List<Article> list = new ArrayList<Article>();

	/* Location of the index store to search; an exception is thrown if no index store exists there */
	Directory indexDir = FSDirectory.open(new File("./indexDir/"));
	/* Get the searcher */
	IndexSearcher indexSearcher = new IndexSearcher(indexDir);

	/* Build the analyzer and parse the query */
	Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
	QueryParser queryParser = new QueryParser(Version.LUCENE_30, "content", analyzer);
	Query query = queryParser.parse(queryStr);


	/*
	 * At this point no content has actually been loaded. All we get are the IDs that point to the matching Documents.
	 * This is similar to Hibernate's Session.load() method.
	 */
	TopDocs td = indexSearcher.search(query, 100);

	/*
	 * Total number of matching records. It is not affected by the second argument of indexSearcher.search(query, 100).
	 * The 100 means that only the top 100 matching records are returned,
	 * while td.totalHits is how many records in the index store match the query. With 500 matches, it is 500.
	 */
	int count = td.totalHits;

	ScoreDoc[] sds = td.scoreDocs;


	/*
	 * We now have the ID of every matching Document,
	 * so we can use indexSearcher.doc(id) to load each Document.
	 */
	for(ScoreDoc scoreDoc : sds) {
		Document document = indexSearcher.doc(scoreDoc.doc);

		Article article = new Article();
		article.setId(Integer.parseInt(document.get("id")));
		... // set the remaining properties

		list.add(article);
	}


	// Display the results
	// syso: loop over list
}
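
The relationship between the total match count and the "top 100" limit can be mimicked without Lucene. The following is a hedged, plain-Java sketch: TopNSketch and its Result class are hypothetical stand-ins for the searcher and its result object, and the scores are made-up numbers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Mimics "return the best n hits but report the total number of matches" (hypothetical sketch). */
class TopNSketch {
    /** Minimal stand-in for a search result: total match count plus the best n scores. */
    static final class Result {
        final int totalHits;
        final List<Double> top;

        Result(int totalHits, List<Double> top) {
            this.totalHits = totalHits;
            this.top = top;
        }
    }

    /** Keeps only the n highest scores, while still counting every match. */
    static Result search(List<Double> matchingScores, int n) {
        List<Double> sorted = new ArrayList<>(matchingScores);
        sorted.sort(Collections.reverseOrder());
        List<Double> top = new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
        return new Result(matchingScores.size(), top);
    }

    public static void main(String[] args) {
        List<Double> scores = new ArrayList<>();
        for (int i = 1; i <= 500; i++) {
            scores.add(i / 500.0); // 500 made-up matches
        }
        Result result = search(scores, 100);
        System.out.println(result.totalHits + " matches, " + result.top.size() + " returned");
        // prints: 500 matches, 100 returned
    }
}
```

This is the same distinction as between td.totalHits and td.scoreDocs.length in the code above: the first counts all matches, the second is capped by the limit passed to search.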

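** The complete Lucene HelloWorld program

// Article.java
/**
 * The "article" entity
 */
public class Article {
	private Integer id;
	private String title;
	private String content;
	public Integer getId() {
		return id;
	}
	public void setId(Integer id) {
		this.id = id;
	}
	public String getTitle() {
		return title;
	}
	public void setTitle(String title) {
		this.title = title;
	}
	public String getContent() {
		return content;
	}
	public void setContent(String content) {
		this.content = content;
	}
}

// HelloWorld.java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class HelloWorld {
	/**
	 * Create the index
	 */
	@Test
	public void createIndex() throws Exception {
		/*
		 * Step 1: convert the data (usually an entity class) into a Document object that Lucene accepts
		 */
		Article article = new Article();
		article.setId(1);
		article.setTitle("wjh climbs Tianshan");
		article.setContent("Reportedly, wjh arrived at Tianshan yesterday and sang a round of HelloWorld");

		Document document = new Document();
		document.add(new Field("id", article.getId()+"", Store.YES, Index.ANALYZED));
		document.add(new Field("title", article.getTitle(), Store.YES, Index.ANALYZED));
		document.add(new Field("content", article.getContent(), Store.YES, Index.ANALYZED));


		/*
		 * Step 2: build the IndexWriter from the index store directory, the analyzer, and the maximum field length
		 */
		Directory indexDir = FSDirectory.open(new File("./indexDir/"));
		// The standard analyzer; Lucene also provides analyzers for many other languages
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
		IndexWriter indexWriter = new IndexWriter(indexDir, analyzer, MaxFieldLength.LIMITED);


		/*
		 * Step 3: save the document to the index store. It is tokenized and then indexed. Do not forget to close the indexWriter.
		 */
		indexWriter.addDocument(document);
		indexWriter.close();
	}

	/**
	 * Search
	 */
	@Test
	public void search() throws Exception {
		/*
		 * Search condition
		 */
		String queryStr = "HelloWorld";

		/*
		 * Holds the search results
		 */
		List<Article> list = new ArrayList<Article>();


		/*
		 * Build the IndexSearcher from a Directory: where should it look?
		 * FSDirectory.open(File) opens the index store in the given directory
		 */
		Directory indexDir = FSDirectory.open(new File("./indexDir/"));
		IndexSearcher indexSearcher = new IndexSearcher(indexDir);


		/*
		 * Build the Query object: parse the search condition into a query object that Lucene's search supports.
		 * An Analyzer must be given: how should the search condition be tokenized?
		 * QueryParser creates the Query.
		 */
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
		QueryParser queryParser = new QueryParser(Version.LUCENE_30, "content", analyzer);
		Query query = queryParser.parse(queryStr);


		/*
		 * At this point no content has actually been loaded. All we get are the IDs that point to the matching Documents.
		 * This is similar to Hibernate's Session.load() method.
		 */
		TopDocs topDocs = indexSearcher.search(query, 100);

		/*
		 * Total number of matching records. It is not affected by the second argument of indexSearcher.search(query, 100).
		 * The 100 means that only the top 100 matching records are returned,
		 * while topDocs.totalHits is how many records in the index store match the query. With 500 matches, it is 500.
		 */
		int totalCount = topDocs.totalHits;

		/* The matched entries. As noted above, each element only holds the ID that points to a matching Document. */
		ScoreDoc[] scoreDocs = topDocs.scoreDocs;

		/*
		 * We now have the ID of every matching Document,
		 * so we can use indexSearcher.doc(id) to load each Document.
		 */
		for(ScoreDoc scoreDoc : scoreDocs) {
			Document document = indexSearcher.doc(scoreDoc.doc);

			Article article = new Article();
			article.setId(Integer.parseInt(document.get("id")));
			article.setTitle(document.get("title"));
			article.setContent(document.get("content"));

			list.add(article);
		}
		// Release the searcher once every Document has been loaded
		indexSearcher.close();

		System.out.println("Matched " + totalCount + " record(s) in total:");
		// Display the results
		for (Article article : list) {
			System.out.println("id:" + article.getId());
			System.out.println("title:" + article.getTitle());
			System.out.println("content:" + article.getContent());
			System.out.println("----------------");
		}

	}
}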

2011-10-24 22:50