Jony.Hwong

Lucene User Guide


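Introduction to Lucene

Lucene is a high-performance, Java-based full-text search toolkit. It is an open-source project of the Apache Software Foundation (it began in the Apache Jakarta family and later became a top-level Apache project) and is among the most popular open-source full-text search toolkits for Java. It is not a complete search application; rather, it provides indexing and search capabilities to applications.

Lucene builds indexes over text data, so any data that can be converted to text can be indexed and searched. HTML and PDF files, for example, can be converted to text and then handed to Lucene for indexing.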
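To make the "convert to text first" step concrete, here is a minimal sketch of stripping an HTML page down to plain text using only the JDK. HtmlTextExtractor and toPlainText are hypothetical names, not part of Lucene, and the regex approach is deliberately naive; a real project would more likely use a proper HTML parser.

```java
import java.util.regex.Pattern;

/** Naive HTML-to-text conversion; a hypothetical helper, not part of Lucene. */
final class HtmlTextExtractor {
    // Script and style bodies are not visible text, so drop them entirely.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
    // Any remaining tag becomes a space so that adjacent words stay separated.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode only a handful of common entities; &amp; goes last so that
        // "&amp;lt;" becomes "&lt;" rather than "<".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Lucene</title><style>p{color:red}</style></head>"
                + "<body><p>Index &amp; search</p></body></html>";
        System.out.println(toPlainText(html)); // prints: Lucene Index & search
    }
}
```

The resulting string is the kind of value that would then be placed into a Lucene field for indexing.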

1. Lucene Environment

1.1 Lucene Version

Version used: Lucene Core 3.4.0, the latest release at the time of writing

Download: http://lucene.apache.org/java/docs/index.html

Linux package: lucene-3.4.0.tgz

Windows package: lucene-3.4.0.zip

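1.2 IKAnalyzer Chinese Analyzer

Lucene exposes tokenization through the abstract class Analyzer, which applications can extend to suit their own needs. Lucene itself ships StandardAnalyzer, which is aimed mainly at English text.

For Chinese text, a fairly mature choice is the open-source IKAnalyzer project. The latest version at the time of writing is IKAnalyzer3.2.8.jar.

Download: http://code.google.com/p/ik-analyzer/downloads/list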

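2. The Relationship Between Lucene and Applications

3. Using the Lucene API

3.1 Building an Index

To index documents, Lucene provides five basic classes: Document, Field, IndexWriter, Analyzer, and Directory. The role of each is described below.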

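Document

A Document describes a document, which may be an HTML page, an e-mail message, or a text file. A Document object is made up of multiple Field objects. You can think of a Document as a database record and of each Field as one column of that record.

Field

A Field describes one attribute of a document. For example, the subject and the body of an e-mail can be described by two separate Field objects.

Analyzer

Before a document is indexed, its content must be tokenized, and that is the Analyzer's job. Analyzer is an abstract class with several implementations; choose the one that suits the language and the application. The Analyzer hands the tokenized content to IndexWriter, which builds the index.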
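As a rough picture of what tokenizing means, the following sketch splits text into lower-cased terms and drops a few stop words. ToyAnalyzer is a hypothetical stand-in written against the JDK only; it is not Lucene's Analyzer API, and its stop-word list is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy stand-in for an analyzer: lower-cases terms and drops stop words. */
final class ToyAnalyzer {
    // A tiny illustrative stop-word list (an assumption, not Lucene's list).
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<String>();
        // Split on every run of characters that is neither a letter nor a digit.
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints: [art, indexing, lucene, 3, 4]
        System.out.println(tokenize("The Art of Indexing: Lucene 3.4"));
    }
}
```

A run of Chinese characters contains no separators, so a splitter like this one would return the whole run as a single term. That is why Chinese text needs a dedicated analyzer such as IKAnalyzer.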

IndexWriter

IndexWriter is the core class Lucene uses to create an index. Its job is to add Document objects to the index one by one.
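What "adding a document to the index" amounts to can be sketched with a toy inverted index, which maps each term to the ids of the documents that contain it. ToyInvertedIndex and its methods are hypothetical names written against the JDK only; Lucene's real on-disk index is far more elaborate, so treat this purely as an illustration of the idea.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index: maps each term to the ids of the documents containing it. */
final class ToyInvertedIndex {
    private final List<String> storedTitles = new ArrayList<String>();
    private final Map<String, SortedSet<Integer>> postings =
            new HashMap<String, SortedSet<Integer>>();

    /** Adds one document and returns the id assigned to it. */
    int addDocument(String title, String content) {
        int docId = storedTitles.size();
        storedTitles.add(title); // kept as-is so it can be returned with a hit
        for (String term : content.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            SortedSet<Integer> docIds = postings.get(term);
            if (docIds == null) {
                docIds = new TreeSet<Integer>();
                postings.put(term, docIds);
            }
            docIds.add(docId);
        }
        return docId;
    }

    /** Returns the titles of all documents whose content contains the term. */
    List<String> search(String term) {
        SortedSet<Integer> docIds = postings.get(term.toLowerCase(Locale.ROOT));
        if (docIds == null) {
            return Collections.emptyList();
        }
        List<String> titles = new ArrayList<String>();
        for (int docId : docIds) {
            titles.add(storedTitles.get(docId));
        }
        return titles;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Intro", "Lucene builds an inverted index");
        index.addDocument("Search", "Searching the index is fast");
        System.out.println(index.search("index")); // prints: [Intro, Search]
    }
}
```

The title here plays the role of a stored field, while the content is only searchable through the term map, which mirrors the stored versus indexed distinction that Field options express.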

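Directory

Directory represents the location where the index is stored. It is an abstract class: FSDirectory stores the index in the file system, and RAMDirectory keeps it in memory.

Example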

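    import java.io.File;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    // HTMLDocParser is not part of Lucene; it is assumed to be an application
    // helper that extracts the path, title, and content of an HTML file.
    public void createIndexs() throws Exception
    {
        String indexDir = "d:\\Temp\\lucence\\indexDir"; // where the index files are written
        String dataDir = "d:\\Temp\\lucence\\dataDir";   // folder holding the HTML files to index
        Analyzer analyzer = new IKAnalyzer(true); // use the Chinese analyzer
        File dir = new File(dataDir);
        File[] files = dir.listFiles();
        Directory fsDirectory = FSDirectory.open(new File(indexDir));
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
        config.setMaxBufferedDocs(1000);
        IndexWriter indexWriter = new IndexWriter(fsDirectory, config);
        for (int i = 0; i < files.length; i++)
        {
            String filePath = files[i].getAbsolutePath();
            if (filePath.endsWith(".html") || filePath.endsWith(".htm"))
            {
                HTMLDocParser htmlParser = new HTMLDocParser(filePath);
                String path = htmlParser.getPath();
                String title = htmlParser.getTitle();
                Reader content = htmlParser.getContent();
                Document document = new Document();
                // path: stored for display, but not searchable
                document.add(new Field("path", path, Field.Store.YES,
                        Field.Index.NO, Field.TermVector.NO));
                // title: stored and analyzed, so it can be both searched and displayed
                document.add(new Field("title", title, Field.Store.YES,
                        Field.Index.ANALYZED,
                        Field.TermVector.WITH_POSITIONS_OFFSETS));
                // content: supplied as a Reader, so it is analyzed and indexed but not stored
                document.add(new Field("content", content,
                        Field.TermVector.WITH_POSITIONS_OFFSETS));
                indexWriter.addDocument(document);
            }
        }
        indexWriter.commit();
        indexWriter.optimize();
        indexWriter.close();
    }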

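3.2 Searching Documents

In the previous section we built an index for the files in a directory. We now search that index to find the documents that contain a given keyword or phrase. Lucene provides several basic classes for this: IndexSearcher, Query, QueryParser, and TopDocs. Each is described below.

Query

Query is an abstract class. Lucene provides different implementations for different kinds of queries, such as TermQuery, BooleanQuery, PrefixQuery, and PhraseQuery. Its purpose is to wrap the user's query string in a form that Lucene can recognize.

QueryParser

If you would rather not learn the seemingly complex query types such as BooleanQuery and PhraseQuery, and simply want to pass in a string and have the search intent worked out for you, use QueryParser. QueryParser parses the user's input and automatically builds a suitable Query subclass for Lucene to search with.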
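The idea of turning a query string into structured clauses can be sketched as follows. ToyQueryParser is a hypothetical JDK-only illustration, not Lucene's QueryParser: it handles only the field:term form and a default field, whereas the real parser also analyzes terms and supports operators, phrases, and more.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy query parsing: turns "title:lucene index" into field/term clauses. */
final class ToyQueryParser {
    /** One parsed clause: which field to search and which term to look for. */
    static final class Clause {
        final String field;
        final String term;

        Clause(String field, String term) {
            this.field = field;
            this.term = term;
        }

        @Override
        public String toString() {
            return field + ":" + term;
        }
    }

    private final String defaultField;

    ToyQueryParser(String defaultField) {
        this.defaultField = defaultField;
    }

    List<Clause> parse(String queryString) {
        List<Clause> clauses = new ArrayList<Clause>();
        for (String token : queryString.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            int colon = token.indexOf(':');
            if (colon > 0 && colon < token.length() - 1) {
                // Explicit "field:term" form.
                clauses.add(new Clause(token.substring(0, colon), token.substring(colon + 1)));
            } else {
                // Bare term: fall back to the default field.
                clauses.add(new Clause(defaultField, token));
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        ToyQueryParser parser = new ToyQueryParser("content");
        // prints: [title:lucene, content:index]
        System.out.println(parser.parse("title:lucene index"));
    }
}
```

The default field plays the same role as the field name passed to the QueryParser constructor: terms without an explicit field prefix are searched there.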

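IndexSearcher

IndexSearcher searches an index that has already been built. It opens the index for reading only, so multiple IndexSearcher instances can work on the same index at once.

TopDocs

TopDocs holds the search results: the N highest-scoring records.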
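Keeping only the N best-scoring hits is commonly done with a small min-heap, so the weakest retained hit can be evicted cheaply. The sketch below shows that general technique using the JDK only; TopNCollector and Hit are hypothetical names, and this is not Lucene's own collector code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy top-N selection: keeps only the N highest-scoring hits. */
final class TopNCollector {
    /** A hit is a document id plus its relevance score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private static final Comparator<Hit> BY_SCORE_ASCENDING = new Comparator<Hit>() {
        @Override
        public int compare(Hit a, Hit b) {
            return Float.compare(a.score, b.score);
        }
    };

    static List<Hit> topN(List<Hit> allHits, int n) {
        if (n <= 0) {
            return new ArrayList<Hit>();
        }
        // Min-heap: the weakest retained hit sits at the head and is evicted first.
        PriorityQueue<Hit> heap = new PriorityQueue<Hit>(n, BY_SCORE_ASCENDING);
        for (Hit hit : allHits) {
            heap.offer(hit);
            if (heap.size() > n) {
                heap.poll();
            }
        }
        // Return the survivors best-first.
        List<Hit> best = new ArrayList<Hit>(heap);
        Collections.sort(best, Collections.reverseOrder(BY_SCORE_ASCENDING));
        return best;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<Hit>();
        hits.add(new Hit(1, 0.2f));
        hits.add(new Hit(2, 0.9f));
        hits.add(new Hit(3, 0.5f));
        hits.add(new Hit(4, 0.7f));
        for (Hit hit : topN(hits, 2)) {
            System.out.println(hit.doc + " " + hit.score);
        }
    }
}
```

In Lucene's TopDocs, totalHits reports how many documents matched in total, while scoreDocs contains at most the N requested entries.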

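Example

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    // SearchResultBean is not part of Lucene; it is assumed to be a simple application bean.
    public List<SearchResultBean> search(String strQuery) throws Exception
    {
        List<SearchResultBean> searchResult = new ArrayList<SearchResultBean>();
        String indexDir = "d:\\Temp\\lucence\\indexDir";
        String field = "content"; // default field for terms without a field prefix
        Analyzer analyzer = new IKAnalyzer(true); // use the Chinese analyzer
        Directory fsDirectory = FSDirectory.open(new File(indexDir));
        IndexSearcher indexSearcher = new IndexSearcher(fsDirectory, true); // true = read-only
        QueryParser queryParser = new QueryParser(Version.LUCENE_34, field, analyzer);
        Query query = queryParser.parse(strQuery);
        if (null != query)
        {
            TopDocs hits = indexSearcher.search(query, 1000); // keep at most the 1000 best hits
            ScoreDoc[] docs = hits.scoreDocs;
            for (int i = 0; i < docs.length; i++)
            {
                SearchResultBean resultBean = new SearchResultBean();
                Document doc = indexSearcher.doc(docs[i].doc);
                resultBean.setHtmlPath(doc.get("path"));
                resultBean.setHtmlTitle(doc.get("title"));
                searchResult.add(resultBean);
            }
        }
        indexSearcher.close();
        return searchResult;
    }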

4. Appendix

A simple Lucene demo project is provided; see lucene-demo.rar (2.5 MB).


2011-12-09 15:20

Comment 1, xiaolv, 2012-02-08

String indexDir = "d:\\Temp\\lucence\\indexDir";

String dataDir = "d:\\Temp\\lucence\\dataDir";

What are these two paths for?
Do I also need to create the *.html files under them myself?
