Beginning Lucene

DAOException

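Lucene is a Java-based full-text information retrieval toolkit. It is not a complete search application; rather, it provides indexing and search capabilities for your own application. Lucene is currently an open-source project in the Apache Jakarta family, and it is currently the most popular open-source full-text search toolkit for Java.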

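Many applications already build their search features on Lucene, for example the search function of the Eclipse help system. Lucene builds indexes over text data, so as long as you can convert your data into text, Lucene can index and search it. If you want to index HTML or PDF documents, for instance, you first convert them to plain text, then hand the converted content to Lucene for indexing, save the resulting index files to disk or keep them in memory, and finally run the user's queries against the index. Because Lucene does not prescribe the format of the documents to be indexed, it can be used in almost any search application.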
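The "convert to text first" step can be sketched in plain Java. This is only a rough illustration, not part of Lucene: the class name `HtmlToTextSketch` and the method `toPlainText` are made up for this example, and regex-based tag stripping is a crude stand-in for a real HTML parser.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: reduces an HTML string to plain text before indexing. */
class HtmlToTextSketch {
    // Whole <script>...</script> and <style>...</style> elements, including their bodies
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
    // Any remaining tag
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a handful of common entities; "&amp;" goes last to avoid double decoding
        text = text.replace("&nbsp;", " ").replace("&lt;", "<")
                   .replace("&gt;", ">").replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Lucene</h1><p>Full-text <b>search</b> toolkit</p></body></html>";
        System.out.println(toPlainText(html)); // prints: Lucene Full-text search toolkit
    }
}
```

The resulting string is what would be handed to the indexer in place of the raw HTML.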

[Note] The content shown in red in the original post is taken from an IBM blog.

Everything above is filler, the obligatory formal introduction. Let's look instead at where Lucene fits in a search application.

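The figure above shows where Lucene sits in a search application. The user enters a query on a web page (or through some other interface), and the index lets us find what the user is looking for. This is often misunderstood: many people assume that the server goes out and crawls for data when the user submits a query. That is not how it works. As the figure shows, the raw data may come from the web, a file system, or a database, and it does have to be gathered by crawling, but the crawling does not happen at query time. The search engine crawls the data ahead of time and builds an index over it. Indexing processes the crawled data (tokenization and so on), extracts the useful information, and records it in the engine's own index store. When the user runs a query, the engine only looks up the locations of the matching documents in that index, and from those returns the relevant information. Lucene's role, as the figure shows, is exactly these two steps: building the index and querying data through it.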
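The split between index time and query time can be sketched with a toy inverted index in plain Java. This is a minimal sketch and not how Lucene is implemented internally; the class `InvertedIndexSketch` and its methods are hypothetical names chosen for illustration, and the tokenization is a naive split on non-letter, non-digit characters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: maps each term to the documents that contain it. */
class InvertedIndexSketch {
    // term -> sorted set of document names containing that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Index time: split the text into lower-cased terms and record the document under each. */
    void addDocument(String docName, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    /** Query time: a single map lookup; the original text is never rescanned. */
    Set<String> search(String term) {
        Set<String> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(docs);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("a.txt", "Lucene builds an index");
        index.addDocument("b.txt", "Searching uses the index, not the crawler");
        System.out.println(index.search("index")); // prints: [a.txt, b.txt]
    }
}
```

All the expensive work happens in `addDocument`; `search` only reads what was recorded earlier, which is why crawling at query time is unnecessary.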

Now let's look at Lucene's main packages.

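1) org.apache.lucene.analysis: language analyzers, used mainly for tokenization. Analyzer is an abstract class that manages the rules for splitting text into tokens.

2) org.apache.lucene.document: manages the document structure used when the index is stored, similar to a table schema in a relational database.

3) The document package is relatively simple. It contains three classes; Document corresponds to a record in a relational database, and Field manages the individual fields.

4) org.apache.lucene.index: index management, including building and deleting indexes. The index package is the core of the whole system. The essence of full-text retrieval is to build an index entry for every token, so that a query only has to traverse the index rather than the original text, which greatly improves retrieval efficiency.

5) org.apache.lucene.queryParser: the query parser, which implements operations between query keywords such as AND, OR, and NOT.

6) org.apache.lucene.search: search management; retrieves results according to the query conditions.

7) org.apache.lucene.store: data storage management, mainly low-level I/O operations.

8) org.apache.lucene.util: shared utility classes.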
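Items 4) and 5) can be read together: once every token has its own list of matching documents, AND, OR, and NOT between keywords reduce to set operations on those lists. The sketch below shows the idea in plain Java. It is not Lucene's query parser; `BooleanPostingsSketch` and its methods are hypothetical names, and the posting sets are hard-coded sample data.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Boolean keyword operators expressed as set operations on per-term document sets. */
class BooleanPostingsSketch {
    /** Documents containing both terms. */
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** Documents containing either term. */
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** Documents containing the first term but not the second. */
    static Set<String> andNot(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Sample posting sets: which files contain each keyword (made-up data)
        Set<String> lucene = new TreeSet<>(Arrays.asList("a.txt", "b.txt", "c.txt"));
        Set<String> search = new TreeSet<>(Arrays.asList("b.txt", "c.txt", "d.txt"));
        Set<String> crawler = new TreeSet<>(Arrays.asList("c.txt"));

        System.out.println(and(lucene, search));     // prints: [b.txt, c.txt]
        System.out.println(or(lucene, search));      // prints: [a.txt, b.txt, c.txt, d.txt]
        System.out.println(andNot(lucene, crawler)); // prints: [a.txt, b.txt]
    }
}
```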

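Let's use an example to illustrate how Lucene works. Before we can query data with a search engine, we must first build an index over the data we want to search: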

package com.foolfish.lucene;

/**
 * @author foolfish.chen
 * @E-mail jianguo1001@gmail.com
 */
import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
/**
 * This class demonstrates the process of creating an index with Lucene
 * for text files
 */
public class TxtFileIndexer {
	public static void main(String[] args) throws Exception{
		//indexDir is the directory that hosts Lucene's index files
        File   indexDir = new File("D:\\luceneIndex");
        //dataDir is the directory that hosts the text files that are to be indexed
        File   dataDir  = new File("D:\\luceneData");
        Analyzer luceneAnalyzer = new StandardAnalyzer();//use the standard analyzer
        File[] dataFiles  = dataDir.listFiles();//list every file in the data directory
        if(dataFiles == null){//listFiles() returns null when dataDir is missing or is not a directory
        	System.out.println("Data directory not found: " + dataDir.getPath());
        	return;
        }
        IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,true);//(1) create the IndexWriter
      //The first argument is the index directory, the second is the analyzer to use,
      //and the third says whether to create a new index, replacing any existing one.
        long startTime = new Date().getTime();
        for(int i = 0; i < dataFiles.length; i++){//loop over every file in the data directory
        	if(dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".txt")){//only index .txt files
        		System.out.println("Indexing file " + dataFiles[i].getCanonicalPath());
        		Document document = new Document();//(2) create the Document
        		Reader txtReader = new FileReader(dataFiles[i]);
        		//Add fields to the document. For each Field, the first argument is the field name
        		//and the second is the field content.
        		document.add(new Field("path",dataFiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.TOKENIZED));
        		document.add(new Field("contents",txtReader));
        		indexWriter.addDocument(document);
        	}
        }
        indexWriter.optimize();
        indexWriter.close();
        long endTime = new Date().getTime();
        
        System.out.println("It takes " + (endTime - startTime) 
                           + " milliseconds to create index for the files in directory "
        		           + dataDir.getPath());        
	}
}

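Running the program above builds the index. Once the index is built, the next step is to query the data through it. Lucene provides classes for this as well, as the following example shows: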

package com.foolfish.lucene;

/**
 * @author foolfish.chen
 * @E-mail jianguo1001@gmail.com
 */
import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;
/**
 * This class is used to demonstrate the 
 * process of searching on an existing 
 * Lucene index
 *
 */
public class TxtFileSearcher {
	public static void main(String[] args) throws Exception{
	    String queryStr = "你好";
	    //This is the directory that hosts the Lucene index
        File indexDir = new File("D:\\luceneIndex");
        //Check that the index exists before trying to open it
        if(!indexDir.exists()){
        	System.out.println("The Lucene index does not exist");
        	return;
        }
        FSDirectory directory = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher searcher = new IndexSearcher(directory);
        //A TermQuery is not run through an analyzer, so the term must exactly match a token stored in the index
        Term term = new Term("contents",queryStr.toLowerCase());
        TermQuery luceneQuery = new TermQuery(term);
        Hits hits = searcher.search(luceneQuery);
        //Print the stored path of every matching document
        for(int i = 0; i < hits.length(); i++){
        	Document document = hits.doc(i);
        	System.out.println("File: " + document.get("path"));
        }
        searcher.close();
	}
}

That's it. With the code above we can search the index built from the text files in the data directory.
