Notes on Using Lucene

flyPig


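The two most important concepts in Lucene are indexing and searching.
   Indexing: a classic example is searching in Eclipse for every file that contains the string "aaa". Scanning the files one after another is painfully slow. This is where an index comes in: to search a large amount of text quickly, you first process the text and convert it into a format that supports fast lookup, which removes the slow sequential scan. This conversion process is called indexing, and its output is called an index. You can think of an index as a data structure that gives fast random access to the words stored in it.
   Searching: searching is the process of looking up a given string in an index to find the documents in which it occurs.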
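
To make the idea concrete, here is a minimal sketch of an inverted index in plain Java. It is only an illustration of the concept, not how Lucene stores its index; the class name TinyInvertedIndex and its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index: maps each lowercased word to the ids of the documents containing it.
// Illustration only; Lucene's real index format is far more elaborate.
final class TinyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<String, TreeSet<Integer>>();
    private int nextDocId = 0;

    // "Indexing": split the text into words once and record where each word occurs.
    int add(String text) {
        int docId = nextDocId++;
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue; // split() yields a leading empty string when the text starts with punctuation
            }
            TreeSet<Integer> ids = postings.get(word);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(word, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    // "Searching": one map lookup instead of scanning every document.
    List<Integer> search(String word) {
        TreeSet<Integer> ids = postings.get(word.toLowerCase());
        return ids == null ? Collections.<Integer>emptyList() : new ArrayList<Integer>(ids);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("String name = aaa;");
        index.add("int count = 0;");
        index.add("return aaa + count;");
        System.out.println(index.search("aaa"));   // prints [0, 2]
        System.out.println(index.search("count")); // prints [1, 2]
    }
}
```

The cost of splitting the text is paid once, in add(); every later search is a single lookup, which is the trade-off the paragraph above describes.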

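A brief description of some basic classes:
Document: a Document is one unit to be indexed. Any file you want to index must first be converted into a Document object; you can think of it as a row in a database table.
Field: like a column in a database. It has three attributes: isStored (whether the value is stored), isIndexed (whether it is indexed) and isTokenized (whether it is tokenized).
IndexWriter: mainly used to add Documents to the index, and to control some parameters of the indexing process.
IndexSearcher: the most basic search tool in Lucene; every search goes through an IndexSearcher.
Analyzer: the analyzer, which breaks the text the search engine encounters into terms that can be indexed.
Directory: where the index is stored. Lucene offers two locations, disk and memory, through the FSDirectory and RAMDirectory classes respectively. Normally the index is kept on disk.
Query: a query. Lucene supports term, phrase, wildcard, fuzzy, range and combined Boolean queries, through classes such as TermQuery, BooleanQuery, RangeQuery and WildcardQuery.
QueryParser: a tool for parsing user input. It scans the string the user typed and produces a Query object.
Hits: once a search has finished, the results have to be returned and shown to the user; only then has the search served its purpose. In Lucene, the set of search results is represented by an instance of the Hits class.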

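   The following example builds an index over all the java files under a given directory and then searches for every java file that contains the string "String".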

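// Imports assume the Lucene 2.x API (Hits, Field.Index.TOKENIZED) used throughout this post.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.text.SimpleDateFormat;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class FileSearch {
	// HH is the 24-hour clock; with hh (12-hour) 05:00 and 17:00 would format identically
	private static SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");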
	public static void main(String[] args) throws Exception {		
		File indexDir = new File("e:\\lucene");
		File dataDir = new File("e:\\luceneData");

		long start = System.currentTimeMillis();
		int numIndexed = index(indexDir, dataDir);
		long end = System.currentTimeMillis();
		System.out.println("Indexing " + numIndexed + " files took "
				+ (end - start) + " milliseconds");
		
		search(indexDir,"String");
	}

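	/**
	 * Builds an index for the files under dataDir and stores it in indexDir.
	 * 
	 * @param indexDir directory that receives the index files
	 * @param dataDir directory whose files are indexed
	 * @return the number of documents indexed
	 * @throws IOException if dataDir is invalid or the index cannot be written
	 */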
	public static int index(File indexDir, File dataDir) throws IOException {

		if (!dataDir.exists() || !dataDir.isDirectory()) {
			throw new IOException(dataDir
					+ " does not exist or is not a directory");
		}
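		/*
		 * StandardAnalyzer is Lucene's built-in standard analyzer. The third argument
		 * controls whether an existing index in that directory is overwritten:
		 * true overwrites it, false keeps it and appends.
		 */
		IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(),
				true);
		writer.setUseCompoundFile(false);
		indexDirectory(writer, dataDir);
		int numIndexed = writer.docCount();
		// optimize() merges the remaining segments to reduce their number, ideally down to a single one
		writer.optimize();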
		writer.close();
		return numIndexed;
	}

	public static void search(File indexDir, String q) throws Exception {
		Directory fsDir = FSDirectory.getDirectory(indexDir, false);
		IndexSearcher is = new IndexSearcher(fsDir); // open the index
		try {
			QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
			Query query = qp.parse(q);
			long start = System.currentTimeMillis();
			Hits hits = is.search(query); // search the index
			long end = System.currentTimeMillis();

			System.err.println("Found " + hits.length() + " document(s) (in "
					+ (end - start) + " milliseconds) that matched query '" + q
					+ "':");

			for (int i = 0; i < hits.length(); i++) {
				Document doc = hits.doc(i); // fetch the matching document
				System.out.println(doc.get("filename"));
			}
		} finally {
			is.close(); // release the index files
		}
	}

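	/**
	 * Recursively indexes the .java files under dir.
	 * 
	 * @param writer the open IndexWriter
	 * @param dir the directory to walk
	 * @throws IOException if a file cannot be read or indexed
	 */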
	private static void indexDirectory(IndexWriter writer, File dir)
			throws IOException {
		for (File f : dir.listFiles()) {
			if (f.isDirectory()) {
				indexDirectory(writer, f);
			} else if (f.getName().endsWith(".java")) {
				indexFile(writer, f);
			}
		}
	}

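	/**
	 * Indexes the content of a single file.
	 * 
	 * @param writer the open IndexWriter
	 * @param f the file to index
	 * @throws IOException if the file cannot be read or indexed
	 */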
	private static void indexFile(IndexWriter writer, File f)
			throws IOException {
		if (f.isHidden() || !f.exists() || !f.canRead()) {
			return;
		}
		/* Create a document for this file */
		Document doc = new Document();
		/*
		 * Create the filename field with the file path as its value and add it to the Document.
		 * Field.Store.YES stores the value in the index so it can be read back from search results.
		 * Field.Index.TOKENIZED analyzes the value and indexes it so it can be searched.
		 */
		doc.add(new Field("filename", f.getCanonicalPath(), Field.Store.YES,
				Field.Index.TOKENIZED));
		doc.add(new Field("filedate", format.format(f.lastModified()), Field.Store.YES,
				Field.Index.TOKENIZED));
		/*
		 * Create the contents field from the file's text and add it to the Document.
		 */
		doc.add(new Field("contents", new BufferedReader(new FileReader(f)))); // index the file content
		/* Add the document to the index */
		writer.addDocument(doc);
	}
}

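StandardAnalyzer is the analyzer that ships with Lucene. Its main functions are:
1. It splits the original sentence into words, mainly at whitespace.
2. It converts all uppercase letters to lowercase.
3. It removes common words that carry little meaning, such as "is", "are" and "the", and it also strips punctuation.
It does not handle Chinese well; for Chinese search, other analyzer packages are available.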
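
The three steps can be imitated in a few lines of plain Java. This is only a rough sketch of the behaviour described above, not StandardAnalyzer's actual implementation; ToyAnalyzer is a made-up name and its stop-word set is a small hypothetical sample rather than Lucene's real list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer that imitates the three steps: split, lowercase, drop stop words.
final class ToyAnalyzer {
    // Hypothetical sample of stop words, NOT Lucene's actual list.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("is", "are", "the", "a", "an"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        // Split on anything that is not a letter or digit; this also discards punctuation.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT); // lowercase every token
            if (!STOP_WORDS.contains(token)) {           // drop stop words
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [index, fast, scan, slow]
        System.out.println(analyze("The Index IS fast, the scan is slow."));
    }
}
```

Because the same analysis is applied to both the indexed text and the query, a search for "String" also matches "string" in the example program.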

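This example only searches the file content, that is, a single Field. If I want to search for a word and also restrict the date, I need a multi-field search. If the index files are stored in several directories, I need a multi-directory search. The following combines the two: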

// Besides the imports of FileSearch, this method needs (Lucene 2.x):
// org.apache.lucene.queryParser.MultiFieldQueryParser,
// org.apache.lucene.search.BooleanClause and org.apache.lucene.search.MultiSearcher
public static void searchMore(File indexDir, String q,String date) throws Exception {
		Directory fsDir = FSDirectory.getDirectory(indexDir, false);
		IndexSearcher is = new IndexSearcher(fsDir); // open the index
		/* Multiple directories; here there is only one */
		IndexSearcher indexSearchers[] = { is };
		// Search the contents and filedate fields: q goes to contents, date to filedate
		String[] fields = { "contents", "filedate" };
		String[] queries = {q,date};
		// Both clauses must match
		BooleanClause.Occur[] clauses = { BooleanClause.Occur.MUST, BooleanClause.Occur.MUST };
		Query query = MultiFieldQueryParser.parse(queries, fields, clauses, new StandardAnalyzer());
		/* Multi-directory search; here there is only one directory */
		MultiSearcher searcher = new MultiSearcher(indexSearchers);
		try {
			long start = System.currentTimeMillis();
			Hits hits = searcher.search(query); // search the index
			long end = System.currentTimeMillis();

			System.err.println("Found " + hits.length() + " document(s) (in "
					+ (end - start) + " milliseconds) that matched query '" + q
					+ "':");

			for (int i = 0; i < hits.length(); i++) {
				Document doc = hits.doc(i); // fetch the matching document
				System.out.println(doc.get("filename"));
			}
		} finally {
			searcher.close(); // also closes the wrapped IndexSearcher
		}
	}

This finds every file whose content and date both contain the given strings.
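
Conceptually, two MUST clauses mean that a document has to appear in the result set of each clause, so the combined result is the intersection. A minimal sketch in plain Java, using made-up document ids and a hypothetical class name MustIntersection (this is not how Lucene evaluates the query internally):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class MustIntersection {
    // Documents matching two MUST clauses are those present in both result sets.
    static Set<Integer> mustBoth(Set<Integer> first, Set<Integer> second) {
        Set<Integer> result = new TreeSet<Integer>(first);
        result.retainAll(second); // keep only ids that also appear in the second set
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical document ids matched by each single-field clause.
        Set<Integer> matchesWord = new TreeSet<Integer>(Arrays.asList(1, 3, 4, 7));
        Set<Integer> matchesDate = new TreeSet<Integer>(Arrays.asList(3, 5, 7));
        System.out.println(mustBoth(matchesWord, matchesDate)); // prints [3, 7]
    }
}
```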
If you need a fuzzy match on a single field using wildcards, you can do this:

WildcardQuery query2 = new WildcardQuery(new Term("contents", "*er*"));
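
As I understand it, in a wildcard pattern `*` stands for any run of characters and `?` for exactly one character, and the pattern is compared against individual indexed terms rather than the whole text. The sketch below imitates that matching with java.util.regex; WildcardSketch is a made-up helper for illustration, not part of Lucene.

```java
import java.util.regex.Pattern;

final class WildcardSketch {
    // Translate a wildcard pattern ('*' = any run of characters, '?' = exactly one) into a regex.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // treat everything else literally
            }
        }
        return Pattern.compile(regex.toString());
    }

    // The whole term has to match the pattern, not just a part of it.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("*er*", "writer")); // prints true
        System.out.println(matches("*er*", "index"));  // prints false
        System.out.println(matches("te?t", "text"));   // prints true
    }
}
```

A pattern that starts with `*` cannot use the sorted order of the terms to narrow the search, so it tends to be slow on a large index.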

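So far we can only search txt or txt-like files, such as .java, .cpp and .properties. What about common formats such as Word, Excel, PDF and HTML? Lucene builds its index from String data, so the content of Word, PDF or HTML files must first be converted to a String.
For this I defined a conversion interface: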

public interface FileConvert {
	public String read(String path) throws Exception;
}

First, the conversion for ordinary txt-like files.

public class TxtFileReader implements FileConvert {

	public String read(String path) throws Exception {
		StringBuffer content = new StringBuffer(""); // document content
		BufferedReader br = null;
		try {
			br = new BufferedReader(new FileReader(path));
			String s1 = null;
			while ((s1 = br.readLine()) != null) {
				// readLine() strips the line terminator, so put a newline back
				content.append(s1).append("\n");
			}
		} finally {
			if(br != null) {
				br.close();
			}
		}
		return content.toString().trim();
	}
}

For HTML files we only need the actual content; tags such as tr and td carry no meaning. The Lucene demo contains a ready-made HTMLParser that strips this markup.

public class HTMLFileReader implements FileConvert {

	/* (non-Javadoc)
	 * @see lucene.reader.FileConvert#read(java.lang.String)
	 */
	@Override
	public String read(String path) throws Exception {
		StringBuffer content = new StringBuffer("");
        BufferedReader reader = null;
        try {
        	FileInputStream fis = new FileInputStream(path);
        	// The character encoding here must match the one declared in the HTML header, otherwise the text comes out garbled
            HTMLParser htmlParser = new HTMLParser(new InputStreamReader(fis,"utf-8"));
            reader = new BufferedReader(htmlParser.getReader());
            String line = null;
            while ((line = reader.readLine()) != null) {
                content.append(line + "\n");
            }          
        } finally {
        	if(reader != null) {
        		reader.close();
        	}
        }
        String contentString = content.toString();
        return contentString;
	}

}

PDF files can be handled with PDFBox.

public class PDFFileReader implements FileConvert{

	@Override
	public String read(String path) throws Exception {
		 StringBuffer content = new StringBuffer(""); // document content
	     FileInputStream fis = null;
	     PDDocument pdf = null;
	     try{
	    	 fis = new FileInputStream(path);
		     PDFParser p = new PDFParser(fis);
		     p.parse();
		     pdf = p.getPDDocument();
		     PDFTextStripper ts = new PDFTextStripper();
		     content.append(ts.getText(pdf));
	     }finally {
	    	 if(pdf != null) {
	    		 pdf.close(); // release the resources held by the parsed document
	    	 }
	    	 if(fis != null) {
	    		 fis.close();
	    	 }
	     }	     
	     return content.toString().trim();
	}
}

Word files can be handled with POI.

public class DocFileReader implements FileConvert {

	/* (non-Javadoc)
	 * @see lucene.reader.FileConvert#read(java.lang.String)
	 */
	@Override
	public String read(String path) throws Exception {
		 StringBuffer content = new StringBuffer(""); // document content
	     FileInputStream fis = new FileInputStream(path);
	     try {
	    	 HWPFDocument doc = new HWPFDocument(fis);
	    	 Range range = doc.getRange();
	    	 int paragraphCount = range.numParagraphs();
	    	 for (int i = 0; i < paragraphCount; i++) { // read the paragraphs one by one
	    		 Paragraph pp = range.getParagraph(i);
	    		 content.append(pp.text());
	    	 }
	     } finally {
	    	 fis.close(); // the original never closed this stream
	     }
	     return content.toString().trim();
	}
}

A simple wrapper around these:

public class StrategyReader implements FileConvert{
	private Map<String,FileConvert> convertMap = new HashMap<String,FileConvert>();
	
	public StrategyReader() {		
	}
	
	// Must be called by the client code to register the default converters
	public void init() {
		FileConvert docFileReader = new DocFileReader();
		FileConvert htmlFileReader = new HTMLFileReader();
		FileConvert pdfFileReader = new PDFFileReader();
		FileConvert txtFileReader = new TxtFileReader();
		convertMap.put("txt", txtFileReader);
		convertMap.put("java", txtFileReader);
		convertMap.put("pdf", pdfFileReader);
		convertMap.put("html", htmlFileReader);
		convertMap.put("doc", docFileReader);
	}
	public void setConvertMap(Map<String, FileConvert> convertMap) {
		// Can be set from a configuration file, e.g. as a reference injected by an IoC container
		this.convertMap = convertMap;
	}

	@Override
	public String read(String path) throws Exception {
		int suffixIndex = path.lastIndexOf(".");
		if(suffixIndex < 0) {
			return convertMap.get("txt").read(path);
		} else {
			// Skip the dot so that the suffix matches map keys such as "txt"
			String suffix = path.substring(suffixIndex + 1).toLowerCase();
			FileConvert convert = convertMap.get(suffix);
			if(convert != null) {
				return convert.read(path);
			} else {
				throw new Exception("can not convert " + path);
			}
		}
	}
	
}

With this in place, the sample program above only needs small changes to indexFile and indexDirectory to bring in StrategyReader and index these file types as well.
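
One possible shape for the indexDirectory change is to replace the hard-coded ".java" check with a lookup of the file extension. The helper below is a sketch under that assumption; ExtensionFilter and its methods are hypothetical names, and the extension set simply mirrors what StrategyReader.init() registers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: decides whether a file name has an extension a converter exists for.
final class ExtensionFilter {
    // Mirrors the keys registered in StrategyReader.init().
    private static final Set<String> SUPPORTED =
            new HashSet<String>(Arrays.asList("txt", "java", "pdf", "html", "doc"));

    // Returns the lowercased extension without the dot, or "" if there is none.
    // Pass a bare file name (File.getName()), not a full path, so dots in directory names do not interfere.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    static boolean isSupported(String fileName) {
        return SUPPORTED.contains(extensionOf(fileName));
    }

    public static void main(String[] args) {
        System.out.println(isSupported("Report.PDF")); // prints true
        System.out.println(isSupported("photo.png"));  // prints false
        System.out.println(extensionOf("README"));     // prints an empty line
    }
}
```

In indexDirectory the condition would then read something like ExtensionFilter.isSupported(f.getName()), and indexFile would build the contents field from the String returned by StrategyReader.read(...), using the same four-argument Field constructor the example already uses for filename.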

Some other operations

// Delete the index (needs org.apache.lucene.index.IndexReader)
    public void deleteIndex(String indexDir){
        try {
            long start = System.currentTimeMillis();
            IndexReader reader = IndexReader.open(indexDir);
            // Document ids run up to maxDoc(); numDocs() leaves out documents already marked as deleted
            int maxDoc = reader.maxDoc();
            for (int i = 0; i < maxDoc; i++) {
                // Deletion only sets a mark: after deleteDocument runs, a file with a del suffix appears that records the marked documents
                if (!reader.isDeleted(i)) {
                    reader.deleteDocument(i);
                }
            }
            reader.close();
            long end = System.currentTimeMillis();
            System.out.println("delete index: " + (end - start) + " total milliseconds");
        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
        }
    }

    // Restore the documents that were marked as deleted
    public void unDeleteIndex(String indexDir){
        try {
            IndexReader reader = IndexReader.open(indexDir);
            reader.undeleteAll();
            reader.close();
        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
        }
     }
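
The mark-then-restore behaviour can be pictured as a bit set kept next to the documents. The following is a simplified model in plain Java, not Lucene's implementation; DeletionMarks is a made-up class whose method names merely echo the IndexReader calls used above.

```java
import java.util.BitSet;

// Simplified model of delete marks: documents stay in place, only a flag changes.
final class DeletionMarks {
    private final int maxDoc;                   // number of document slots, deleted ones included
    private final BitSet deleted = new BitSet();

    DeletionMarks(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    void deleteDocument(int docId) { deleted.set(docId); }   // mark only, nothing is erased
    void undeleteAll()             { deleted.clear(); }      // drop every mark again
    boolean isDeleted(int docId)   { return deleted.get(docId); }
    int maxDoc()                   { return maxDoc; }
    int numDocs()                  { return maxDoc - deleted.cardinality(); } // live documents only

    public static void main(String[] args) {
        DeletionMarks marks = new DeletionMarks(5);
        marks.deleteDocument(1);
        marks.deleteDocument(3);
        System.out.println(marks.numDocs()); // prints 3
        System.out.println(marks.maxDoc());  // prints 5
        marks.undeleteAll();
        System.out.println(marks.numDocs()); // prints 5
    }
}
```

In this model the live count shrinks as marks are set while the ids themselves do not move, which suggests why a loop over document ids is usually bounded by the slot count rather than the live count.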

2009-05-29 17:37