
My Lucene performance test, and questions about its search speed; advice from experts welcome


chudu

Posted: 2010-07-24


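I recently started learning Lucene and have read quite a few performance tests online. Many of them are copied from one another, and all claim that searching 1 million records takes on the order of 0.x seconds. That seemed fast enough for the project I am working on, but to be careful I ran my own performance test. I could not reproduce the search speed reported online, which puzzles me, so I am posting my code and environment details.

     1. My search time does not get down to 0.x seconds. Is there a problem somewhere in my code?

     2. Why does the first search after building the index often reach the millisecond range, while every later search takes more than a second?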

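Test environment:

     CPU: P8600; memory: 4 GB; total size of the source files: 1.18 GB; number of files: 6938.

     Stored but not analyzed fields: file name, file path, time. Analyzed but not stored: file content. Lucene 3.0.0 with the IKAnalyzer 3.2 analyzer.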

Test results:

mergeFactor	maxBufferedDocs	Analyzer	Index size (MB)	Indexing time (ms)	Search 1 (ms)	Search 2 (ms)	Search 3 (ms)	Search 4 (ms)	Search 5 (ms)	Result count
default (10)	default (10)	IKAnalyzer3.2	773	1204218	641	1266	1235	1266	1250	100
100	default (10)	IKAnalyzer3.2	782	1142031	657	1344	1328	1343	1328	100
1000	default (10)	IKAnalyzer3.2	782	1143469	672	1359	1313	1344	1344	100
default (10)	100	IKAnalyzer3.2	772	1074516	609	1281	1282	1266	1282	100
default (10)	1000	IKAnalyzer3.2	773	1124516	593	1297	1281	1297	1281	100
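
To make the gap between the first search and the later ones easier to see, here is a small sketch that summarizes the search columns of the table. It assumes the five search times in each row are successive runs, with the first one taken right after indexing; the class and method names are made up for illustration.

```java
class SearchTimeSummary {

    /** Mean of every run after the first, in milliseconds (needs at least two runs). */
    static double meanOfLaterRuns(long[] runsMs) {
        long sum = 0;
        for (int i = 1; i < runsMs.length; i++) {
            sum += runsMs[i];
        }
        return (double) sum / (runsMs.length - 1);
    }

    public static void main(String[] args) {
        // Search times (ms) copied from the table, one row per configuration.
        long[][] rows = {
            {641, 1266, 1235, 1266, 1250},
            {657, 1344, 1328, 1343, 1328},
            {672, 1359, 1313, 1344, 1344},
            {609, 1281, 1282, 1266, 1282},
            {593, 1297, 1281, 1297, 1281},
        };
        for (long[] row : rows) {
            double later = meanOfLaterRuns(row);
            System.out.printf("first=%d ms, mean of later runs=%.2f ms, ratio=%.2f%n",
                    row[0], later, later / row[0]);
        }
    }
}
```

In every configuration the later runs take roughly twice as long as the first one, while changing mergeFactor or maxBufferedDocs barely moves the search time.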

Code:

package org.lucene.mytest;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.Date;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.SimpleFSDirectory;
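import org.apache.lucene.util.Version;
// IKAnalyzer 3.x class (assumed package name; adjust if your jar differs)
import org.wltea.analyzer.lucene.IKAnalyzer;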
public class LuceneTest {
    static File dataDir = new File("D:\\study\\lucene\\data");
    static File indexDir = new File("D:\\study\\lucene\\index");
    static String content = "content";
    static String name = "fileName";
    static String indexDate = "indexDate";
    static String filePath = "filePath";
    public static void main(String[] args) {
        long startTime = System.currentTimeMillis();
        try {
            // Build the index; comment this out when only testing search.
            indexDocs(new SimpleFSDirectory(indexDir));
        } catch (IOException e) {
            e.printStackTrace();
        }
        long endTime = System.currentTimeMillis();
        System.out.println("index:-----" + (endTime - startTime));
        // Search
        long startTime1 = System.currentTimeMillis();
        testSearch();
        long endTime1 = System.currentTimeMillis();
        System.out.println("search:----" + (endTime1 - startTime1));
    }
    public static void testSearch() {
        try {
            Directory indexFSDir = new SimpleFSDirectory(indexDir);
            IndexReader indexReader = IndexReader.open(indexFSDir, true);
            Searcher searcher = new IndexSearcher(indexReader);
            QueryParser parser = new QueryParser(Version.LUCENE_CURRENT,
                    content, new IKAnalyzer());
            Query query = parser.parse("经受住了大战和时间的洗礼");
            TopDocs tdoc = searcher.search(query, 100);
            for (ScoreDoc scoreDoc : tdoc.scoreDocs) {
                Document doc = searcher.doc(scoreDoc.doc);
            }
            searcher.close();
            // A searcher built from an existing IndexReader does not close it.
            indexReader.close();
            indexFSDir.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
    public static void indexDocs(Directory indexDir) {
        if (indexDir == null || dataDir == null)
            return;
        IndexWriter iw = null;
        try {
            iw = new IndexWriter(indexDir, new IKAnalyzer(), true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            for (File f : dataDir.listFiles()) {
                Document doc = readDocument(f);
                if (doc != null)
                    iw.addDocument(doc);
            }
            iw.close();
            indexDir.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    public static Document readDocument(File file) {
        Document doc = new Document();
        Field c;
        try {
            c = new Field(content, new BufferedReader(new InputStreamReader(
                    new FileInputStream(file), "GBK")));
            doc.add(c);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        Field n = new Field(name, file.getName(), Field.Store.YES,
                Field.Index.NOT_ANALYZED);
        doc.add(n);
        Field p = new Field(filePath, file.getAbsolutePath(), Field.Store.YES,
                Field.Index.NOT_ANALYZED);
        doc.add(p);
        Field id = new Field(indexDate, new Date().toString(), Field.Store.YES,
                Field.Index.NOT_ANALYZED);
        doc.add(id);
        return doc;
    }
}


kafka0102

Posted: 2010-07-25

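Taking about 20 minutes to index more than a gigabyte of data is somewhat slow, but still within what one would expect. That much data obviously cannot be indexed in a matter of seconds; that would not even cover the tokenization time. As for querying, did you run your consecutive tests by adding a for loop inside testSearch? Normally, queries after the first one are faster. If you did not set any JVM startup options, though, it is also possible that the heap is too small and causes frequent GC; you could print the GC information and check.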
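
A minimal sketch of the kind of measurement suggested here: run the search several times inside one JVM and compare the elapsed time with the GC time the JVM reports. The workload below is a hypothetical stand-in, and `SearchTimingDemo`, `timeRuns` and `totalGcMillis` are names invented for this example; to time the real search you would pass `LuceneTest::testSearch` instead. It is written for Java 8 or later.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class SearchTimingDemo {

    /** Total milliseconds the collectors of this JVM report having spent in GC so far. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if this collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Runs the task `runs` times in this JVM and returns each wall-clock time in ms. */
    static long[] timeRuns(Runnable task, int runs) {
        long[] elapsedMs = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsedMs[i] = (System.nanoTime() - start) / 1_000_000L;
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in workload; swap in LuceneTest::testSearch to time the real search.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 200_000; i++) {
                sb.append(i);
            }
        };
        long gcBefore = totalGcMillis();
        long[] times = timeRuns(task, 5);
        long gcAfter = totalGcMillis();
        for (int i = 0; i < times.length; i++) {
            System.out.println("run " + (i + 1) + ": " + times[i] + " ms");
        }
        System.out.println("GC time during runs: " + (gcAfter - gcBefore) + " ms");
    }
}
```

Starting the JVM with `-verbose:gc` prints each collection as it happens, and `-Xmx` raises the maximum heap size if GC turns out to account for a large share of the time.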


chudu

Posted: 2010-07-25

Replying to kafka0102:

OK. One thing still puzzles me: why is the first search noticeably faster?

        long startTime = System.currentTimeMillis();
        try {
            // Build the index; comment this out when only testing search.
            indexDocs(new SimpleFSDirectory(indexDir));
        } catch (IOException e) {
            e.printStackTrace();
        }
        long endTime = System.currentTimeMillis();
        System.out.println("index:-----" + (endTime - startTime));
        // Search
        long startTime1 = System.currentTimeMillis();
        testSearch();
        long endTime1 = System.currentTimeMillis();
        System.out.println("search:----" + (endTime1 - startTime1));

The first search ran immediately after indexing finished; the second and later ones were run with the indexing call commented out.


chudu

Posted: 2010-07-25

To be clear, I did not call testSearch in a for loop. For every measurement after the first, I re-ran main with the indexing call commented out.


anhaoy

Posted: 2010-07-26

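If you open the IndexReader once and keep using it instead of closing it, searches are very fast!
Lucene's index caches are tied to the IndexReader. With an index of several gigabytes and more than a million records, a paged lookup (a few dozen records per page) should take tens of milliseconds, and even less is possible.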
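
The open-once pattern described here can be sketched without any Lucene dependency. `OpenOnce` and `OpenOnceDemo` are hypothetical names, and the opener below just returns a string; with Lucene 3.0 the opener would presumably wrap something like `new IndexSearcher(IndexReader.open(directory, true))` and handle its IOException. Written for Java 8 or later.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Opens an expensive resource on first use and hands back the same instance afterwards. */
class OpenOnce<T> {
    private final Supplier<T> opener;
    private T instance;

    OpenOnce(Supplier<T> opener) {
        this.opener = opener;
    }

    synchronized T get() {
        if (instance == null) {
            instance = opener.get(); // the costly open happens only here
        }
        return instance;
    }
}

class OpenOnceDemo {

    /** Performs `queries` lookups through one holder and returns how many opens occurred. */
    static int opensFor(int queries) {
        AtomicInteger opens = new AtomicInteger();
        // Hypothetical opener standing in for opening an IndexReader/IndexSearcher.
        OpenOnce<String> searcher = new OpenOnce<>(() -> {
            opens.incrementAndGet();
            return "searcher";
        });
        for (int i = 0; i < queries; i++) {
            searcher.get(); // each query reuses the already opened instance
        }
        return opens.get();
    }

    public static void main(String[] args) {
        System.out.println("opens for 100 queries: " + opensFor(100));
    }
}
```

The benchmark in the first post does the opposite: every call to testSearch opens a new reader, so each measured search also pays the full cost of opening the index.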


sw861203

Posted: 2010-07-26

Are you running this as a JUnit test? Normally the first search is slow and the following ones should be fast.


sw861203

Posted: 2010-07-26

What I mean is: test it inside a running application, not inside JUnit.


lishuaibt

Posted: 2010-07-26

            TopDocs tdoc = searcher.search(query, 100);
            for (ScoreDoc scoreDoc : tdoc.scoreDocs) {
                Document doc = searcher.doc(scoreDoc.doc);
            }
            searcher.close();
            indexFSDir.close();

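Opening an IndexReader is a relatively expensive operation. If the index has not changed, there is no need to close() and reopen it this often. Reusing the IndexReader object avoids a lot of unnecessary repeated loading, such as loading the SegmentInfo data.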
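
One way to follow this advice is to keep a single opened view and replace it only when the index version changes. The sketch below is library-independent and uses invented names (`RefreshOnChange`, `opensAfterOneUpdate`, a plain counter as the index version). In Lucene 3.0 the corresponding calls would be `IndexReader.isCurrent()` and `IndexReader.reopen()`, and the old reader would also need to be closed once no search is still using it, which this sketch leaves out. Written for Java 8 or later.

```java
import java.util.function.LongFunction;
import java.util.function.LongSupplier;

/** Keeps one opened view of an index and replaces it only when the index version changes. */
class RefreshOnChange<T> {
    private final LongSupplier currentVersion; // reports the version of the index on disk
    private final LongFunction<T> opener;      // opens a view for a given version
    private long openedVersion;
    private T view;
    private int openCount;

    RefreshOnChange(LongSupplier currentVersion, LongFunction<T> opener) {
        this.currentVersion = currentVersion;
        this.opener = opener;
    }

    synchronized T get() {
        long version = currentVersion.getAsLong();
        if (view == null || version != openedVersion) {
            view = opener.apply(version); // expensive: only on first use or after a change
            openedVersion = version;
            openCount++;
        }
        return view;
    }

    synchronized int openCount() {
        return openCount;
    }
}

class RefreshOnChangeDemo {

    /** Four lookups with one index update in between; returns the number of opens. */
    static int opensAfterOneUpdate() {
        long[] version = {1}; // hypothetical index version counter
        RefreshOnChange<String> holder =
                new RefreshOnChange<>(() -> version[0], v -> "view@" + v);
        holder.get();
        holder.get();   // index unchanged: the first view is reused
        version[0] = 2; // simulate an index update
        holder.get();   // version changed: a new view is opened
        holder.get();
        return holder.openCount();
    }

    public static void main(String[] args) {
        System.out.println("opens: " + opensAfterOneUpdate());
    }
}
```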
