My Lucene performance test, and a question about its search speed; advice from experts welcome

chudu


lucene Apache junit memcached F#

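I recently started learning Lucene and read a number of performance tests online. Many of them are copies of one another, and all claim that searching 1 million records takes 0.x seconds. That seemed fast enough for the project I am working on, but to be cautious I ran my own performance test. I could not reproduce the speeds reported online, which puzzles me, so I am posting my code and environment details.

     1. My search time does not reach the 0.x-second level. Is something wrong with my code?

     2. Why does the first search after building the index usually finish in milliseconds, while every later search takes seconds?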

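Test environment:

     CPU: P8600; memory: 4 GB; total source file size: 1.18 GB; number of files: 6938.

     Fields stored but not analyzed: file name, file path, time. Analyzed but not stored: file content. Lucene 3.0.0 with the IKAnalyzer3.2 analyzer.

Test results: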

mergeFactor   maxBufferedDocs  Analyzer       Index size (MB)  Index time (ms)  Search time (ms), runs 1-5    Hits requested
10 (default)  10 (default)     IKAnalyzer3.2  773              1204218          641, 1266, 1235, 1266, 1250   100
100           10 (default)     IKAnalyzer3.2  782              1142031          657, 1344, 1328, 1343, 1328   100
1000          10 (default)     IKAnalyzer3.2  782              1143469          672, 1359, 1313, 1344, 1344   100
10 (default)  100              IKAnalyzer3.2  772              1074516          609, 1281, 1282, 1266, 1282   100
10 (default)  1000             IKAnalyzer3.2  773              1124516          593, 1297, 1281, 1297, 1281   100
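To put the indexing times in perspective, the first row can be converted into a throughput figure. The sketch below is only arithmetic on the numbers reported above; the class name `IndexThroughput` is made up for illustration, and treating 1.18G as 1.18 * 1024 MB is an assumption.

```java
// Hypothetical helper: converts the reported index time into throughput figures.
final class IndexThroughput {
    /** Megabytes of source text indexed per second. */
    static double mbPerSecond(double sourceMb, long indexMillis) {
        return sourceMb / (indexMillis / 1000.0);
    }

    /** Source files indexed per second. */
    static double filesPerSecond(int files, long indexMillis) {
        return files / (indexMillis / 1000.0);
    }

    public static void main(String[] args) {
        double sourceMb = 1.18 * 1024;   // assumption: 1.18G means 1.18 * 1024 MB
        long indexMillis = 1_204_218L;   // first table row
        int files = 6938;                // from the environment description

        System.out.printf("indexing took %.1f min%n", indexMillis / 60000.0);
        System.out.printf("throughput: %.2f MB/s, %.2f files/s%n",
                mbPerSecond(sourceMb, indexMillis),
                filesPerSecond(files, indexMillis));
    }
}
```

Under that assumption the run works out to roughly 20 minutes at about 1 MB/s, and the other rows differ from it by only a few percent.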

The code:

package org.lucene.mytest;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.Date;
import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.SimpleFSDirectory;
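import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer; // IKAnalyzer 3.x class; this import was missing from the listing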
public class LuceneTest {
    static File dataDir = new File("D:\\study\\lucene\\data");
    static File indexDir = new File("D:\\study\\lucene\\index");
    static String content = "content";
    static String name = "fileName";
    static String indexDate = "indexDate";
    static String filePath = "filePath";
    public static void main(String[] args) {
         long startTime = System.currentTimeMillis();
         try {
         // Indexing; comment this out when testing search only.
         indexDocs(new SimpleFSDirectory(indexDir));
         } catch (IOException e) {
         e.printStackTrace();
         }
         long endTime = System.currentTimeMillis();
         System.out.println("index:-----" + (endTime - startTime));
        // Search
        long startTime1 = System.currentTimeMillis();
        testSearch();
        long endTime1 = System.currentTimeMillis();
        System.out.println("search:----" + (endTime1 - startTime1));
    }
    public static void testSearch() {
        try {
            Directory indexFSDir = new SimpleFSDirectory(indexDir);
            IndexReader indexReader = IndexReader.open(indexFSDir, true);
            Searcher searcher = new IndexSearcher(indexReader);
            QueryParser parser = new QueryParser(Version.LUCENE_CURRENT,
                    content, new IKAnalyzer());
            Query query = parser.parse("经受住了大战和时间的洗礼");
            TopDocs tdoc = searcher.search(query, 100);
            for (ScoreDoc scoreDoc : tdoc.scoreDocs) {
                Document doc = searcher.doc(scoreDoc.doc);
            }
            searcher.close();
            indexFSDir.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
    public static void indexDocs(Directory indexDir) {
        if (indexDir == null || dataDir == null)
            return;
        IndexWriter iw = null;
        try {
            iw = new IndexWriter(indexDir, new IKAnalyzer(), true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            for (File f : dataDir.listFiles()) {
                Document doc = readDocument(f);
                if (doc != null)
                    iw.addDocument(doc);
            }
            iw.close();
            indexDir.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    public static Document readDocument(File file) {
        Document doc = new Document();
        Field c;
        try {
            c = new Field(content, new BufferedReader(new InputStreamReader(
                    new FileInputStream(file), "GBK")));
            doc.add(c);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        Field n = new Field(name, file.getName(), Field.Store.YES,
                Field.Index.NOT_ANALYZED);
        doc.add(n);
        Field p = new Field(filePath, file.getAbsolutePath(), Field.Store.YES,
                Field.Index.NOT_ANALYZED);
        doc.add(p);
        Field id = new Field(indexDate, new Date().toString(), Field.Store.YES,
                Field.Index.NOT_ANALYZED);
        doc.add(id);
        return doc;
    }
}
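Note that the `search:----` figure printed by `main` covers everything inside `testSearch()`: opening the directory and `IndexReader`, building the analyzer and parser, the query itself, and loading 100 stored documents. It is also a single sample per JVM start. A minimal sketch of a harness that times only the operation of interest, after a few warm-up calls, could look like this. `SearchBench` and `medianNanos` are hypothetical names, and the loop in `main` is a stand-in workload, not a Lucene call.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: times only the operation passed in,
// after warm-up runs, and reports the median of several measurements.
final class SearchBench {
    /** Returns the median wall-clock time in nanoseconds over `runs` calls of `op`. */
    static long medianNanos(Runnable op, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            op.run();                       // warm-up: class loading, JIT, caches
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            op.run();
            samples[i] = System.nanoTime() - t0;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        // Stand-in workload. In the real test this would be a single
        // searcher.search(query, 100) call on a searcher opened beforehand.
        long[] sink = new long[1];
        Runnable work = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i % 7;
            }
            sink[0] = sum;
        };
        long median = medianNanos(work, 5, 11);
        System.out.println("median ms: " + median / 1_000_000.0);
    }
}
```

Measured this way, the one-time setup cost and the per-query cost show up as separate numbers instead of being folded into one figure.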

2010-07-24 22:44

#17 srdrm 2010-08-04

You are probably still not using Lucene correctly; it should not be this slow.

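#16 harbey 2010-07-27

Personally I would not chase speed too hard. Lucene is a mature solution that makes application development much easier. You could run the program under JProfiler and look at the performance figures. For search, you can also add a cache (for example Ehcache or memcached) to speed up queries.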

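#15 dukai1008 2010-07-27

I also want to use it in my own project. I have tested it and it was fast, though not yet with more than a gigabyte of data. I will try that tonight and follow up tomorrow.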

#14 lishuaibt 2010-07-27

Could it be I/O caching? The operating system's cache...
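One way to probe the operating-system cache hypothesis is to time repeated reads of the same file. The sketch below is an assumption-laden illustration, not a diagnosis: `PageCacheProbe` is a hypothetical name, and the file to read (for example one of the index files) is passed as the first argument. Without an argument it falls back to a temporary file, which has just been written and is therefore probably already cached, much like an index that was built a moment ago.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical probe: reads a file several times and prints the time of each pass.
final class PageCacheProbe {
    /** Reads the whole file and returns {bytesRead, elapsedNanos}. */
    static long[] timedRead(Path file) throws IOException {
        long t0 = System.nanoTime();
        long bytes = 0;
        byte[] buf = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                bytes += n;
            }
        }
        return new long[] { bytes, System.nanoTime() - t0 };
    }

    public static void main(String[] args) throws IOException {
        Path file;
        if (args.length > 0) {
            file = Paths.get(args[0]);          // e.g. a file inside the index directory
        } else {
            file = Files.createTempFile("probe", ".bin");
            Files.write(file, new byte[8 * 1024 * 1024]);
            file.toFile().deleteOnExit();
        }
        for (int run = 1; run <= 3; run++) {
            long[] r = timedRead(file);
            System.out.println("run " + run + ": " + r[0] + " bytes in "
                    + r[1] / 1_000_000.0 + " ms");
        }
    }
}
```

If the first pass over a file that has not been touched recently is much slower than the following passes, disk caching is at least part of the picture.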

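#13 chudu 2010-07-26

yangfuchao418 wrote:

I am not sure about the original poster's test. I copied your code and ran it on a machine at work: I created the two folders first, then ran it as you described. The difference was only a few dozen milliseconds.

Really? That is odd.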

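#12 yangfuchao418 2010-07-26

I am not sure about the original poster's test. I copied your code and ran it on a machine at work: I created the two folders first, then ran it as you described. The difference was only a few dozen milliseconds.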

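#11 chudu 2010-07-26

lishuaibt wrote:

TopDocs tdoc = searcher.search(query, 100);
for (ScoreDoc scoreDoc : tdoc.scoreDocs) {
    Document doc = searcher.doc(scoreDoc.doc);
}
searcher.close();
indexFSDir.close();

Opening an IndexReader is a relatively expensive operation. If the index has not changed, there is no need to close() and reopen it this often. Reusing the IndexReader object avoids a lot of unnecessary reloading, such as loading the segment information.

I am not running the search in a for loop, as the code shows.

When the search method runs immediately after indexing, the search is fast. When I comment out the indexing call and run main again, the search is slow. You say that opening the IndexReader is expensive, but I open and close it the same way every time, so the times should fall in the same range rather than differ this much.

Why is the first run so much faster than the later ones? I have run it more than ten times and it is always like this.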

#10 forchenyun 2010-07-26

Following this thread...

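#9 ccx007 2010-07-26

Make the IndexReader and IndexSearcher global. Do not close them after every query; reopening the index is expensive, and a real application would not close the searcher after each search either.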
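The "open once, reuse for every query" pattern can be sketched without any Lucene dependency. `SharedResource` below is a hypothetical holder class; in the poster's program the supplier would wrap the `new IndexSearcher(IndexReader.open(dir, true))` calls from the listing, with the `IOException` handled. It only helps within one running JVM, so it does not change anything when `main` is restarted for each measurement.

```java
import java.util.function.Supplier;

// Hypothetical holder: opens an expensive resource on first use and then
// hands back the same instance to every caller.
final class SharedResource<T> {
    private final Supplier<T> opener;
    private T instance;
    private int openCount;

    SharedResource(Supplier<T> opener) {
        this.opener = opener;
    }

    /** Returns the shared instance, opening it only on the first call. */
    synchronized T get() {
        if (instance == null) {
            instance = opener.get();
            openCount++;
        }
        return instance;
    }

    /** Number of times the opener has actually been invoked. */
    synchronized int openCount() {
        return openCount;
    }

    public static void main(String[] args) {
        // StringBuilder is a stand-in for an expensive object such as a searcher.
        SharedResource<StringBuilder> shared =
                new SharedResource<>(() -> new StringBuilder("opened"));
        for (int i = 0; i < 3; i++) {
            shared.get();                   // three "queries", one open
        }
        System.out.println("opens: " + shared.openCount()); // prints "opens: 1"
    }
}
```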

#8 lzj0470 2010-07-26

It should not be that slow. I have more than two million text files of about 10 KB each, and searching is fast, generally around 1 second.

#7 lishuaibt 2010-07-26

#6 sw861203 2010-07-26

What I mean is: run the test inside the actual project, not inside JUnit.

#5 sw861203 2010-07-26

Are you running the test in JUnit? Normally the first search is slow and the later ones are fast.

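#4 anhaoy 2010-07-26

If you keep the IndexReader open and keep using it, searching is fast!
Lucene caches index data, and that cache is tied to the IndexReader. With a gigabyte-scale index and more than a million records, a query (paged, a few dozen records per page) should take tens of milliseconds, and possibly less.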

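#3 chudu 2010-07-25

chudu wrote:

kafka0102 wrote:

Taking about 20 minutes to index more than a gigabyte of data is somewhat slow, but still within what I would expect. That much data cannot be indexed in seconds; even the analysis alone takes longer than that. As for the queries, did you run your consecutive tests by adding a for loop inside testSearch? Usually the queries after the first one are faster. If you did not set any JVM startup options, it is also possible that memory is short and the JVM is garbage-collecting frequently; you could print the GC information to check.

Yes, and one more thing puzzles me:
why is the first search so noticeably faster?

long startTime = System.currentTimeMillis();
try {
    // Indexing; comment this out when testing search only.
    indexDocs(new SimpleFSDirectory(indexDir));
} catch (IOException e) {
    e.printStackTrace();
}
long endTime = System.currentTimeMillis();
System.out.println("index:-----" + (endTime - startTime));
// Search
long startTime1 = System.currentTimeMillis();
testSearch();
long endTime1 = System.currentTimeMillis();
System.out.println("search:----" + (endTime1 - startTime1));

The first search runs directly after indexing finishes. The second and later runs are done with the indexing call commented out. I am not calling testSearch in a for loop; I rerun main each time.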
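kafka0102's suggestion to look at GC activity can be tried either by starting the JVM with the `-verbose:gc` option or from inside the program through the standard `java.lang.management` beans. The sketch below takes the second route; `GcReport` is a hypothetical class name, and the allocation loop in `main` is only a stand-in for the indexing or search call being examined.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: sums GC counts and times across all collectors of this JVM.
final class GcReport {
    /** Total number of collections so far; undefined (-1) values are skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long countBefore = totalCollections();
        long millisBefore = totalCollectionMillis();

        // Stand-in workload that produces garbage; replace with the call under test.
        byte[][] junk = new byte[64][];
        for (int i = 0; i < 20_000; i++) {
            junk[i % 64] = new byte[64 * 1024];
        }

        System.out.println("collections during workload: "
                + (totalCollections() - countBefore));
        System.out.println("GC time during workload (ms): "
                + (totalCollectionMillis() - millisBefore));
    }
}
```

If the GC time is a small fraction of the measured search time, memory pressure is unlikely to explain the gap between the first run and the later ones.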

#2 chudu 2010-07-25

kafka0102 wrote:

Yes, and one more thing puzzles me:
why is the first search so noticeably faster?

long startTime = System.currentTimeMillis();
try {
    // Indexing; comment this out when testing search only.
    indexDocs(new SimpleFSDirectory(indexDir));
} catch (IOException e) {
    e.printStackTrace();
}
long endTime = System.currentTimeMillis();
System.out.println("index:-----" + (endTime - startTime));
// Search
long startTime1 = System.currentTimeMillis();
testSearch();
long endTime1 = System.currentTimeMillis();
System.out.println("search:----" + (endTime1 - startTime1));

The first search runs directly after indexing finishes. The second and later runs are done with the indexing call commented out.

#1 kafka0102 2010-07-25
