Customizing Ranking in Lucene

jeafyezheng

sdf_hz

Changing Similarity (the similarity computation)

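DefaultSimilarity covers most ordinary search needs. In some applications, though, you may want to write your own Similarity to fit your requirements. For example, some people see no reason why a short document should score higher just because it is short (see a "fair" similarity).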

A Similarity change has to be applied to both indexing and searching, and it must be in place before either of them happens.

To use your own Similarity instead of DefaultSimilarity, call IndexWriter.setSimilarity before building the index and Searcher.setSimilarity before searching.

If you want to know how other people have changed Similarity, see the Overriding Similarity thread on the Lucene mailing list. In summary, the common changes are:

SweetSpotSimilarity -- SweetSpotSimilarity gives small increases as the frequency increases a small amount and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.
Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these cases people have overridden Similarity to return 1 from the tf() method.
Changing Length Normalization -- By overriding lengthNorm, it is possible to discount how the length of a field contributes to a score. In DefaultSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated "fairly".
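
The last two changes in the list are easy to try out in plain Java. The class below is only a sketch and does not use Lucene's Similarity API: `SimilaritySketch` and its method names are hypothetical, and the only things taken from the list above are the two lengthNorm formulas and the constant tf.

```java
/** Hypothetical stand-in for the formulas above; this is not Lucene's Similarity class. */
class SimilaritySketch {

    /** Length norm as described for DefaultSimilarity: 1 / (numTerms)^0.5. */
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Alternative from the list: 1 / numTerms. */
    static float fairLengthNorm(int numTerms) {
        return 1.0f / numTerms;
    }

    /** "Overriding tf": every matching term contributes the same, whatever its frequency. */
    static float constantTf(float freq) {
        return 1.0f;
    }

    public static void main(String[] args) {
        // Compare both norms for a short field and a long field.
        for (int numTerms : new int[] {4, 100}) {
            System.out.println(numTerms + " terms: default=" + defaultLengthNorm(numTerms)
                    + " alternative=" + fairLengthNorm(numTerms));
        }
        System.out.println("constant tf for freq 7: " + constantTf(7.0f));
    }
}
```

Running `main` prints both norms side by side for a short and a long field, which makes it easier to judge how strongly each formula reacts to field length before touching a real index.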

Whenever you know more about your data than a general-purpose formula does, it may make sense to override the Similarity methods yourself.

Customizing the Scoring System (Expert Level)

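Changing the scoring system is expert-level work, so proceed carefully and discuss your changes with others as you go. In Lucene, changing scoring affects the results even more than changing Similarity does. Lucene's scoring is a fairly complex mechanism, implemented mainly by the following three classes: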

Query -- The abstract object representation of the user's information need.
Weight -- The internal interface representation of the user's Query, so that Query objects may be reused.
Scorer -- An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities.

I will now go through these three classes one by one:

The Query Class

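In a sense, Query is where scoring begins: without a query there is nothing to score. More importantly, it is the catalyst for the rest of the scoring machinery, because it creates the other scoring classes and ties them together. Query has some important methods that subclasses can override: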

createWeight(Searcher searcher) -- A Weight is the internal representation of the Query, so each Query implementation must provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight interface.
rewrite(IndexReader reader) -- Rewrites queries into primitive queries. Primitive queries are: TermQuery, BooleanQuery, OTHERS????

The Weight Interface

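The Weight interface provides an internal representation of the Query so that the Query object can be reused. Any state that depends on the Searcher should be stored in the Weight implementation, not in the Query class. The interface defines six methods that must be implemented: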

Weight#getQuery() -- Pointer to the Query that this Weight represents.
Weight#getValue() -- The weight for this Query. For example, the TermQuery.TermWeight value is equal to idf^2 * boost * queryNorm.
Weight#sumOfSquaredWeights() -- The sum of squared weights. For TermQuery, this is (idf * boost)^2.
Weight#normalize(float) -- Determine the query normalization factor. The query normalization may allow for comparing scores between queries.
Weight#scorer(IndexReader) -- Construct a new Scorer for this Weight. See The Scorer Class below for help defining a Scorer. As the name implies, the Scorer is responsible for doing the actual scoring of documents given the Query.
Weight#explain(IndexReader, int) -- Provide a means for explaining why a given document was scored the way it was.
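
The arithmetic behind getValue(), sumOfSquaredWeights() and normalize(float) for a single term can be sketched as follows. This is a minimal illustration and not Lucene's Weight interface: `TermWeightSketch` is a hypothetical class, and the `queryNorm` helper assumes the factor is 1 / sqrt(sumOfSquaredWeights), which is how I understand DefaultSimilarity computes it.

```java
/** Hypothetical sketch of the TermQuery weight arithmetic; this is not Lucene's Weight interface. */
class TermWeightSketch {
    private final float idf;
    private final float boost;
    private float value; // final weight, only meaningful after normalize()

    TermWeightSketch(float idf, float boost) {
        this.idf = idf;
        this.boost = boost;
    }

    /** (idf * boost)^2, as listed for TermQuery. */
    float sumOfSquaredWeights() {
        float queryWeight = idf * boost;
        return queryWeight * queryWeight;
    }

    /** Applies the normalization factor, giving idf^2 * boost * queryNorm. */
    void normalize(float queryNorm) {
        value = idf * idf * boost * queryNorm;
    }

    float getValue() {
        return value;
    }

    /** Assumption: normalization factor of 1 / sqrt(sumOfSquaredWeights). */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        TermWeightSketch weight = new TermWeightSketch(2.0f, 1.5f);
        float sum = weight.sumOfSquaredWeights();
        weight.normalize(queryNorm(sum));
        System.out.println("sumOfSquaredWeights=" + sum + " value=" + weight.getValue());
    }
}
```

The call order in `main` (sum the squared weights first, then normalize, then read the value) follows the order in which the methods are listed above. With several terms in one query, the sums would be added together before the normalization factor is computed.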

The Scorer Class

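Scorer is the abstract class for scoring. It provides the basic scoring functionality shared by all scorer implementations and is the core of Lucene's scoring mechanism. Scorer defines the following methods, which must be implemented: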

Scorer#next() -- Advances to the next document that matches this Query, returning true if and only if there is another document that matches.
Scorer#doc() -- Returns the id of the Document that contains the match. Is not valid until next() has been called at least once.
Scorer#score() -- Return the score of the current document. This value can be determined in any appropriate way for an application. For instance, the TermScorer returns the tf * Weight.getValue() * fieldNorm.
Scorer#skipTo(int) -- Skip ahead in the document matches to the document whose id is greater than or equal to the passed in value. In many instances, skipTo can be implemented more efficiently than simply looping through all the matching documents until the target document is identified.
Scorer#explain(int) -- Provides details on why the score came about.
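
To make the iteration contract concrete, here is a small sketch of next(), doc(), score() and skipTo(int) over an in-memory list of matches. It is not Lucene's Scorer class: `TermScorerSketch`, its constructor and its arrays are hypothetical, tf is assumed to be sqrt(freq), and skipTo is written so that it only looks past the current position.

```java
/** Hypothetical scorer over sorted in-memory matches; this is not Lucene's Scorer class. */
class TermScorerSketch {
    private final int[] docs;         // ids of matching documents, ascending
    private final float[] freqs;      // term frequency in each matching document
    private final float[] fieldNorms; // field norm of each matching document
    private final float weightValue;  // what Weight.getValue() would supply
    private int pos = -1;             // positioned before the first match

    TermScorerSketch(int[] docs, float[] freqs, float[] fieldNorms, float weightValue) {
        this.docs = docs;
        this.freqs = freqs;
        this.fieldNorms = fieldNorms;
        this.weightValue = weightValue;
    }

    /** Advances to the next match; returns false once the matches are exhausted. */
    boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    /** Id of the current match; only valid after next() or skipTo() returned true. */
    int doc() {
        return docs[pos];
    }

    /** tf * weight value * fieldNorm, with tf = sqrt(freq) as an assumption. */
    float score() {
        float tf = (float) Math.sqrt(freqs[pos]);
        return tf * weightValue * fieldNorms[pos];
    }

    /** Moves to the first later match with id >= target, by binary search rather than a linear scan. */
    boolean skipTo(int target) {
        int lo = Math.min(pos + 1, docs.length);
        int hi = docs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        pos = lo;
        return pos < docs.length;
    }

    public static void main(String[] args) {
        TermScorerSketch scorer = new TermScorerSketch(
                new int[] {2, 5, 9, 14},
                new float[] {1f, 4f, 9f, 1f},
                new float[] {1f, 0.5f, 0.25f, 1f},
                2.0f);
        while (scorer.next()) {
            System.out.println("doc=" + scorer.doc() + " score=" + scorer.score());
        }
    }
}
```

The binary search in `skipTo` is one way to get the efficiency the description mentions. A real postings list would normally rely on skip data stored in the index instead of an array in memory.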