
Advanced search techniques in Lucene 3.6.0

 

Advanced search techniques:

Sorting: by default, results are sorted by relevance. Sort.java:
public class Sort
implements Serializable {

  /**
   * Represents sorting by computed relevance. Using this sort criteria returns
   * the same results as calling
   * {@link Searcher#search(Query,int) Searcher#search()}without a sort criteria,
   * only with slightly more overhead.
   */
  public static final Sort RELEVANCE = new Sort();  // relevance

  /** Represents sorting by index order. */
  public static final Sort INDEXORDER = new Sort(SortField.FIELD_DOC);  // index order, which differs from relevance order

  /** Sorts by the criteria in the given SortField. */
  public Sort(SortField field) {
    setSort(field);
  }

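When a sort field is specified, documents with equal values in that field are then ordered by index order (document number).

Sort fields: SortField.java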
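
The tie-breaking rule above can be illustrated without Lucene. The sketch below is plain Java: `Hit` and `sortByTitle` are hypothetical names introduced only for this example, and the code models the ordering rule rather than Lucene's collector implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortTieBreakSketch {
  // Hypothetical stand-in for a search hit; this is not a Lucene class.
  static class Hit {
    final int doc;       // document number, i.e. index order
    final String title;  // value of the field being sorted on
    Hit(int doc, String title) { this.doc = doc; this.title = title; }
  }

  /** Orders hits by the sort field, falling back to document number on ties. */
  static List<Hit> sortByTitle(List<Hit> hits) {
    List<Hit> sorted = new ArrayList<Hit>(hits);
    sorted.sort(Comparator.comparing((Hit h) -> h.title)
                          .thenComparingInt(h -> h.doc));
    return sorted;
  }

  public static void main(String[] args) {
    List<Hit> hits = new ArrayList<Hit>();
    hits.add(new Hit(7, "lucene"));
    hits.add(new Hit(2, "lucene"));
    hits.add(new Hit(5, "ant"));
    for (Hit h : sortByTitle(hits)) {
      System.out.println(h.doc + " " + h.title);
    }
  }
}
```

The two hits titled "lucene" come out with document 2 before document 7, because the lower document number wins when the field values are equal.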


 /** Represents sorting by document score (relevance). */
  public static final SortField FIELD_SCORE = new SortField(null, SCORE);

  /** Represents sorting by document number (index order). */
  public static final SortField FIELD_DOC = new SortField(null, DOC);

  private String field;
  private int type;  // defaults to determining type dynamically
  private Locale locale;    // defaults to "natural order" (no Locale)
  boolean reverse = false;  // defaults to natural order
  private FieldCache.Parser parser;

  // Used for CUSTOM sort
  private FieldComparatorSource comparatorSource;

  private Object missingValue;
  /** Creates a sort, possibly in reverse, by terms in the given field with the
   * type of term values explicitly given.
   * @param field  Name of field to sort by.  Can be <code>null</code> if
   *               <code>type</code> is SCORE or DOC.
   * @param type   Type of values in the terms.
   * @param reverse True if natural order should be reversed.
   */
  public SortField(String field, int type, boolean reverse) {
    initFieldType(field, type);
    this.reverse = reverse;
  }

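PhraseQuery.java lets more than one term be added at the same position.
MultiPhraseQuery.java is a generalized version of PhraseQuery that accepts several alternative terms per position, which is what synonym-style queries need.
PhraseQuery.java: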
  /**
   * Adds a term to the end of the query phrase.
   * The relative position of the term within the phrase is specified explicitly.
   * This allows e.g. phrases with more than one term at the same position
   * or phrases with gaps (e.g. in connection with stopwords).
   * 
   * @param term
   * @param position
   */
  public void add(Term term, int position) {
      if (terms.size() == 0)
          field = term.field();
      else if (term.field() != field)
          throw new IllegalArgumentException("All phrase terms must be in the same field: " + term);

      terms.add(term);
      positions.add(Integer.valueOf(position));
      if (position > maxPosition) maxPosition = position;
  }

Slop is also supported:
  /** Sets the number of other words permitted between words in query phrase.
    If zero, then this is an exact phrase search.  For larger values this works
    like a <code>WITHIN</code> or <code>NEAR</code> operator.

    <p>The slop is in fact an edit-distance, where the units correspond to
    moves of terms in the query phrase out of position.  For example, to switch
    the order of two words requires two moves (the first move places the words
    atop one another), so to permit re-orderings of phrases, the slop must be
    at least two.

    <p>More exact matches are scored higher than sloppier matches, thus search
    results are sorted by exactness.

    <p>The slop is zero by default, requiring exact matches.*/
  public void setSlop(int s) { slop = s; }
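
The edit-distance description in the Javadoc above can be made concrete with a small calculation. The following is a simplified model in plain Java, not Lucene's scorer code: `requiredSlop` is a hypothetical helper that assumes each query term has already been paired with one position in the document, and it measures how far the terms have drifted relative to each other.

```java
class SlopSketch {
  /**
   * queryPos[i] is the position of term i inside the query phrase;
   * docPos[i] is the position at which term i was found in the document.
   * Returns the smallest slop under which this pairing counts as a match
   * in this simplified model.
   */
  static int requiredSlop(int[] queryPos, int[] docPos) {
    int min = Integer.MAX_VALUE;
    int max = Integer.MIN_VALUE;
    for (int i = 0; i < queryPos.length; i++) {
      int shift = docPos[i] - queryPos[i];  // how far term i moved
      min = Math.min(min, shift);
      max = Math.max(max, shift);
    }
    return max - min;  // 0 means every term moved by the same amount
  }

  public static void main(String[] args) {
    int[] query = {0, 1};  // phrase "a b"
    System.out.println(requiredSlop(query, new int[] {3, 4}));  // "a b" found in order
    System.out.println(requiredSlop(query, new int[] {0, 2}));  // "a x b"
    System.out.println(requiredSlop(query, new int[] {4, 3}));  // "b a"
  }
}
```

In this model an exact match needs slop 0, one intervening word needs slop 1, and two swapped words need slop 2, which agrees with the "at least two" remark in the Javadoc.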

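Querying across multiple fields: MultiFieldQueryParser.java

Span queries: SpanQuery.java. Besides the matching documents, a span query also reports where each match occurs, so the same term can match at several positions.
  Spans near the start of a field: SpanFirstQuery.java looks for a value within the first end tokens of the field, where end is the constructor argument.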
   /** Matches spans near the beginning of a field.
 * <p/> 
 * This class is a simple extension of {@link SpanPositionRangeQuery} in that it assumes the
 * start to be zero and only checks the end boundary.
 *
 *
 *  */
public class SpanFirstQuery extends SpanPositionRangeQuery {

  /** Construct a SpanFirstQuery matching spans in <code>match</code> whose end
   * position is less than or equal to <code>end</code>. */
  public SpanFirstQuery(SpanQuery match, int end) {
    super(match, 0, end);
  }
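
As a rough illustration of the end boundary, here is a plain-Java sketch for the single-term case. It assumes a term at position p occupies the span from p to p + 1, so the match is accepted when p + 1 <= end; `matchesFirst` is a hypothetical helper, not part of Lucene.

```java
class SpanFirstSketch {
  /**
   * Returns true if term occurs in tokens at a position whose span end
   * (position + 1) is less than or equal to end.
   */
  static boolean matchesFirst(String[] tokens, String term, int end) {
    for (int pos = 0; pos < tokens.length && pos + 1 <= end; pos++) {
      if (tokens[pos].equals(term)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String[] tokens = {"the", "quick", "brown", "fox"};
    System.out.println(matchesFirst(tokens, "brown", 3));
    System.out.println(matchesFirst(tokens, "brown", 2));
  }
}
```

With end set to 3 the term "brown" (position 2) is accepted, while with end set to 2 it is rejected.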

Spans near one another: SpanNearQuery.java

/** Matches spans which are near one another.  One can specify <i>slop</i>, the
 * maximum number of intervening unmatched positions, as well as whether
 * matches are required to be in-order. */
public class SpanNearQuery extends SpanQuery implements Cloneable {
  protected List<SpanQuery> clauses;
  protected int slop;
  protected boolean inOrder;

  protected String field;
  private boolean collectPayloads;

  /** Construct a SpanNearQuery.  Matches spans matching a span from each
   * clause, with up to <code>slop</code> total unmatched positions between
   * them.  When <code>inOrder</code> is true, the spans from each clause
   * must be ordered as in <code>clauses</code>.
   * @param clauses the clauses to find near each other
   * @param slop The slop value
   * @param inOrder true if order is important
   */
  public SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder) {
    this(clauses, slop, inOrder, true);     
  }

Excluding overlapping spans: SpanNotQuery.java
/** Removes matches which overlap with another SpanQuery. */
public class SpanNotQuery extends SpanQuery implements Cloneable {
  private SpanQuery include;
  private SpanQuery exclude;

Union of span queries: SpanOrQuery.java
/** Matches the union of its clauses.*/
public class SpanOrQuery extends SpanQuery implements Cloneable {
  private List<SpanQuery> clauses;
  private String field;

  /** Construct a SpanOrQuery merging the provided clauses. */
  public SpanOrQuery(SpanQuery... clauses) {

    // copy clauses array into an ArrayList
    this.clauses = new ArrayList<SpanQuery>(clauses.length);
    for (int i = 0; i < clauses.length; i++) {
      addClause(clauses[i]);
    }
  }

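Filters
CachingWrapperFilter.java caches the result of the wrapped filter the first time it is used, so later searches can reuse it.
QueryWrapperFilter.java turns the results of a query into the set of documents available to a subsequent search.
TermRangeFilter.java further restricts search results to a range of term values.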
/**
 * A Filter that restricts search results to a range of term
 * values in a given field.
 *
 * <p>This filter matches the documents looking for terms that fall into the
 * supplied range according to {@link
 * String#compareTo(String)}, unless a <code>Collator</code> is provided. It is not intended
 * for numerical ranges; use {@link NumericRangeFilter} instead.
 *
 * <p>If you construct a large number of range filters with different ranges but on the 
 * same field, {@link FieldCacheRangeFilter} may have significantly better performance. 
 * @since 2.9
 */
public class TermRangeFilter extends MultiTermQueryWrapperFilter<TermRangeQuery> {

Custom security filter: restricts the searchable documents to those inside a given user's data space.
ChainedFilter.java: a chain of filters.
FilteredQuery.java
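
Conceptually, each filter yields a set of permitted document numbers, and chaining filters combines those sets. The sketch below models that idea with `java.util.BitSet` only: `bits` and `and` are hypothetical helpers, and the code shows the AND combination, not the actual ChainedFilter implementation.

```java
import java.util.BitSet;

class FilterChainSketch {
  /** Builds a bit set with one bit set per permitted document number. */
  static BitSet bits(int... docs) {
    BitSet set = new BitSet();
    for (int doc : docs) {
      set.set(doc);
    }
    return set;
  }

  /** Intersects all filters; expects at least one filter. */
  static BitSet and(BitSet... filters) {
    BitSet result = (BitSet) filters[0].clone();  // leave the inputs untouched
    for (int i = 1; i < filters.length; i++) {
      result.and(filters[i]);
    }
    return result;
  }

  public static void main(String[] args) {
    BitSet ownedByUser = bits(0, 1, 2, 5);  // e.g. a security filter
    BitSet inTermRange = bits(1, 2, 3);     // e.g. a range filter
    System.out.println(and(ownedByUser, inTermRange));
  }
}
```

Only documents 1 and 2 pass both filters here. Because each filter's result is just a set of bits, caching it (the idea behind CachingWrapperFilter) makes repeated searches cheap.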

Searching multiple indexes: Lucene 3.6.0 recommends MultiReader.java in place of the deprecated MultiSearcher.java.

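Multithreaded search; remote search across multiple indexes.
ParallelReader.java reads multiple parallel indexes that hold different fields of the same documents: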
/** An IndexReader which reads multiple, parallel indexes.  Each index added
 * must have the same number of documents, but typically each contains
 * different fields.  Each document contains the union of the fields of all
 * documents with the same document number.  When searching, matches for a
 * query term are from the first index added that has the field.
 *
 * <p>This is useful, e.g., with collections that have large fields which
 * change rarely and small fields that change more frequently.  The smaller
 * fields may be re-indexed in a new index and both indexes may be searched
 * together.
 *
 * <p><strong>Warning:</strong> It is up to you to make sure all indexes
 * are created and modified the same way. For example, if you add
 * documents to one index, you need to add the same documents in the
 * same order to the other indexes. <em>Failure to do so will result in
 * undefined behavior</em>.
 */
public class ParallelReader extends IndexReader {


Term vectors

IndexReader.java
  /**
   * Return an array of term frequency vectors for the specified document.
   * The array contains a vector for each vectorized field in the document.
   * Each vector contains terms and frequencies for all terms in a given vectorized field.
   * If no such fields existed, the method returns null. The term vectors that are
   * returned may either be of type {@link TermFreqVector}
   * or of type {@link TermPositionVector} if
   * positions or offsets have been stored.
   * 
   * @param docNumber document for which term frequency vectors are returned
   * @return array of term frequency vectors. May be null if no term vectors have been
   *  stored for the specified document.
   * @throws IOException if index cannot be accessed
   * @see org.apache.lucene.document.Field.TermVector
   */
  abstract public TermFreqVector[] getTermFreqVectors(int docNumber)
          throws IOException;


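From the term vectors of a document's fields you can compute the similarity between documents, which makes similar-document queries and recommendations possible.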
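
One common way to compare two term vectors is cosine similarity over term frequencies. The sketch below is plain Java under that assumption: each vector is represented as a `Map` from term to frequency (in Lucene the terms and counts would come from a TermFreqVector), and `cosine` is a hypothetical helper, not a Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

class TermVectorSimilaritySketch {
  /** Cosine similarity of two term-frequency maps; 0 if either is empty. */
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0;
    double normA = 0;
    double normB = 0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      int freq = e.getValue();
      normA += freq * freq;
      Integer other = b.get(e.getKey());
      if (other != null) {
        dot += freq * other;  // only shared terms contribute
      }
    }
    for (int freq : b.values()) {
      normB += freq * freq;
    }
    if (normA == 0 || normB == 0) {
      return 0;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    Map<String, Integer> doc1 = new HashMap<String, Integer>();
    doc1.put("lucene", 2);
    doc1.put("search", 1);
    Map<String, Integer> doc2 = new HashMap<String, Integer>();
    doc2.put("lucene", 1);
    doc2.put("index", 3);
    System.out.println(cosine(doc1, doc2));
  }
}
```

Scores close to 1 indicate documents with very similar term distributions, and a score of 0 means they share no terms; ranking other documents by this score gives a simple "similar documents" list.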
 