Implementing Lucene Analyzers

cleaneyes

Blog category: Lucene

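public abstract class Analyzer {
  /** Creates a TokenStream that tokenizes all the text in the provided Reader. */
  public abstract TokenStream tokenStream(String fieldName, Reader reader);

  /**
   * Invoked before indexing a field instance if terms have already been
   * added to that field.
   *
   * @param fieldName Field name being indexed.
   * @return position increment gap, added to the next token emitted from {@link #tokenStream(String,Reader)}
   */
  public int getPositionIncrementGap(String fieldName)
  {
    return 0;
  }
}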
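The Javadoc above only says that the gap is "added to the next token". As a rough illustration of what that means for a field with several values, here is a minimal sketch. It does not use Lucene at all: `PositionGapSketch` and its `positions` method are hypothetical names invented for this example, and tokens are simply whitespace-separated words that each advance the position by one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Lucene code.
class PositionGapSketch {

    /**
     * Assigns a position to every token of a multi-valued field.
     * Each token advances the position by 1. Before a value is processed,
     * the gap is added if the field already contains tokens.
     */
    static List<Integer> positions(List<String> values, int gap) {
        List<Integer> result = new ArrayList<>();
        int position = -1; // position of the last token emitted so far
        for (String value : values) {
            if (!result.isEmpty()) {
                position += gap; // the "position increment gap"
            }
            String trimmed = value.trim();
            if (trimmed.isEmpty()) {
                continue; // nothing to tokenize in this value
            }
            for (String ignoredToken : trimmed.split("\\s+")) {
                position += 1;
                result.add(position);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("red apple", "green pear");
        System.out.println(positions(values, 0));   // prints [0, 1, 2, 3]
        System.out.println(positions(values, 100)); // prints [0, 1, 102, 103]
    }
}
```

With a gap of 0, "apple" and "green" sit at adjacent positions, so a phrase query could match across the two values. A larger gap keeps the values apart.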

String content = "...";
StringReader reader = new StringReader(content);

Analyzer analyzer = new ....();
TokenStream ts = analyzer.tokenStream("", reader);
// Start tokenizing
Token t = null;
while ((t = ts.next()) != null) {
  System.out.println(t.termText());
}

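An analyzer is built from two kinds of components: the tokenizer (Tokenizer) and the filter (TokenFilter). Both extend TokenStream. An analyzer usually consists of one tokenizer followed by several filters.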

public abstract class Tokenizer extends TokenStream {
  /** The text source for this Tokenizer. */
  protected Reader input;

  /** Construct a tokenizer with null input. */
  protected Tokenizer() {}

  /** Construct a token stream processing the given input. */
  protected Tokenizer(Reader input) {
    this.input = input;
  }

  /** By default, closes the input Reader. */
  public void close() throws IOException {
    input.close();
  }
}

public abstract class TokenFilter extends TokenStream {
  /** The source of tokens for this filter. */
  protected TokenStream input;

  /** Construct a token stream filtering the given input. */
  protected TokenFilter(TokenStream input) {
    this.input = input;
  }

  /** Close the input TokenStream. */
  public void close() throws IOException {
    input.close();
  }

}
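To make the "one tokenizer wrapped by several filters" structure concrete, here is a minimal, self-contained sketch of the same decorator pattern. It deliberately does not depend on Lucene: every class here (`MiniTokenStream`, `MiniWhitespaceTokenizer`, `MiniLowerCaseFilter`, `MiniLengthFilter`) is a hypothetical stand-in invented for this example, and tokens are plain strings rather than Lucene `Token` objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified stand-ins for the TokenStream hierarchy; not Lucene code.
class AnalyzerChainSketch {

    /** A stream of tokens; next() returns null when the stream is exhausted. */
    abstract static class MiniTokenStream {
        abstract String next();
    }

    /** Tokenizer role: turns raw text into tokens by splitting on whitespace. */
    static class MiniWhitespaceTokenizer extends MiniTokenStream {
        private final String[] parts;
        private int index = 0;

        MiniWhitespaceTokenizer(String text) {
            String trimmed = text.trim();
            this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        @Override
        String next() {
            return index < parts.length ? parts[index++] : null;
        }
    }

    /** Filter role: wraps another stream and lower-cases each token. */
    static class MiniLowerCaseFilter extends MiniTokenStream {
        private final MiniTokenStream input;

        MiniLowerCaseFilter(MiniTokenStream input) {
            this.input = input;
        }

        @Override
        String next() {
            String token = input.next();
            return token == null ? null : token.toLowerCase(Locale.ROOT);
        }
    }

    /** Filter role: wraps another stream and drops tokens that are too short. */
    static class MiniLengthFilter extends MiniTokenStream {
        private final MiniTokenStream input;
        private final int minLength;

        MiniLengthFilter(MiniTokenStream input, int minLength) {
            this.input = input;
            this.minLength = minLength;
        }

        @Override
        String next() {
            String token;
            while ((token = input.next()) != null) {
                if (token.length() >= minLength) {
                    return token;
                }
            }
            return null;
        }
    }

    /** Analyzer role: one tokenizer, then each filter wraps the previous stream. */
    static MiniTokenStream tokenStream(String text) {
        MiniTokenStream result = new MiniWhitespaceTokenizer(text);
        result = new MiniLowerCaseFilter(result);
        result = new MiniLengthFilter(result, 2);
        return result;
    }

    /** Drains the chain into a list, mirroring the while-loop usage pattern. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        MiniTokenStream ts = tokenStream(text);
        String t;
        while ((t = ts.next()) != null) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [lucene, is, search, library]
        System.out.println(analyze("Lucene Is A Search Library"));
    }
}
```

Because each filter only sees the stream it wraps, filters can be reordered or swapped without touching the tokenizer.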

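Besides using StandardTokenizer to split the text, the tokenStream method of StandardAnalyzer also applies three filters:

StandardFilter: the standard filter. It mainly processes possessives produced by the tokenizer (such as the 's in He's) and acronyms separated by "." characters.
LowerCaseFilter: the case converter. It turns upper case into lower case.
StopFilter: the stop-word filter. A set of stop words must be passed in when it is constructed.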

 public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopSet);
    return result;
  }

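stopSet is specified when the StandardAnalyzer is constructed. When the no-argument constructor is used, it defaults to the stop words provided by StopAnalyzer.ENGLISH_STOP_WORDS.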
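The "default stop set unless one is passed to the constructor" idea can be sketched without Lucene as follows. `StopSetSketch` is a hypothetical class written for this example, and its `DEFAULT_STOP_WORDS` array is a short illustrative list chosen here; it is an assumption, not the actual contents of StopAnalyzer.ENGLISH_STOP_WORDS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is not Lucene code.
class StopSetSketch {
    // Assumption: a small sample list, not Lucene's real default stop words.
    static final String[] DEFAULT_STOP_WORDS = {"a", "an", "and", "the", "of", "to"};

    private final Set<String> stopSet;

    /** No-argument constructor: falls back to the default stop words. */
    StopSetSketch() {
        this(DEFAULT_STOP_WORDS);
    }

    /** Caller-supplied stop words replace the defaults entirely. */
    StopSetSketch(String[] stopWords) {
        this.stopSet = new HashSet<>(Arrays.asList(stopWords));
    }

    /** Splits on whitespace, lower-cases, then drops tokens found in the stop set. */
    List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !stopSet.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "The History of the Index";
        // prints [history, index]
        System.out.println(new StopSetSketch().analyze(text));
        // prints [the, history, of, the]
        System.out.println(new StopSetSketch(new String[] {"index"}).analyze(text));
    }
}
```

Lower-casing runs before the stop-word check, which matches the filter order in the tokenStream method shown earlier: a stop set of lower-case words would otherwise miss "The".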

2008-05-22 16:01