
Questions about Lucene (2): stemming and lemmatization


Question:

I experimented with the stemming and lemmatization mentioned in the article:

  • Reducing a word to its root form, e.g. "cars" to "car". This operation is called stemming.
  • Transforming a word into its root form, e.g. "drove" to "drive". This operation is called lemmatization.

The experiment did not succeed.

The code is as follows:

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

public class TestNorms {
    public void createIndex() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
        IndexWriter writer = new IndexWriter(d, new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
        Document doc = new Document();
        field.setValue("Hello students was drive");
        doc.add(field);
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }

    public void search() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
        IndexReader reader = IndexReader.open(d);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs docs = searcher.search(new TermQuery(new Term("desc", "drove")), 10);
        System.out.println(docs.totalHits);
    }

    public static void main(String[] args) throws IOException {
        TestNorms test = new TestNorms();
        test.createIndex();
        test.search();
    }
}
Neither plurals nor inflected word forms are matched.

Could the analyzer be the cause?

Answer:

It is indeed the analyzer. StandardAnalyzer performs neither stemming nor lemmatization, so it cannot relate singular to plural forms or one inflection to another.

The article describes the basic principles of full-text retrieval. Understanding them helps in understanding Lucene, but it does not mean that Lucene follows exactly this process.
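This is easy to confirm by dumping the tokens that StandardAnalyzer actually emits for the sample sentence (a minimal sketch; the class name DumpTokens is my own, everything else is the standard Lucene 3.0 analysis API):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class DumpTokens {
    public static void main(String[] args) throws IOException {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        TokenStream ts = analyzer.tokenStream("desc",
                new StringReader("Hello students was drive"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.term());
        }
        // Prints: hello, students, drive ("was" is removed as a stop word).
        // The index therefore contains the literal term "drive" and never "drove",
        // so TermQuery("desc", "drove") cannot match anything.
    }
}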

(1) About stemming

A well-known stemming algorithm is the Porter Stemming Algorithm. Its home page is http://tartarus.org/~martin/PorterStemmer/, and the paper defining it is at http://tartarus.org/~martin/PorterStemmer/def.txt

You can run simple tests on the following page: Porter's Stemming Algorithm Online [http://facweb.cs.depaul.edu/mobasher/classes/csc575/porter.html]

cars -> car

driving -> drive

tokenization -> token

However:

drove -> drove

As you can see, stemming reduces a word to its root by applying suffix rules; it cannot recognize irregular inflections.

The latest Lucene 3.0 already has a PorterStemFilter class implementing this algorithm. Unfortunately no matching Analyzer ships with it, but that is easy to remedy; we can write one ourselves:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

public class PorterStemAnalyzer extends Analyzer
{
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Lower-case the input, then reduce every token to its Porter stem.
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}
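As a quick check, running this filter chain directly reproduces the results of the online demo above, including the unchanged "drove" (a small sketch; the class name StemDemo is my own):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class StemDemo {
    public static void main(String[] args) throws IOException {
        TokenStream ts = new PorterStemFilter(new LowerCaseTokenizer(
                new StringReader("cars driving tokenization drove")));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.term()); // car, drive, token, drove
        }
    }
}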

Use this analyzer in your program and plurals and regular inflections are recognized:

public void createIndex() throws IOException {
  Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
  IndexWriter writer = new IndexWriter(d, new PorterStemAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

  Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
  Document doc = new Document();
  field.setValue("Hello students was driving cars professionally");
  doc.add(field);

  writer.addDocument(doc);
  writer.optimize();
  writer.close();
}

public void search() throws IOException {
  Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
  IndexReader reader = IndexReader.open(d);
  IndexSearcher searcher = new IndexSearcher(reader);
  // Each query should now report one hit: "cars", "driving" and "professionally"
  // were indexed as the stems "car", "drive" and "profession".
  TopDocs docs = searcher.search(new TermQuery(new Term("desc", "car")), 10);
  System.out.println(docs.totalHits);
  docs = searcher.search(new TermQuery(new Term("desc", "drive")), 10);
  System.out.println(docs.totalHits);
  docs = searcher.search(new TermQuery(new Term("desc", "profession")), 10);
  System.out.println(docs.totalHits);
}

(2) About lemmatization

Lemmatization, by contrast, generally relies on a dictionary; only with one can "drove" be mapped to "drive".
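To make the contrast with rule-based stemming concrete, here is a minimal dictionary-lookup sketch (the class TinyLemmatizer and its three-entry word list are purely my own illustration, not part of any library discussed here): the inflected form is simply looked up in a table.

import java.util.HashMap;
import java.util.Map;

public class TinyLemmatizer {
    private final Map<String, String> lemmas = new HashMap<String, String>();

    public TinyLemmatizer() {
        // Irregular forms that no suffix rule can recover:
        lemmas.put("drove", "drive");
        lemmas.put("driven", "drive");
        lemmas.put("was", "be");
    }

    public String lemmatize(String word) {
        String lemma = lemmas.get(word.toLowerCase());
        return lemma != null ? lemma : word; // fall back to the surface form
    }

    public static void main(String[] args) {
        TinyLemmatizer lem = new TinyLemmatizer();
        System.out.println(lem.lemmatize("drove")); // drive
        System.out.println(lem.lemmatize("was"));   // be
    }
}

A real lemmatizer of course needs a full morphological dictionary and some way to disambiguate readings, which is what the tool below provides.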

A web search turned up the European languages lemmatizer [http://lemmatizer.org/]. It is written in C++ for Linux; if you are interested, give it a try.

First download, compile, and install it as described on the site:

libMAFSA is the core of the lemmatizer. All other libraries depend on it. Download the last version from the following page, unpack it and compile:

# tar xzf libMAFSA-0.2.tar.gz
# cd libMAFSA-0.2/
# cmake .
# make
# sudo make install

After this you should install libturglem. You can download it at the same place.

# tar xzf libturglem-0.2.tar.gz
# cd libturglem-0.2
# cmake .
# make
# sudo make install

Next you should install english dictionaries with some additional features to work with.

# tar xzf turglem-english-0.2.tar.gz
# cd turglem-english-0.2
# cmake .
# make
# sudo make install

After installation:

  • /usr/local/include/turglem holds the header files used to compile your own code.
  • /usr/local/share/turglem/english holds the dictionary files; in lemmas.xml we can see "drove" mapped to "drive" and "was" mapped to "be".
  • /usr/local/lib holds the static libraries libMAFSA.a, libturglem.a, libturglem-english.a and libtxml.a, used to link applications.

<l id="DRIVE" p="6" />

<l id="DROVE" p="6" />

<l id="DRIVING" p="6" />

The directory turglem-english-0.2 contains an example program, test_utf8.cpp:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <turglem/lemmatizer.h>
#include <turglem/lemmatizer.hpp>
#include <turglem/english/charset_adapters.hpp>

int main(int argc, char **argv)
{
        char in_s_buf[1024];
        char *nl_ptr;

        tl::lemmatizer lem;

        if(argc != 4)
        {
                printf("Usage: %s words.dic predict.dic flexias.bin\n", argv[0]);
                return -1;
        }

        lem.load_lemmatizer(argv[1], argv[3], argv[2]);

        while (!feof(stdin))
        {
                fgets(in_s_buf, 1024, stdin);
                nl_ptr = strchr(in_s_buf, '\n');
                if (nl_ptr) *nl_ptr = 0;
                nl_ptr = strchr(in_s_buf, '\r');
                if (nl_ptr) *nl_ptr = 0;

                if (in_s_buf[0])
                {
                        printf("processing %s\n", in_s_buf);
                        tl::lem_result pars;
                        size_t pcnt = lem.lemmatize<english_utf8_adapter>(in_s_buf, pars);
                        printf("%d\n", pcnt);
                        for (size_t i = 0; i < pcnt; i++)
                        {
                                std::string s;
                                u_int32_t src_form = lem.get_src_form(pars, i);
                                s = lem.get_text<english_utf8_adapter>(pars, i, 0);
                                printf("PARADIGM %d: normal form '%s'\n", (unsigned int)i, s.c_str());
                                printf("\tpart of speech:%d\n", lem.get_part_of_speech(pars, (unsigned int)i, src_form));
                        }
                }
        }

        return 0;
}

Compile this file and link the static libraries. Pay attention to the link order (the linker resolves symbols left to right, so each static library must come after the code that references it), or the link may fail:

g++ -g -o output test_utf8.cpp -L/usr/local/lib/ -lturglem-english -lturglem -lMAFSA -ltxml

Run the compiled program:

./output /usr/local/share/turglem/english/dict_english.auto \
         /usr/local/share/turglem/english/prediction_english.auto \
         /usr/local/share/turglem/english/paradigms_english.bin

Now run some tests. Although I do not yet fully understand its mechanics, the effect of lemmatization is plain to see:

drove
processing drove
3
PARADIGM 0: normal form 'DROVE'
        part of speech:0
PARADIGM 1: normal form 'DROVE'
        part of speech:2
PARADIGM 2: normal form 'DRIVE'
        part of speech:2

was
processing was
3
PARADIGM 0: normal form 'BE'
        part of speech:3
PARADIGM 1: normal form 'BE'
        part of speech:3
PARADIGM 2: normal form 'BE'
        part of speech:3
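Presumably the three paradigms returned for "drove" cover both the English noun "drove" and the past tense of the verb "drive" (the DROVE/DRIVE entries from lemmas.xml above), while every reading of "was" resolves to the lemma BE.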

Comments:
#1 illu 2010-02-06
Thank you for the answer. I have one more question.
The class above was actually written to test norms; that test failed, and only afterwards did I try stemming and lemmatization.
The question is this:

When creating a Field, an Index value is always required:
Index.NO
Index.ANALYZED
Index.NOT_ANALYZED
These three I understand, but the other two,
Index.NOT_ANALYZED_NO_NORMS
Index.ANALYZED_NO_NORMS
I still cannot figure out.

In particular the concept of norms puzzles me. I don't know what it is for.
Literally, "norm" suggests normalization,
perhaps stripping punctuation and meaningless words,
but according to my tests,
Index.ANALYZED and Index.ANALYZED_NO_NORMS, for example, produce exactly the same results.
So I looked through the source code,
as follows:
/** Expert: Index the field's value without an Analyzer,
     * and also disable the storing of norms.  Note that you
     * can also separately enable/disable norms by calling
     * {@link Field#setOmitNorms}.  No norms means that
     * index-time field and document boosting and field
     * length normalization are disabled.  The benefit is
     * less memory usage as norms take up one byte of RAM
     * per indexed field for every document in the index,
     * during searching.  Note that once you index a given
     * field <i>with</i> norms enabled, disabling norms will
     * have no effect.  In other words, for this to have the
     * above described effect on a field, all instances of
     * that field must be indexed with NOT_ANALYZED_NO_NORMS
     * from the beginning. */
    NOT_ANALYZED_NO_NORMS {
      @Override
      public boolean isIndexed()  { return true;  }
      @Override
      public boolean isAnalyzed() { return false; }
      @Override
      public boolean omitNorms()  { return true;  }   	
    },

    /** Expert: Index the tokens produced by running the
     *  field's value through an Analyzer, and also
     *  separately disable the storing of norms.  See
     *  {@link #NOT_ANALYZED_NO_NORMS} for what norms are
     *  and why you may want to disable them. */
    ANALYZED_NO_NORMS {
      @Override
      public boolean isIndexed()  { return true;  }
      @Override
      public boolean isAnalyzed() { return true;  }
      @Override
      public boolean omitNorms()  { return true;  }   	
    };


I still don't get it.

So I would like to ask you:
Index.NOT_ANALYZED_NO_NORMS
Index.ANALYZED_NO_NORMS
What is the concept of a norm, and what does NO_NORMS actually leave out?
Thanks.
