Installing the ik plugin for Elasticsearch

donlianli


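I wanted to install a Chinese word-segmentation plugin for elasticsearch, but the material available online is somewhat out of date.

So here is a record of how I installed the ik plugin from source.

(Note: the version I am using is 0.90.2.)

1. Download the source

First, download the source from ik's GitHub repository: https://github.com/medcl/elasticsearch-analysis-ik

The source does not come with a prebuilt jar, so I built one with mvn package.

The resulting jar is named elasticsearch-analysis-ik-1.2.2.jar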

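2. Copy the files

This step is simple: copy the jar into the ES_HOME/plugins/analysis-ik directory.

Then copy the contents of the config/ik directory to ES_HOME/config/ik (my local machine runs Windows and es runs on Linux, so I first packed the folder into a zip file and then unpacked it on the server).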
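The zip round trip can also be scripted. The following is only a minimal sketch using Python's standard library: the function names and the archive name ik-config are made up for illustration, and moving the zip between the two machines (scp, ftp, ...) is left out.

```python
import shutil
from pathlib import Path


def pack_ik_config(source_root: Path, out_dir: Path) -> Path:
    """Zip the contents of <source_root>/config/ik; return the archive path."""
    base_name = out_dir.resolve() / "ik-config"  # hypothetical archive name
    # make_archive appends ".zip" and stores entries relative to root_dir.
    archive = shutil.make_archive(
        str(base_name), "zip", root_dir=str(source_root / "config" / "ik")
    )
    return Path(archive)


def unpack_ik_config(archive: Path, es_home: Path) -> Path:
    """Unpack the archive into <es_home>/config/ik; return that directory."""
    target = es_home / "config" / "ik"
    target.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(target), "zip")
    return target
```

pack_ik_config would run on the local machine against the checked-out source tree, and unpack_ik_config on the server with the real ES_HOME.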

3. Add the configuration

Edit elasticsearch.yml and append the following at the end of the file:

index:
  analysis:
    analyzer:
      ik:
        alias: [ik_analyzer]
        type: org.elasticsearch.index.analysis.IkAnalyzerProvider
      ik_max_word:
        type: ik
        use_smart: false
      ik_smart:
        type: ik
        use_smart: true

Then restart elasticsearch.

4. Test the analyzer plugin

For reasons I have not worked out, the following command did not work for testing:

curl 'http://localhost:9200/_analyze?analyzer=ik&pretty=true' -d'
{
	"text":"去北京怎么走"
}
'

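Judging from the es log, however, the plugin does seem to have been loaded.

Following the ik plugin's instructions, I created an index, and the same query works when it is run against that index. A likely explanation is that analyzers configured under index.analysis are only set up per index, so the index-less _analyze endpoint cannot resolve the name ik.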

curl -XPUT http://localhost:9200/index

curl -XPOST http://localhost:9200/index/fulltext/_mapping -d'
{
    "fulltext": {
             "_all": {
            "indexAnalyzer": "ik",
            "searchAnalyzer": "ik",
            "term_vector": "no",
            "store": "false"
        },
        "properties": {
            "content": {
                "type": "string",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "indexAnalyzer": "ik",
                "searchAnalyzer": "ik",
                "include_in_all": "true",
                "boost": 8
            }
        }
    }
}'

// test command
curl 'http://localhost:9200/index/_analyze?analyzer=ik&pretty=true' -d'
{
	"text":"去北京怎么走"
}
'

The analysis result is as follows:

{
  "tokens" : [ {
    "token" : "text",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "ENGLISH",
    "position" : 1
  }, {
    "token" : "去",
    "start_offset" : 11,
    "end_offset" : 12,
    "type" : "CN_CHAR",
    "position" : 2
  }, {
    "token" : "北京",
    "start_offset" : 12,
    "end_offset" : 14,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "怎么走",
    "start_offset" : 14,
    "end_offset" : 17,
    "type" : "CN_WORD",
    "position" : 4
  } ]
}
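One detail in the response deserves a note: the first token is text, which is the JSON key rather than part of the sentence. This suggests that this version analyzes the whole request body as plain text instead of parsing it as JSON. The sketch below checks that reading against the offsets in the response. It assumes the body was a newline, {, a newline, the "text" line with no leading whitespace, a newline, } and a final newline; with a tab in front of "text", every offset would be one larger.

```python
# Assumed raw request body: everything between the single quotes after -d,
# with no indentation before "text" (an assumption, see the note above).
body = '\n{\n"text":"去北京怎么走"\n}\n'

# (token, start_offset, end_offset) copied from the response above.
reported = [
    ("text", 4, 8),
    ("去", 11, 12),
    ("北京", 12, 14),
    ("怎么走", 14, 17),
]


def offsets_match(raw, tokens):
    """Return True if every token sits at its reported character offsets in raw."""
    return all(raw[start:end] == token for token, start, end in tokens)


print(offsets_match(body, reported))  # prints True
```

If this reading is right, sending the sentence alone as the request body, without the JSON wrapper, should avoid the stray text token.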

5. Addendum

When I tested the text "中华人民共和国", I found that it was not segmented at all:

 curl 'http://localhost:9200/index/_analyze?analyzer=ik&pretty=true' -d'  
> {  
>     "text":"中华人民共和国"  
> }  
> '
{
  "tokens" : [ {
    "token" : "text",
    "start_offset" : 12,
    "end_offset" : 16,
    "type" : "ENGLISH",
    "position" : 1
  }, {
    "token" : "中华人民共和国",
    "start_offset" : 19,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 2
  } ]
}

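This is not the result we want. Is ik really so poor that it cannot segment this phrase? After some digging, I found that ik has a smart mode, and that it is the default. In this mode, a search for "中华人民共和国" may fail to find documents that contain only "共和国". Using the ik_max_word analyzer instead fixes the problem. As for analyzers in general, I am still exploring.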

curl 'http://localhost:9200/index/_analyze?analyzer=ik_max_word&pretty=true' -d'  
> {  
>     "text":"中华人民共和国"  
> }  
> '
{
  "tokens" : [ {
    "token" : "text",
    "start_offset" : 12,
    "end_offset" : 16,
    "type" : "ENGLISH",
    "position" : 1
  }, {
    "token" : "中华人民共和国",
    "start_offset" : 19,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "中华人民",
    "start_offset" : 19,
    "end_offset" : 23,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "中华",
    "start_offset" : 19,
    "end_offset" : 21,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "华人",
    "start_offset" : 20,
    "end_offset" : 22,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "人民共和国",
    "start_offset" : 21,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 6
  }, {
    "token" : "人民",
    "start_offset" : 21,
    "end_offset" : 23,
    "type" : "CN_WORD",
    "position" : 7
  }, {
    "token" : "共和国",
    "start_offset" : 23,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 8
  }, {
    "token" : "共和",
    "start_offset" : 23,
    "end_offset" : 25,
    "type" : "CN_WORD",
    "position" : 9
  }, {
    "token" : "国",
    "start_offset" : 25,
    "end_offset" : 26,
    "type" : "CN_CHAR",
    "position" : 10
  } ]
}
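The two responses can be compared directly to see why the mode matters for search. The sketch below copies the tokens from both responses (leaving out the stray text token) and checks them against a document that contains only 共和国. That such a document is indexed as the single token 共和国 is an assumption made for illustration, not output from the plugin, and the matching rule is deliberately simplified.

```python
# Query-side tokens for "中华人民共和国", copied from the two responses above
# (the stray "text" token from the JSON wrapper is left out).
ik_tokens = {"中华人民共和国"}
ik_max_word_tokens = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国",
}

# Assumption for illustration: a document containing only "共和国"
# is indexed as this single token.
doc_tokens = {"共和国"}


def can_match(query_tokens, indexed_tokens):
    """Simplified rule: a hit needs at least one token shared by both sides."""
    return not query_tokens.isdisjoint(indexed_tokens)


print(can_match(ik_tokens, doc_tokens))           # prints False
print(can_match(ik_max_word_tokens, doc_tokens))  # prints True
```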

Please support original work:

http://donlianli.iteye.com/blog/1948841

Interested in this kind of topic? Feel free to email donlianli@126.com

About me: from Handan; skilled in Java, Javascript, Extjs and oracle sql.

2013-09-28 16:50