ElasticSearch (5) – ICU Analyzer
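The icu_tokenizer uses the same Unicode Text Segmentation algorithm as the standard tokenizer. The difference is better support for Asian languages: it adds dictionary-based word identification for Thai, Lao, Chinese, Japanese, and Korean, and it can use custom rules to break Myanmar and Khmer text into syllables.
The ICU analyzer does not ship with a default ElasticSearch installation; it has to be installed separately as a plugin.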
Installing the ICU analyzer
1. Go to the ElasticSearch bin directory
cd elasticsearch/bin
2. Install the ICU analysis plugin
./elasticsearch-plugin install analysis-icu
3. Restart ElasticSearch once the installation has finished
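After the restart it can be worth confirming that the node actually loaded the plugin. The following is a minimal sketch, assuming a node reachable at http://localhost:9200 without authentication (both are assumptions, adjust them for your setup). It reads the _cat/plugins endpoint in JSON form; the helper names installed_plugins and fetch_cat_plugins are made up for this illustration.

```python
import json
import urllib.request


def installed_plugins(cat_plugins_json):
    """Return the set of plugin names found in a _cat/plugins?format=json reply."""
    # Each row describes one plugin on one node; "component" holds the plugin name.
    return {row["component"] for row in json.loads(cat_plugins_json)}


def fetch_cat_plugins(base_url="http://localhost:9200"):
    """Fetch the raw plugin list from a running node (assumed address, no auth)."""
    with urllib.request.urlopen(base_url + "/_cat/plugins?format=json") as resp:
        return resp.read().decode("utf-8")
```

With a node running, "analysis-icu" in installed_plugins(fetch_cat_plugins()) should evaluate to True once the plugin is installed and the node has been restarted.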
ICU analyzer in action
# ICU analyzer test
GET _analyze
{
  "analyzer": "icu_analyzer",
  "text": "股市投资稳赚不赔必修课:如何做好仓位管理和情绪管理"
}

# Response:
{
  "tokens" : [
    { "token" : "股市", "start_offset" : 0, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 0 },
    { "token" : "投资", "start_offset" : 2, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "稳赚", "start_offset" : 4, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "不", "start_offset" : 6, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "赔", "start_offset" : 7, "end_offset" : 8, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "必修", "start_offset" : 8, "end_offset" : 10, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "课", "start_offset" : 10, "end_offset" : 11, "type" : "<IDEOGRAPHIC>", "position" : 6 },
    { "token" : "如何", "start_offset" : 12, "end_offset" : 14, "type" : "<IDEOGRAPHIC>", "position" : 7 },
    { "token" : "做好", "start_offset" : 14, "end_offset" : 16, "type" : "<IDEOGRAPHIC>", "position" : 8 },
    { "token" : "仓", "start_offset" : 16, "end_offset" : 17, "type" : "<IDEOGRAPHIC>", "position" : 9 },
    { "token" : "位", "start_offset" : 17, "end_offset" : 18, "type" : "<IDEOGRAPHIC>", "position" : 10 },
    { "token" : "管理", "start_offset" : 18, "end_offset" : 20, "type" : "<IDEOGRAPHIC>", "position" : 11 },
    { "token" : "和", "start_offset" : 20, "end_offset" : 21, "type" : "<IDEOGRAPHIC>", "position" : 12 },
    { "token" : "情绪", "start_offset" : 21, "end_offset" : 23, "type" : "<IDEOGRAPHIC>", "position" : 13 },
    { "token" : "管理", "start_offset" : 23, "end_offset" : 25, "type" : "<IDEOGRAPHIC>", "position" : 14 }
  ]
}
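The same request can also be sent from a script instead of the console. This is only a sketch using the Python standard library, assuming a node at http://localhost:9200 without authentication; the function names build_analyze_body, token_strings, and analyze are hypothetical helpers, not part of any ElasticSearch client.

```python
import json
import urllib.request


def build_analyze_body(text, analyzer="icu_analyzer"):
    """Serialize the _analyze request body as UTF-8 encoded JSON bytes."""
    body = {"analyzer": analyzer, "text": text}
    # ensure_ascii=False keeps the Chinese text readable in the payload.
    return json.dumps(body, ensure_ascii=False).encode("utf-8")


def token_strings(response):
    """Return the token texts of a parsed _analyze response, ordered by position."""
    tokens = sorted(response["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in tokens]


def analyze(text, base_url="http://localhost:9200"):
    """POST the text to _analyze on a running node (assumed address, no auth)."""
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=build_analyze_body(text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling token_strings(analyze("股市投资稳赚不赔必修课")) against a node with the plugin installed should produce the same token texts as the console response above, in the same order.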