- 浏览: 596422 次
- 性别:
- 来自: 北京
文章分类
- 全部博客 (263)
- 默认类别 (0)
- STRUTS HIBERNATE (2)
- STRUTS SPRING HIBERNATE (18)
- SOA/WEB service (8)
- PORTAL (3)
- 想法和设计思想 (17)
- SMARTEAM 二次开发 (0)
- ACTIVEBPEL (0)
- ERP (0)
- EAI (0)
- 甲醇汽油 (0)
- webwork freemarker spring hibernate (1)
- 工作流技术研究 (1)
- ROR (5)
- 搜索引擎 (7)
- 3.非技术区 (0)
- 1.网站首页原创Java技术区(对首页文章的要求: 原创、高质量、经过认真思考并精心写作。BlogJava管理团队会对首页的文章进行管理。) (2)
- 2.Java新手区 (2)
- 4.其他技术区 (0)
- ESB (1)
- Petals ESB (6)
- 手机开发 (1)
- docker dedecms (1)
最新评论
-
w630636065:
楼主,期待后续!!!!!!!!
生成文本聚类java实现 (2) -
zilong513:
十分感谢楼主,期待后续。
生成文本聚类java实现 (2) -
qqgoodluck:
可否介绍一下您的选型依据,包括Petal ESB与MULE等E ...
Petals ESB 简介 -
jackiee_cn:
写的比较清楚,学习了
Petals ESB 集群实战 -
忙两夜:
你好,能发一下源代码吗
抓取口碑网店铺资料
拜读了solr的部分源码,却急于弄明白solr的索引顺序和查询顺序,如下是探访结果.
所有的配置都在solr/example/solr/conf/schema.xml当中.
<!-- 如下是对text类型的处理 --> <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true"> <!-- 索引顺序1空格2同义词3过滤词4拆字5小写过滤6关键字7词干抽取算法--> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- in this example, we will only use synonyms at query time <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/> --> <!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. --> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.PorterStemFilterFactory"/> </analyzer> <!-- 查询顺序1空格2同义词3过滤词4拆字5小写过滤6关键字7词干抽取算法--> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.PorterStemFilterFactory"/> </analyzer> </fieldType> <!-- Less flexible matching, but less false matches. Probably not ideal for product names, but may be good for SKUs. Can insert dashes in the wrong place and still match. --> <!-- 针对textTight类型--> <fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" > <!-- 查询顺序1空格2同义词3过滤词4拆字5小写过滤6关键字7英文相近词8去除重复词 --> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.EnglishMinimalStemFilterFactory"/> <!-- this filter can remove any duplicate tokens that appear at the same position - sometimes possible with WordDelimiterFilter in conjuncton with stemming. --> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> </fieldType> <!-- A general unstemmed text field - good if one does not know the language of the field --> <!-- 针对textgen类型 --> <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100"> <!-- 索引顺序1空格2过滤词3拆字4小写过滤--> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> <!-- 查询顺序1空格2同义词3过滤词4小写过滤--> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> <!-- A general unstemmed text field that indexes tokens normally and also reversed (via ReversedWildcardFilterFactory), to enable more efficient leading wildcard queries. --> <!-- 针对text_rev类型 --> <fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100"> <!-- 索引顺序1空格2过滤词3拆字4小写过滤6转义通配符--> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/> </analyzer> <!-- 查询顺序1空格2同义词3过滤词4拆字5小写过滤 --> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> <!-- charFilter + WhitespaceTokenizer --> <!-- <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" > <analyzer> <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/> <tokenizer class="solr.WhitespaceTokenizerFactory"/> </analyzer> </fieldType> --> <!-- This is an example of using the KeywordTokenizer along With various TokenFilterFactories to produce a sortable field that does not include some properties of the source text --> <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true"> <analyzer> <!-- KeywordTokenizer does no actual tokenizing, so the entire input string is preserved as a single token --> <tokenizer class="solr.KeywordTokenizerFactory"/> <!-- The LowerCase TokenFilter does what you expect, which can be when you want your sorting to be case insensitive --> <filter class="solr.LowerCaseFilterFactory" /> <!-- The TrimFilter removes any leading or trailing whitespace --> <filter class="solr.TrimFilterFactory" /> <!-- The PatternReplaceFilter gives you the flexibility to use Java Regular expression to replace any sequence of characters matching a pattern with an arbitrary replacement string, which may include back references to portions of the original string matched by the pattern. See the Java Regular Expression documentation for more information on pattern and replacement string syntax. http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html --> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" /> </analyzer> </fieldType> <fieldtype name="phonetic" stored="false" indexed="true" class="solr.TextField" > <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/> </analyzer> </fieldtype> <fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField" > <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- The DelimitedPayloadTokenFilter can put payloads on tokens... for example, a token of "foo|1.4" would be indexed as "foo" with a payload of 1.4f Attributes of the DelimitedPayloadTokenFilterFactory : "delimiter" - a one character delimiter. Default is | (pipe) "encoder" - how to encode the following value into a playload float -> org.apache.lucene.analysis.payloads.FloatEncoder, integer -> o.a.l.a.p.IntegerEncoder identity -> o.a.l.a.p.IdentityEncoder Fully Qualified class name implementing PayloadEncoder, Encoder must have a no arg constructor. --> <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/> </analyzer> </fieldtype> <!-- lowercases the entire field value, keeping it as a single token. --> <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.KeywordTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory" /> </analyzer> </fieldType>
大致的索引顺序会是:
1.空格..............................solr.WhitespaceTokenizerFactory
2同义词............................solr.SynonymFilterFactory
3过滤词...........................solr.StopFilterFactory
4拆字..............................solr.WordDelimiterFilterFactory
5小写过滤.....................solr.LowerCaseFilterFactory
6关键字.........................solr.KeywordMarkerFilterFactory
7词干抽取算法............solr.PorterStemFilterFactory
大致的搜素顺序是:
1.空格..............................solr.WhitespaceTokenizerFactory
2同义词............................solr.SynonymFilterFactory
3过滤词...........................solr.StopFilterFactory
4拆字..............................solr.WordDelimiterFilterFactory
5小写过滤.....................solr.LowerCaseFilterFactory
6关键字.........................solr.KeywordMarkerFilterFactory
7英文相近词..................solr.EnglishMinimalStemFilterFactory
8去除重复词.................solr.RemoveDuplicatesTokenFilterFactory
当然了,你可以根据自己的权重来重新分配索引和搜素顺序
发表评论
-
通过commons.httpclient 自动识别网页编码
2012-05-29 16:43 2806在自动化抓取网站的时候,最讨厌的就是不同的网站有多种不 ... -
偷梁换柱:MMSeg4j借用庖丁解牛的词库
2011-05-11 10:41 3321“……他不回答,对柜里说,“温两碗酒,要一碟茴香豆。”便 ... -
随想:迅速取得海量数据之结构化数据
2011-05-11 10:09 1727呵呵,要想瞬间取得需要的数据,比如新闻信息,而且能够分门别 ... -
生成文本聚类java实现 (2)
2011-04-12 10:02 8208呵呵,继续。 本节的学习内容: 4.从剩余的 ... -
抓取口碑网店铺资料
2011-04-11 10:56 2443呵呵,只为自己玩,哈哈。 技术难度: 1 ... -
solr基础知识
2010-12-27 15:43 1862对 solr1.4版本 准备 下载 ...
相关推荐
solr创建索引并查询,希望能够帮助有需要的人。。。
主要讲解了 solr客户端如何调用带账号密码的solr服务器调用,实现添加索引和查询索引,以及分组查询
Solr 索引 测试报告 性能
solr初学者很受用的!讲解了solr怎么创建索引的及其原理,以及查询
solr配置中文解析器和将数据导入solr索引库时所需的jar包
在tomcat中配置solr,以及solr 全文搜索建立索引的相关方法总结
solr在做检索的时候时常需要得知他的性能参数,此处使用8G内存,双核处理器测试的结果
NULL 博文链接:https://takeme.iteye.com/blog/1849781
Solr接受xml格式数据更新、提交、修改索引。
springboot、Dubbo、MySQL,源码web系统,框架,代码均经过严格测试,可直接运行,有需要可自取
Solr数据库插入(全量和增量)索引,全量一般用于第一次创建索引情况,批量一般更新数据部分创建索引。
solr增量导入更新索引包
包含solr介绍、全局索引介绍、ik分词器安装包、solr安装包、及各个部分的安装教程。
solr索引服务基础知识[收集].pdf
索引是设计表的一部分,创建的索引对sql的语句木有任何影响,对sql语句的执行效率有影响
solr定时索引(增量索引、完整索引)需要用到的jar包和配置 支持7.3版本
这是我用window xp的自己按装,总结了,现在共享,希望给新手有帮助,“搜索引擎solr环境配置、分词及索引操作”
Solr是一个独立的企业级搜索应用服务器,它对外提供类似于Web-service的API接口。...Solrj 是访问 Solr 的 Java 客户端,它提供添加、更新和查询Solr 索引的接口。http://wiki.chenlb.com/solr/doku.php?id=solrj
NULL 博文链接:https://stranger2008.iteye.com/blog/1812721
NULL 博文链接:https://mozhenghua.iteye.com/blog/2275318