yanfaguanli
Building a Vertical Search Engine from Scratch with Lucene 4.6 + Solr 4.6 + Heritrix 1.14 + S2SH: A Hands-On Course

 

I have a course I would like to share. Anyone who is interested can add me on QQ and get in touch: QQ 2059055336.

I. Course Overview:

1. Overall approach

The whole course follows a from-scratch progression. All of the data comes from the internet and is crawled with Heritrix. The crawled data is deduplicated and stripped of HTML tags, then indexed and searched with Lucene and Solr, as shown in the figure below:

<img alt="" src="http://www.ibeifeng.com/images/upload/Image/sl(2).jpg" title="基于Lucene4.6+Solr4.6+Heritrix1.14+S2SH实战开发从无到有垂直搜索引擎" style="margin:0px; padding:0px; border:medium none; list-style:none">

Java development is taught through the page deduplication and HTML parsing topics, and design patterns are taught through the packaging of the search service tools. The project uses jQuery on the front end and SSH2 (Struts 2, Spring, Hibernate) on the back end.

<strong>2. Content outline:</strong>

<strong>I. Theory:</strong>

<strong>2.1 Setting up Heritrix</strong>

1. What is a web crawler

2. What a web crawler can do

3. How Heritrix works

4. Setting up Heritrix

<strong>2.2 How to do focused (topic-specific) crawling</strong>

1. What is focused crawling

2. Why focused crawling matters

3. Strategies for focused crawling

4. How to do focused crawling with Heritrix

<strong>2.3 Optimizing Heritrix</strong>

1. The ELFHash algorithm

2. About robots.txt

3. Packaging Heritrix as a tool
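
The outline names the ELFHash algorithm without further detail. As a minimal sketch, the classic ELF string hash can be written in Java as below. The class name `ElfHashDemo` and the `bucket` helper are hypothetical, and spreading URLs across a fixed number of buckets is only an assumed use case, not something the course description confirms.

```java
final class ElfHashDemo {
    /** Classic ELF hash over the chars of the input; the result is non-negative and fits in 28 bits. */
    static int elfHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h << 4) + s.charAt(i);
            int g = h & 0xF0000000;   // isolate the top four bits
            if (g != 0) {
                h ^= g >>> 24;        // fold them back into the lower bits
            }
            h &= ~g;                  // then clear them
        }
        return h;
    }

    /** Hypothetical helper: map a URL to one of n buckets (assumed use case). */
    static int bucket(String url, int n) {
        return elfHash(url) % n;
    }

    public static void main(String[] args) {
        System.out.println(elfHash("http://example.com/a"));
        System.out.println(bucket("http://example.com/a", 8));
    }
}
```

Because the top four bits are cleared on every step, the hash never goes negative, so the modulo in `bucket` always yields a valid index.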

<strong>2.4 Parsing HTML pages</strong>

1. Java regular expressions

2. Extracting page content based on templates

3. Parsing HTML with htmlparser
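
To give a feel for the regular-expression item above, here is a minimal sketch that pulls the `<title>` out of a page and strips the remaining tags using only `java.util.regex`. The class and method names are hypothetical. Regular expressions are not a robust HTML parser, which is presumably why the outline also covers a dedicated parser library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlTextDemo {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)[^>]*>.*?</\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the trimmed content of the first <title> element, or "" if there is none. */
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    /** Drops script/style blocks, removes all remaining tags, and collapses whitespace. */
    static String stripTags(String html) {
        String withoutScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Demo </title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(title(html));
        System.out.println(stripTags(html));
    }
}
```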

<strong>2.5 Introduction to Chinese word segmentation</strong>

1. Lucene's built-in analyzers

2. ICTCLAS

3. IK

4. Using machine learning algorithms to identify domain terms in Chinese text

<strong>2.6 Web page deduplication</strong>

1. Why web page deduplication matters

2. The main approaches to web page deduplication

3. What is tf*idf

4. Web page deduplication based on fingerprint algorithms
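
The outline does not say which fingerprint algorithm the course uses. As one possible illustration, the sketch below uses SimHash, a well-known fingerprinting technique, with raw term frequency as the weight. Everything here is an assumption made for the example: the class name, the FNV-1a hash used per term, and the whitespace tokenizer (Chinese text would need a real segmenter such as the ones listed in 2.5).

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class FingerprintDemo {
    /** 64-bit FNV-1a hash of a token, used here as the per-term hash. */
    static long fnv1a64(String token) {
        long h = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** SimHash: every term votes on each of the 64 bits, weighted by its frequency. */
    static long simHash(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                tf.merge(t, 1, Integer::sum);
            }
        }
        int[] votes = new int[64];
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            long h = fnv1a64(e.getKey());
            for (int bit = 0; bit < 64; bit++) {
                boolean set = ((h >>> bit) & 1L) == 1L;
                votes[bit] += set ? e.getValue() : -e.getValue();
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= 1L << bit;
            }
        }
        return fingerprint;
    }

    /** Number of bit positions in which two fingerprints differ. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /** Treats two texts as near-duplicates if their fingerprints differ in at most maxBits bits. */
    static boolean nearDuplicate(String a, String b, int maxBits) {
        return hammingDistance(simHash(a), simHash(b)) <= maxBits;
    }
}
```

Two pages are then compared by the Hamming distance between their fingerprints rather than by their full text. The `maxBits` threshold is a tuning parameter that would have to be chosen empirically.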

<strong>2.7 Fast indexing and searching with Lucene 4.6</strong>

1. How to create an index with Lucene

2. How to search with Lucene

3. How to search an IntField in Lucene

4. Highlighting search results in Lucene

<strong>2.8 Index operations in Lucene 4.6</strong>

1. Creating an index

2. Updating an index

3. Deleting from an index

4. Optimizing an index

<strong>2.9 Query types and QueryParser in Lucene 4.6</strong>

1. TermQuery

2. BooleanQuery

3. TermRangeQuery

4. NumericRangeQuery

5. PrefixQuery

6. PhraseQuery

7. MultiPhraseQuery

8. FuzzyQuery

9. WildcardQuery

10. QueryParser

<strong>2.10 Lucene Filters and custom sorting</strong>

1. Filter

2. Lucene's built-in sorting and setting weights (boosts)

3. Custom sorting in Lucene

<strong>2.11 Fast indexing and searching with Solr</strong>

1. What is Solr

2. Why use Solr in a project

3. How Solr works

4. How to run Solr in Tomcat

5. How to index and search with Solr

<strong>2.12 Solr queries and Filters</strong>

1. Solr's query types

2. Solr Filters

3. Sorting in Solr

4. Highlighting in Solr

<strong>2.13 Introduction to Solr faceting</strong>

1. Faceting (counting) on a single field in Solr

2. Range faceting in Solr

<strong>2.14 Setting up a SolrCloud cluster</strong>

1. Introduction to ZooKeeper

2. Setting up a SolrCloud cluster

<strong>2.15 Packaging the search service as a tool</strong>

1. The factory pattern

2. Wrapping the search service: Lucene

3. Wrapping the search service: Solr

4. Packaging Lucene and Solr into a configurable tool that can support any business system
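
The outline pairs the factory pattern with a search service that can be backed by either Lucene or Solr. A minimal sketch of that idea follows, under stated assumptions: the interface, class names, and configuration strings are all hypothetical, and the two implementations are stubs that do not call Lucene or Solr at all.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical common interface that the calling code depends on. */
interface SearchService {
    List<String> search(String query);
    String backend();
}

/** Stub standing in for a Lucene-backed implementation (no real Lucene calls). */
final class LuceneSearchService implements SearchService {
    public List<String> search(String query) {
        return Collections.singletonList("lucene-result-for:" + query);
    }
    public String backend() { return "lucene"; }
}

/** Stub standing in for a Solr-backed implementation (no real Solr calls). */
final class SolrSearchService implements SearchService {
    public List<String> search(String query) {
        return Collections.singletonList("solr-result-for:" + query);
    }
    public String backend() { return "solr"; }
}

/** Factory that picks the implementation from a configuration value. */
final class SearchServiceFactory {
    private SearchServiceFactory() { }

    static SearchService create(String backend) {
        switch (backend.toLowerCase(Locale.ROOT)) {
            case "lucene":
                return new LuceneSearchService();
            case "solr":
                return new SolrSearchService();
            default:
                throw new IllegalArgumentException("Unknown search backend: " + backend);
        }
    }
}
```

Calling code would ask for `SearchServiceFactory.create("solr")` and work only with the `SearchService` interface, so switching backends becomes a configuration change rather than a code change.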

<strong>II. Project:</strong>

<strong>2.16 Hands-on project</strong>

1. Project requirements analysis and framework selection

2. Introduction to Struts 2.3.16

3. Integrating Struts 2.3.16 with Spring 4.0.1

4. Integrating Spring 4.0.1 with Hibernate 4.3.1

5. Building the admin pages with jquery-easyui 1.3.5

6. Using Heritrix in the project

7. Using the packaged search framework in the project

8. Imitating Baidu Wenku with FlexPaper

9. File upload

10. Writing the related code

11. Optimizing search results

12. Project summary

<strong>III. Course Highlights</strong>

3.1 Heritrix is wrapped further so that it can be configured to requirements and run standalone.

3.2 Lucene 4.6.0 and Solr 4.6.0 are wrapped so that, through configuration alone, the vast majority of business systems can index and search their databases and files.

3.3 The latest SSH stack at the time of writing (Struts 2.3.16, Spring 4.0.1, Hibernate 4.3.1) is integrated and combined with the latest version of jquery-easyui, 1.3.5, to build a complete vertical search engine.

3.4 The theory part of the course draws on a large number of papers from core journals and, for Chinese word segmentation, implements an unsupervised recognition method in pure Java. It also implements TF*IDF feature extraction for text, the minimum edit distance algorithm, and text similarity algorithms (the traditional cosine measure and a fingerprint algorithm).
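
The course's own implementations of these algorithms are not shown in this post. As a rough, textbook-style sketch of three of them, the code below computes minimum edit distance (Levenshtein), TF*IDF weights, and cosine similarity. The class name is hypothetical, and the specific formula choices (raw counts for tf, `ln(N / df)` for idf) are assumptions; other variants are common.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TextSimilarityDemo {
    /** Minimum edit distance (Levenshtein) using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** TF*IDF weights for one tokenized document: tf = raw count, idf = ln(N / df). */
    static Map<String, Double> tfIdf(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : doc) {
            weights.merge(term, 1.0, Double::sum);
        }
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            long df = corpus.stream().filter(d -> d.contains(e.getKey())).count();
            // Guard against df == 0 when the document is not part of the corpus.
            double idf = Math.log((double) corpus.size() / Math.max(1L, df));
            e.setValue(e.getValue() * idf);
        }
        return weights;
    }

    /** Cosine of the angle between two sparse term-weight vectors; 0 if either is all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }
}
```

With this idf variant, a term that appears in every document of the corpus gets weight zero, so it contributes nothing to the cosine score.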

Comments
#1 kgtw 2015-09-15
1223137028@qq.com Please send me a copy.
