The term vocabulary and postings lists
Inverted index construction steps:
1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.
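The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production indexer: the function name `build_inverted_index`, the regex tokenizer, and lowercasing as the only linguistic preprocessing are all assumptions made for the example.

```python
import re
from collections import defaultdict

def build_inverted_index(documents):
    """documents: dict mapping docID -> raw text (step 1: the collected documents)."""
    index = defaultdict(list)
    # Visit docIDs in increasing order so every postings list ends up sorted.
    for doc_id in sorted(documents):
        tokens = re.findall(r"\w+", documents[doc_id])   # step 2: tokenize
        terms = {token.lower() for token in tokens}      # step 3: (very light) preprocessing
        for term in terms:                               # step 4: record the document per term
            index[term].append(doc_id)
    return dict(sorted(index.items()))

docs = {1: "I did enact Julius Caesar", 2: "So let it be with Caesar"}
index = build_inverted_index(docs)
print(index["caesar"])
```

Each term maps to a sorted list of docIDs, which is the form the intersection algorithms later in these notes expect.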
2.1 Document delineation and character sequence decoding
Encoding problems: how to auto-detect the character encoding.
Text format problems: doc, pdf, xml, html and so on.
Sequence problems: in Arabic, text takes on some two-dimensional and mixed-order characteristics.
Choosing a document unit: a precision/recall tradeoff. The problems with large document units can be alleviated by use of explicit or implicit proximity search.
2.2 Determining the vocabulary of terms
Token
: Tokenization is the task of chopping a character sequence up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
Difference between token and type:
A token is an instance of a character sequence in some particular document.
A type is the class of all tokens containing the same character sequence.
It is like the difference between an instance and a class in OOP.
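To make the token/type distinction concrete, here is a small sketch. The `tokenize` helper and its regex are assumptions for illustration; real tokenizers handle apostrophes, hyphens, and non-ASCII text far more carefully.

```python
import re

def tokenize(text):
    """Chop text into tokens, throwing away punctuation and whitespace."""
    return re.findall(r"[A-Za-z0-9]+", text)

tokens = tokenize("to sleep perchance to dream")
types = set(tokens)          # each distinct character sequence counted once

print(len(tokens), "tokens")  # every occurrence counts
print(len(types), "types")    # "to" occurs twice but is one type
```

The phrase has five tokens but only four types, because the two occurrences of "to" are instances of the same type.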
Tokenization is language-specific
: Language identification based on classifiers that use short character subsequences as features is highly effective; most languages have distinctive signature patterns.
Chinese word segmentation: forward/backward maximum matching.
Recognition of proper nouns, IP addresses, URLs, email addresses, and phone numbers.
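Forward and backward maximum matching can be sketched as follows. This is a minimal version under stated assumptions: the dictionary is a plain Python set, `max_len` caps the longest word tried, and any character not covered by the dictionary falls back to a single-character token. The function names are made up for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word at the cursor."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:  # single char is the fallback
                tokens.append(candidate)
                i += size
                break
    return tokens

def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word ending at the cursor."""
    tokens = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens

words = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", words))
print(backward_max_match("研究生命起源", words))
```

On this input the two directions disagree: the forward scan greedily takes 研究生 and strands 命, while the backward scan yields 研究 / 生命 / 起源. Comparing both outputs is a common heuristic for spotting ambiguous segmentations.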
2.2.2 Dropping common terms: stop words
Stop words: extremely common words that are of little value in helping select documents matching a user need.
How to collect:
The general strategy for determining a stop list is to sort the terms by collection frequency and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list.
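The collection strategy above can be sketched like this. The function name and the `protected` parameter (standing in for the hand-filtering step) are assumptions for illustration.

```python
from collections import Counter

def build_stop_list(tokenized_docs, size, protected=frozenset()):
    """Return the `size` most frequent terms by collection frequency.

    tokenized_docs: iterable of token lists, one per document.
    protected: terms a human reviewer decided to keep out of the stop list.
    """
    collection_frequency = Counter()
    for tokens in tokenized_docs:
        collection_frequency.update(tokens)  # counts every occurrence, not documents
    ranked = [term for term, _ in collection_frequency.most_common()
              if term not in protected]
    return ranked[:size]

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "and", "the", "cat"],
    ["a", "cat", "and", "a", "dog", "on", "the", "mat"],
]
print(build_stop_list(docs, 2))
print(build_stop_list(docs, 1, protected={"the"}))
```

In this tiny collection "cat" ranks second by raw frequency, which shows why the candidate list is usually hand-filtered against the domain before it is used.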
2.2.3 Normalization (equivalence classing of terms)
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.
Spelling variants: anti-discriminatory and antidiscriminatory
Synonyms: car and automobile
Accents and diacritics
Capitalization/case-folding
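A rough sketch of these normalization steps using only the standard library. The `SYNONYMS` table is a hypothetical hand-built equivalence list, and removing every hyphen is a deliberate simplification that would be wrong for some terms.

```python
import unicodedata

# Hypothetical hand-built equivalence classes: variant -> canonical term.
SYNONYMS = {"automobile": "car"}

def normalize(token):
    """Map a token to a canonical form so superficially different tokens match."""
    # Decompose accented characters, then drop the combining marks (diacritics).
    decomposed = unicodedata.normalize("NFKD", token)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold and remove hyphens (anti-discriminatory -> antidiscriminatory).
    folded = without_marks.casefold().replace("-", "")
    # Finally apply the synonym table, if the term has an entry.
    return SYNONYMS.get(folded, folded)

print(normalize("Anti-Discriminatory"))
print(normalize("naïve"))
print(normalize("Automobile"))
```

The same function must be applied to both indexed tokens and query tokens, otherwise the equivalence classes do not line up.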
2.2.4 Stemming and lemmatization
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
e.g.:
am, are, is ⇒ be
car, cars, car’s, cars’ ⇒ car
Some common algorithms for stemming English:
Porter stemmer, Lovins stemmer, Paice stemmer
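The examples above can be reproduced with a toy reducer. To be clear, this is not the Porter, Lovins, or Paice algorithm: it is a hypothetical sketch that combines a tiny lookup table for irregular forms (a lemmatization-style step) with crude suffix stripping (a stemming-style step).

```python
# Hypothetical lookup for irregular forms; a real lemmatizer uses a full lexicon.
IRREGULAR = {"am": "be", "are": "be", "is": "be"}

def toy_stem(word):
    """Reduce a word to a rough base form (illustration only)."""
    word = word.lower().replace("’", "'")  # treat curly and straight apostrophes alike
    if word in IRREGULAR:
        return IRREGULAR[word]
    # Strip a possessive ending: car's -> car, cars' -> cars.
    if word.endswith("'s"):
        word = word[:-2]
    elif word.endswith("'"):
        word = word[:-1]
    # Strip a plain plural "s", leaving short words and "-ss" words alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word

for w in ["am", "are", "is", "car", "cars", "car's", "cars'"]:
    print(w, "=>", toy_stem(w))
```

Rule-based stemmers such as Porter's apply many more suffix rules in ordered phases; this sketch only covers the handful of forms listed in the notes.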
2.3 Faster postings list intersection via skip pointers
Postings lists intersection with skip pointers:
INTERSECTWITHSKIPS(p1, p2)
  answer ← ()
  while p1 != NIL and p2 != NIL
  do if docID(p1) = docID(p2)
       then ADD(answer, docID(p1))
            p1 ← next(p1)
            p2 ← next(p2)
     else if docID(p1) < docID(p2)
       then if hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
              then while hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
                   do p1 ← skip(p1)
              else p1 ← next(p1)
       else if hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
              then while hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
                   do p2 ← skip(p2)
              else p2 ← next(p2)
  return answer
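A runnable Python sketch of the same idea. The representation is an assumption: each postings list is a sorted Python list of docIDs, and skip pointers are implicit, placed at every index that is a multiple of roughly sqrt(length) and pointing that many entries ahead. The helper name `skip_target` is made up for the example.

```python
import math

def skip_target(postings, i):
    """Index reached by the skip pointer at position i, or None if there is none."""
    step = math.isqrt(len(postings))  # evenly spaced skips, about sqrt(P) apart
    if step > 1 and i % step == 0 and i + step < len(postings):
        return i + step
    return None

def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, following skip pointers when safe."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            target = skip_target(p1, i)
            if target is not None and p1[target] <= p2[j]:
                # Keep skipping while the skip target does not overshoot p2[j].
                while target is not None and p1[target] <= p2[j]:
                    i = target
                    target = skip_target(p1, i)
            else:
                i += 1
        else:
            target = skip_target(p2, j)
            if target is not None and p2[target] <= p1[i]:
                while target is not None and p2[target] <= p1[i]:
                    j = target
                    target = skip_target(p2, j)
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                           [1, 2, 3, 5, 8, 41, 51, 60, 71]))
```

Skips only pay off when the index is relatively static; more skips mean shorter jumps and more pointer comparisons, fewer skips mean longer jumps that succeed less often.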
2.4 Positional postings and phrase queries
Biword indexes
: One approach to handling phrases is to consider every pair of consecutive terms in a document as a phrase. (Not a standard solution.)
Biword extension:
The concept of a biword index can be extended to longer sequences of words, and if the index includes variable-length word sequences, it is generally referred to as a phrase index.
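A small sketch of a biword index and a phrase query against it. The names `build_biword_index` and `phrase_query`, whitespace tokenization, and the sample documents are assumptions for illustration.

```python
from collections import defaultdict

def build_biword_index(documents):
    """Index every pair of consecutive terms; documents maps docID -> text."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        terms = text.lower().split()
        for first, second in zip(terms, terms[1:]):
            index[(first, second)].add(doc_id)
    return index

def phrase_query(index, phrase):
    """Treat a longer phrase as an AND of its consecutive biwords."""
    terms = phrase.lower().split()
    biwords = list(zip(terms, terms[1:]))
    if not biwords:
        return set()
    result = set(index.get(biwords[0], set()))
    for biword in biwords[1:]:
        result &= index.get(biword, set())
    return result

docs = {
    1: "stanford university palo alto",
    3: "stanford university is in palo alto",
    4: "university palo alto stanford university",
}
index = build_biword_index(docs)
print(phrase_query(index, "stanford university palo alto"))
```

Document 4 is returned even though it never contains the full four-word phrase: it happens to contain all three biwords in a different order. Such false positives are one reason biword indexes are not the standard solution.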
Positional indexes (most commonly employed):
store postings of the form docID: <position1, position2, ...>
An algorithm for proximity intersection of postings lists p1 and p2:
POSITIONALINTERSECT(p1, p2, k)
  answer ← ()
  while p1 != NIL and p2 != NIL
  do if docID(p1) = docID(p2)
       then l ← ()
            pp1 ← positions(p1)
            pp2 ← positions(p2)
            while pp1 != NIL
            do while pp2 != NIL
               do if |pos(pp1) − pos(pp2)| ≤ k
                    then ADD(l, pos(pp2))
                    else if pos(pp2) > pos(pp1)
                           then break
                  pp2 ← next(pp2)
               while l != () and |l[0] − pos(pp1)| > k
               do DELETE(l[0])
               for each ps ∈ l
               do ADD(answer, <docID(p1), pos(pp1), ps>)
               pp1 ← next(pp1)
            p1 ← next(p1)
            p2 ← next(p2)
     else if docID(p1) < docID(p2)
       then p1 ← next(p1)
       else p2 ← next(p2)
  return answer
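A Python sketch of proximity intersection. The data layout is an assumption: each postings list is a list of `(docID, positions)` pairs sorted by docID, with positions sorted ascending. Note that the inner scan only stops once the second term's position has moved past the first term's position by more than k; positions that lie too far before it are simply stepped over.

```python
from collections import deque

def positional_intersect(p1, p2, k):
    """Return (docID, pos1, pos2) triples where the two terms occur within k positions."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        doc1, positions1 = p1[i]
        doc2, positions2 = p2[j]
        if doc1 == doc2:
            window = deque()  # positions of term 2 close to the current pos1
            b = 0             # cursor into positions2, never moves backwards
            for pos1 in positions1:
                while b < len(positions2):
                    pos2 = positions2[b]
                    if abs(pos1 - pos2) <= k:
                        window.append(pos2)
                    elif pos2 > pos1:
                        break  # too far ahead; wait for a later pos1
                    b += 1
                # Drop positions that have fallen too far behind pos1.
                while window and abs(window[0] - pos1) > k:
                    window.popleft()
                for pos2 in window:
                    answer.append((doc1, pos1, pos2))
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return answer

first = [(1, [10]), (4, [3, 8])]
second = [(1, [1, 9]), (2, [5]), (4, [4, 20])]
print(positional_intersect(first, second, 2))
```

For an exact phrase query the caller would additionally require `pos2 - pos1 == 1`; the function as written reports matches on either side of the first term.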
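Combination schemes:
Biword indexes and positional indexes can be combined: use a phrase (or biword) index for certain frequently queried phrases and a positional index for other phrase queries.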