Inside Search Engine Technology: Indexing

wbj0110

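The quality of a search engine's index directly affects the engine's performance and, ultimately, the user experience, which is why indexing matters so much.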

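Today we look at indexing techniques. The first thing that comes to mind is the inverted index: its strengths in full-text retrieval and its heavy use in search engines have made it famous, so this post uses it as the subject of analysis. There are many other index structures besides the inverted index, though. Static indexes include bitmaps, signature files, and inverted files; dynamic indexes include B-trees, B+ trees, and so on.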
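
To make one of the other static index types concrete, here is a minimal sketch of a bitmap index, assuming documents are numbered from 0 and are already split into terms. The class name `BitmapIndexSketch` and its methods are hypothetical and exist only for this illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: one bit per document for every term.
class BitmapIndexSketch {
    private final Map<String, BitSet> bitmaps = new HashMap<>();

    // Set bit `doc` in the bitmap of every term that occurs in the document.
    void add(int doc, String... terms) {
        for (String term : terms) {
            bitmaps.computeIfAbsent(term, k -> new BitSet()).set(doc);
        }
    }

    // Conjunctive query: AND together the bitmaps of all query terms.
    BitSet and(String... terms) {
        BitSet result = null;
        for (String term : terms) {
            BitSet bits = bitmaps.get(term);
            if (bits == null) {
                return new BitSet(); // unknown term: no document can match
            }
            if (result == null) {
                result = (BitSet) bits.clone(); // copy so the index is not modified
            } else {
                result.and(bits);
            }
        }
        return result == null ? new BitSet() : result;
    }

    public static void main(String[] args) {
        BitmapIndexSketch index = new BitmapIndexSketch();
        index.add(0, "search", "engine");
        index.add(1, "search", "index");
        index.add(2, "index", "engine");
        System.out.println(index.and("search", "index")); // prints {1}
    }
}
```

A bitmap like this records only whether a term is present, so it has no place to keep a per-document frequency.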

In my view, search engines rely so heavily on the inverted index as their internal index structure for two main reasons:

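1. It is easy to implement and simple to store. More importantly, it makes ranking convenient, and inverted lists can also be compressed. Bitmaps and signature files lack these ranking and compression advantages.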

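2. Query results are easy to process, and Boolean operations are easy to implement. At bottom, today's mainstream search engines all perform Boolean computation. To find the documents that contain both A and B, the engine first fetches the list of documents containing A, then the list of documents containing B, and returns the documents that appear in both lists. In essence this is a conjunctive (AND) query.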
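
The conjunction described in point 2 can be sketched as a merge of two sorted lists. This is only an illustration under the assumption that each list holds document numbers in ascending order; `PostingIntersectionSketch` and `intersect` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class PostingIntersectionSketch {
    // Intersect two ascending document-number lists in O(|a| + |b|) steps.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out.add(a[i]); // document appears in both lists
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++; // advance the list with the smaller document number
            } else {
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docsWithA = {1, 3, 4, 8};
        int[] docsWithB = {2, 3, 8, 9};
        System.out.println(intersect(docsWithA, docsWithB)); // prints [3, 8]
    }
}
```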

The simple program below illustrates how an inverted index (held in memory) works:


import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class ReverseIndex {
    /**
     * Dictionary: term -> inverted list.
     */
    private static Map<String, Set<Node>> dictionary = new HashMap<String, Set<Node>>();
    // Matches runs of word characters; used to split a document into terms.
    private static Pattern extraPattern = Pattern.compile("(\\w+)");

    public void addTerm(String term) {
        if (term != null) {
            term = term.trim().toLowerCase(); // case folding
            if (!dictionary.containsKey(term)) {
                dictionary.put(term, new TreeSet<Node>());
            }
        }
    }
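
    // Build the inverted list: add a (doc, frequency) posting for the term.
    // Terms that are not in the dictionary are skipped.
    private void index(String term, Integer doc, int f) {
        term = term.toLowerCase();
        Set<Node> reverseList = dictionary.get(term);
        if (reverseList != null) {
            reverseList.add(new Node(doc, f));
        }
    }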

    /**
     * This step is really keyword extraction; in a search engine a dedicated
     * extraction program handles it.
     * @param txt the document to index
     * @param doc the document number
     */
    public void buildIndex(String txt, Integer doc) {
        if (txt == null)
            return;
        Matcher m = extraPattern.matcher(txt);
        Map<String, Integer> map = new HashMap<String, Integer>();
        while (m.find()) {
            // Count the frequency of each term, for later use in coordinate-matching rank.
            // Fold case here so that "The" and "the" are counted as one term.
            String t = m.group(1).toLowerCase();
            if (!map.containsKey(t))
                map.put(t, 1);
            else
                map.put(t, map.get(t) + 1);
        }
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            index(entry.getKey(), doc, entry.getValue());
        }
    }

    /*
     * Apply the conjunction (AND) of a Boolean query to the per-term results.
     */
    private Set<Node> mergeResult(List<Set<Node>> queryResult) {
        if (queryResult == null || queryResult.size() == 0) {
            return new TreeSet<Node>();
        }
        Set<Node> min = null;
        for (Set<Node> set : queryResult) {
            // Pick the list with the fewest entries to reduce the work of the conjunction.
            if (min == null) {
                min = set;
            } else if (min.size() > set.size()) {
                min = set;
            }
        }
        // Start from a copy of the shortest list and drop documents missing elsewhere.
        Set<Node> ret = new TreeSet<Node>(min);
        for (Node n : min) {
            for (Set<Node> set : queryResult) {
                if (min == set) {
                    continue;
                } else if (!set.contains(n)) {
                    ret.remove(n); // this term's list lacks the document, so remove it
                    break;
                }
            }
        }
        return ret;
    }

    /**
     * Find the documents that contain the terms in query.
     * Query terms that are not in the dictionary are ignored.
     * @param query whitespace-separated query terms
     * @return the list of matching documents
     */
    public Node[] retrieve(String query) {
        if (query == null)
            return new Node[0];
        query = query.trim().toLowerCase();
        if (query.length() == 0)
            return new Node[0];
        String[] terms = query.split("\\s+");
        List<Set<Node>> queryResults = new ArrayList<Set<Node>>();
        for (String t : terms) {
            Set<Node> reverseList = dictionary.get(t);
            if (reverseList != null) {
                queryResults.add(reverseList);
            }
        }
        Set<Node> result = mergeResult(queryResults);
        return result.toArray(new Node[result.size()]);
    }

    /**
     * @param args unused
     */
    public static void main(String[] args) {
        String books = "This distribution includes cryptographic software. The country in \n"
                + "which you currently reside may have restrictions on the import, \n"
                + "possession, use, and/or re-export to another country, of \n"
                + "encryption software. BEFORE using any encryption software, please \n"
                + "check your country's laws, regulations and policies concerning the \n"
                + "import, possession, or use, and re-export of encryption software, to \n"
                + "see if this is permitted. See <http://www.wassenaar.org/> for more \n"
                + "information. \n"
                + "The U.S. Government Department of Commerce, Bureau of Industry and \n"
                + "Security (BIS), has classified this software as Export Commodity \n"
                + "Control Number (ECCN) 5D002.C.1, which includes information security \n"
                + "software using or performing cryptographic functions with asymmetric \n"
                + "algorithms. The form and manner of this Apache Software Foundation \n"
                + "distribution makes it eligible for export under the License Exception \n"
                + "ENC Technology Software Unrestricted (TSU) exception (see the BIS \n"
                + "Export Administration Regulations, Section 740.13) for both object \n"
                + "code and source code. \n"
                + "The following provides more details on the included cryptographic \n"
                + "software: \n"
                + " Hadoop Core uses the SSL libraries from the Jetty project written \n"
                + "by mortbay.org. \n";
        /*
         * The documents to index: each line is one document.
         */
        String[] docs = books.split("\n");
        // Dictionary terms
        String[] dics = "the provides details Commerce Security Commodity Hadoop libraries TSU laws distribution software country reside import possession".split("\\s+");
        ReverseIndex index = new ReverseIndex();
        for (String term : dics) {
            // Build the dictionary
            index.addTerm(term);
        }
        int sno = 0;
        for (String doc : docs) {
            // Build the index
            index.buildIndex(doc, sno++);
        }
        System.err.println("please type query words:");
        // One Scanner for the whole session; stop at end of input or on "exit".
        Scanner in = new Scanner(System.in);
        while (in.hasNextLine()) {
            String query = in.nextLine();
            if ("exit".equalsIgnoreCase(query))
                break;
            System.out.println("query:" + query);
            Node[] dosList = index.retrieve(query);
            if (dosList.length == 0) {
                System.out.printf("no documents match the query words '%s'%n", query);
                continue;
            }
            for (Node dc : dosList) {
                System.out.printf("[%d,%d] %s%n", dc.doc, dc.frequency, docs[dc.doc]);
            }
        }
    }
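
    // One posting: a document number plus the term's frequency in that document.
    static class Node implements Comparable<Node> {
        private Integer doc;
        private int frequency;

        // Orders postings by document number, highest first. Two nodes for the
        // same document compare as equal, so a TreeSet keeps one per document.
        @Override
        public int compareTo(Node o) {
            return Integer.compare(o.doc, doc);
        }

        public Node(int doc, int f) {
            this.doc = doc;
            this.frequency = f;
        }

        public Integer getDoc() {
            return doc;
        }

        public int getFrequency() {
            return frequency;
        }
    }
}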
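
The program stores a term frequency in every `Node` but returns matches in document-number order. One simple way to use that frequency is sketched here, under the assumption that a document's score is the sum of the frequencies of the query terms it contains. `RankSketch` and `rank` are hypothetical names and are not part of `ReverseIndex`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RankSketch {
    // Each posting list maps document number -> term frequency in that document.
    // Returns the documents that contain every term, highest score first.
    static List<Integer> rank(List<Map<Integer, Integer>> postings) {
        Map<Integer, Integer> scores = new HashMap<>();
        if (postings.isEmpty()) {
            return new ArrayList<>();
        }
        for (Integer doc : postings.get(0).keySet()) {
            int score = 0;
            boolean inAll = true;
            for (Map<Integer, Integer> list : postings) {
                Integer f = list.get(doc);
                if (f == null) { // document is missing one of the terms
                    inAll = false;
                    break;
                }
                score += f;
            }
            if (inAll) {
                scores.put(doc, score);
            }
        }
        List<Integer> docs = new ArrayList<>(scores.keySet());
        // Sort by score descending; break ties by document number ascending.
        docs.sort((x, y) -> {
            int byScore = Integer.compare(scores.get(y), scores.get(x));
            return byScore != 0 ? byScore : Integer.compare(x, y);
        });
        return docs;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> software = new HashMap<>();
        software.put(0, 1);
        software.put(3, 2);
        software.put(5, 1);
        Map<Integer, Integer> country = new HashMap<>();
        country.put(0, 1);
        country.put(3, 1);
        country.put(4, 2);
        System.out.println(rank(Arrays.asList(software, country))); // prints [3, 0]
    }
}
```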

Notes:

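1. An index needs a dictionary, which is simply a set of terms such as dog, cat, hotel, beijing, and so on. A search engine's dictionary usually comes from public knowledge sources: shopping malls, entertainment news, film briefs, and the like.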

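2. The index lives on disk as index files and is loaded into memory, fully or partially, when it is used. In most cases memory cannot hold the whole index, so the index is often stored compressed, and both the dictionary file and the inverted lists have their own compression methods. Common dictionary techniques include prefix compression, minimal perfect hashing, and disk-based dictionaries; inverted lists can be compressed with, for example, unary coding.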

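3. The index shown here is memory-based and does not suit large volumes of index data. Keeping it in memory is fine in this case, since it covers a single article, with each line of the article treated as one document.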

4. In a real search engine, by the time the indexing step runs, all the data it needs is already prepared: keyword extraction, term weights, and so on.
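
Note 2 names prefix compression for the dictionary and unary coding for inverted lists. The sketch below illustrates both under simplifying assumptions: the dictionary is already sorted, document numbers are strictly ascending and start at 1 or higher, a value n is written in unary as n - 1 one-bits followed by a zero-bit, and codes are shown as strings of '0' and '1' rather than packed bits. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CompressionSketch {
    // Prefix (front) coding: store each term as the length of the prefix it
    // shares with the previous term, plus the remaining suffix.
    static List<String> frontCode(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int shared = 0;
            int max = Math.min(prev.length(), term.length());
            while (shared < max && prev.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(shared + ":" + term.substring(shared));
            prev = term;
        }
        return out;
    }

    // Unary code for n >= 1: (n - 1) one-bits followed by a single zero-bit.
    static String unary(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < n; i++) {
            sb.append('1');
        }
        return sb.append('0').toString();
    }

    // Encode ascending document numbers as unary-coded gaps between neighbours.
    static String encodeGaps(int[] ascendingDocs) {
        StringBuilder sb = new StringBuilder();
        int prev = 0;
        for (int doc : ascendingDocs) {
            sb.append(unary(doc - prev));
            prev = doc;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints [0:search, 6:er, 3:t]
        System.out.println(frontCode(Arrays.asList("search", "searcher", "seat")));
        // gaps are 3, 1, 3, so this prints 1100110
        System.out.println(encodeGaps(new int[] {3, 4, 7}));
    }
}
```

Unary codes are short only when the gaps are small, so this form suits terms that occur in most documents.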

2014-05-03 12:52