A Simple Implementation of an Inverted Index

黎明lm

Blog category: nutch

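Let's start with an example.

Suppose there are three articles, file1, file2, and file3, with the following contents:

file1 (word1, word2, word3, word4, ...)

file2 (worda, wordb, wordc, wordd, ...)

file3 (word1, worda, word3, wordd, ...)

The inverted index built from them looks like this:

word1 (file1, file3)

word2 (file1)

word3 (file1, file3)

worda (file2, file3)

....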
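The mapping in this example can be sketched in a few lines of Java. This is only a minimal illustration under stated assumptions: the class name `DocIndexSketch` and the helper `build` are hypothetical, and the strings `word1`, `worda`, and so on stand in for the placeholder words of the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class DocIndexSketch {

	// Hypothetical helper: builds word -> sorted set of file names.
	static Map<String, Set<String>> build(Map<String, List<String>> docs) {
		Map<String, Set<String>> index = new TreeMap<>();
		for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
			for (String word : doc.getValue()) {
				// Create the posting set on first sight of the word, then add the file.
				index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
			}
		}
		return index;
	}

	public static void main(String[] args) {
		// Sample data mirroring the three files of the example (assumed contents).
		Map<String, List<String>> docs = new LinkedHashMap<>();
		docs.put("file1", Arrays.asList("word1", "word2", "word3", "word4"));
		docs.put("file2", Arrays.asList("worda", "wordb", "wordc", "wordd"));
		docs.put("file3", Arrays.asList("word1", "worda", "word3", "wordd"));

		// Prints one line per word, e.g. "word1 [file1, file3]".
		build(docs).forEach((word, files) -> System.out.println(word + " " + files));
	}
}
```

A `TreeMap` and `TreeSet` are used here only so that the printed order is stable; a `HashMap` and `HashSet` would work equally well for lookups.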

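The idea behind an inverted index is simple: use the words in the files as keys, and map each word to the files that contain it. You can also store extra information, such as how many times the word occurs in each file. The inverted index is a fundamental concept in search engines, and almost every search engine uses one.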
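To illustrate the point about storing occurrence counts, here is a minimal sketch in which each posting records how often a word occurs in a file. The class `FrequencyIndexSketch`, its methods `add` and `frequency`, and the sample sentences are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

public class FrequencyIndexSketch {

	// word -> (file name -> occurrences of the word in that file)
	private final Map<String, Map<String, Integer>> index = new TreeMap<>();

	// Adds every whitespace-separated word of text under fileName.
	public void add(String fileName, String text) {
		for (String word : text.trim().split("\\s+")) {
			if (word.isEmpty()) {
				continue; // empty input produces a single empty token
			}
			index.computeIfAbsent(word, k -> new TreeMap<>())
					.merge(fileName, 1, Integer::sum);
		}
	}

	// Returns how often word occurs in fileName, or 0 if it never does.
	public int frequency(String word, String fileName) {
		Map<String, Integer> postings = index.get(word);
		return postings == null ? 0 : postings.getOrDefault(fileName, 0);
	}

	public static void main(String[] args) {
		FrequencyIndexSketch idx = new FrequencyIndexSketch();
		idx.add("file1", "to be or not to be");
		idx.add("file2", "to do");
		System.out.println(idx.frequency("to", "file1")); // 2
		System.out.println(idx.frequency("to", "file2")); // 1
		System.out.println(idx.frequency("be", "file2")); // 0
	}
}
```

Nesting a second map inside the first keeps the lookup for a (word, file) pair to two hash or tree lookups, at the cost of one small map per distinct word.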

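Below is my simple implementation of an inverted index. Here each line of the input file plays the role of a document: given a query string, the program looks up every word in it and reports the line numbers on which that word appears, together with the number of occurrences on each line.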

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndex {

	// word -> (line number -> occurrences of the word on that line)
	private final Map<String, Map<Integer, Integer>> index = new HashMap<>();

	// Builds the index from a text file; the file is assumed to be UTF-8.
	public void createIndex(String filePath) {
		index.clear();
		// try-with-resources closes the reader even if reading fails
		try (BufferedReader reader = Files.newBufferedReader(Paths.get(filePath), StandardCharsets.UTF_8)) {
			String text;
			int line = 1;
			while ((text = reader.readLine()) != null) {
				// split on runs of whitespace so repeated spaces do not yield empty words
				for (String word : text.trim().split("\\s+")) {
					if (word.isEmpty()) {
						continue; // blank line
					}
					// TreeMap keeps the line numbers sorted for predictable output
					index.computeIfAbsent(word, k -> new TreeMap<>()).merge(line, 1, Integer::sum);
				}
				line++;
			}
		} catch (IOException e) {
			System.err.println("error reading file " + filePath + ": " + e.getMessage());
		}
	}

	// Prints, for each word of the query, the lines it occurs on and the count per line.
	public void find(String str) {
		for (String word : str.trim().split("\\s+")) {
			StringBuilder sb = new StringBuilder();
			Map<Integer, Integer> counts = index.get(word);
			if (counts != null) {
				sb.append("word: ").append(word).append(" in ");
				for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
					sb.append("line ").append(e.getKey()).append(" [").append(e.getValue()).append("] , ");
				}
			} else {
				sb.append("word: ").append(word).append(" not found");
			}
			System.out.println(sb);
		}
	}

	public static void main(String[] args) {
		InvertedIndex index = new InvertedIndex();
		index.createIndex("news.txt");
		index.find("I love Shanghai today");
	}
}

The input file news.txt contains:

I am eriol
I live in Shanghai and I love Shanghai
I also love travelling
life in Shanghai
is beautiful

The output is:

word: I in line 1 [1] , line 2 [2] , line 3 [1] ,
word: love in line 2 [1] , line 3 [1] ,
word: Shanghai in line 2 [2] , line 4 [1] ,
word: today not found

2012-01-05 16:24