Introduction to LDA in Mahout, with Examples

sharp-fcc

Blog category: Models

Tags: mahout, LDA, cvb, topic model

Translated from: https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation

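Introduction:

Latent Dirichlet Allocation (Blei et al., 2003) is a powerful learning method that clusters words into topics and represents each document as a mixture of those topics.

A topic model is a hierarchical Bayesian model that associates each document with a probability distribution over topics, where each topic is in turn a probability distribution over words. For example, a topic might contain words such as "sports", "baseball", and "home run", and a document about the use of banned drugs in baseball might draw on the topics "sports", "baseball", and "banned drugs". Note that these topic labels are assigned by a human after the fact; the algorithm itself only associates words with probabilities. The goal of parameter estimation in the model is to learn both what the topics are and with what probability each document uses each topic.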
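The two layers described above (documents as distributions over topics, topics as distributions over words) can be illustrated with a minimal Python sketch. The topic names, words, and probabilities below are made-up toy values, and the helper functions are hypothetical illustrations, not part of Mahout.

```python
import random

# Hypothetical toy topics: each topic is a probability distribution over words.
topics = {
    "topic_0": {"sports": 0.4, "baseball": 0.4, "home run": 0.2},
    "topic_1": {"drugs": 0.6, "steroids": 0.3, "sports": 0.1},
}

# Hypothetical document: a probability distribution over topics.
doc_topic = {"topic_0": 0.7, "topic_1": 0.3}


def sample(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate_document(doc_topic, topics, length, rng):
    """For every word position, pick a topic, then pick a word from that topic."""
    words = []
    for _ in range(length):
        topic = sample(doc_topic, rng)
        words.append(sample(topics[topic], rng))
    return words


def word_probability(word, doc_topic, topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(p_t * topics[t].get(word, 0.0) for t, p_t in doc_topic.items())


rng = random.Random(0)
print(generate_document(doc_topic, topics, 8, rng))
print(word_probability("sports", doc_topic, topics))
```

Training runs in the opposite direction: only the words are observed, and both `topics` and `doc_topic` have to be estimated from them.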

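Another way to understand a topic model is as a generalization of a mixture model such as Dirichlet Process Clustering. Start from an ordinary mixture model, in which there is a single global mixture of several distributions. We then say that each document has its own distribution over those globally shared components. In Dirichlet Process Clustering, each document has one latent variable, drawn from the global mixture, that determines which model the document belongs to. In LDA, each word in a document has its own latent variable, drawn from that document's mixture.

The idea is to explain the observed data with a probabilistic mixture of several models. Each observed data point is assumed to come from one of the models in the mixture, but we do not know which one, so we introduce a so-called latent variable that indicates which model the data point came from.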
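To make the role of the latent variable concrete, here is a small Python sketch that applies Bayes' rule to ask which mixture component most plausibly produced one observed word. The component names and probabilities are invented for illustration, and `posterior_over_components` is a hypothetical helper, not a Mahout function.

```python
def posterior_over_components(word, mixture_weights, components):
    """p(z | word) is proportional to p(z) * p(word | z) (Bayes' rule)."""
    joint = {
        z: mixture_weights[z] * components[z].get(word, 0.0)
        for z in mixture_weights
    }
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError(f"word {word!r} has zero probability under every component")
    return {z: p / total for z, p in joint.items()}


# Hypothetical two-component mixture over a tiny vocabulary.
mixture_weights = {"A": 0.5, "B": 0.5}
components = {
    "A": {"baseball": 0.8, "drugs": 0.2},
    "B": {"baseball": 0.2, "drugs": 0.8},
}

print(posterior_over_components("baseball", mixture_weights, components))
```

In a plain mixture model there would be one such latent variable per document; in LDA there is one per word, with the mixture weights taken from the document the word belongs to.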

Collapsed Variational Bayes

The CVB algorithm used in Mahout's LDA implementation combines variational Bayes and Gibbs sampling.

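Usage:

Mahout's LDA implementation operates on sparse term-frequency vectors. The term frequencies must be non-negative, because negative values have no meaning in the probabilistic model. Make sure to use TF, not TF-IDF, as the term weight.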
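As an illustration of the expected input shape, the following Python sketch builds sparse raw term-frequency vectors from already tokenized documents and rejects negative weights. It only mirrors the idea; the function names are hypothetical, and it does not produce Mahout's on-disk vector format.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Assign each distinct term an integer id, in order of first appearance."""
    dictionary = {}
    for tokens in tokenized_docs:
        for term in tokens:
            dictionary.setdefault(term, len(dictionary))
    return dictionary


def to_tf_vector(tokens, dictionary):
    """Return a sparse {term_id: count} vector of raw term frequencies."""
    counts = Counter(dictionary[t] for t in tokens if t in dictionary)
    return dict(counts)


def check_vector(vector):
    """Reject weights that the LDA model cannot interpret as counts."""
    for term_id, weight in vector.items():
        if weight < 0:
            raise ValueError(f"negative weight {weight} for term id {term_id}")


docs = [["sports", "baseball", "sports"], ["drugs", "sports"]]
dictionary = build_dictionary(docs)
vectors = [to_tf_vector(d, dictionary) for d in docs]
for v in vectors:
    check_vector(v)
print(dictionary)
print(vectors)
```

In this sketch, `len(dictionary)` plays the role of the number of unique features in the input vectors.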

It is invoked as follows:

bin/mahout cvb \
    -i <input path for document vectors> \
    -dict <path to term-dictionary file(s) , glob expression supported> \
    -o <output path for topic-term distributions> \
    -dt <output path for doc-topic distributions> \
    -k <number of latent topics> \
    -nt <number of unique features defined by input document vectors> \
    -mt <path to store model state after each iteration> \
    -maxIter <max number of iterations> \
    -mipd <max number of iterations per doc for learning> \
    -a <smoothing for doc topic distributions> \
    -e <smoothing for term topic distributions> \
    -seed <random seed> \
    -tf <fraction of data to hold for testing> \
    -block <number of iterations per perplexity check, ignored unless test_set_percentage>0>

When choosing the number of topics, it is advisable to try several values.
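One common way to compare runs with different numbers of topics is perplexity on held-out documents, which is what the -tf and -block options above relate to. The Python sketch below shows the standard textbook definition on toy inputs. It is an assumption that this matches Mahout's own computation in detail, and the `perplexity` function and its arguments are hypothetical names.

```python
import math


def perplexity(tf_vectors, doc_topic, topic_term):
    """exp(-(sum of count * log p(term | doc)) / (total token count))."""
    log_likelihood = 0.0
    total_tokens = 0
    for doc_id, vector in enumerate(tf_vectors):
        for term_id, count in vector.items():
            # p(term | doc) = sum over topics of p(topic | doc) * p(term | topic)
            p = sum(
                doc_topic[doc_id][k] * topic_term[k][term_id]
                for k in range(len(topic_term))
            )
            log_likelihood += count * math.log(p)
            total_tokens += count
    return math.exp(-log_likelihood / total_tokens)


# Toy model: 2 topics, 4 terms, every term equally likely in every topic.
topic_term = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
doc_topic = [[0.5, 0.5]]
held_out = [{0: 2, 3: 1}]  # sparse term-frequency vector for one document

print(perplexity(held_out, doc_topic, topic_term))
```

Lower perplexity means the model assigns higher probability to the held-out words, so among several candidate values of k the one with the lowest held-out perplexity is a reasonable starting point.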

After running LDA, you can print the results with this utility:

bin/mahout ldatopics \
    -i <input vectors directory> \
    -d <input dictionary file> \
    -w <optional number of words to print> \
    -o <optional output working directory. Default is to console> \
    -h <print out help> \
    -dt <optional dictionary type (text|sequencefile). Default is text>
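Conceptually, printing the representative words of a topic amounts to sorting that topic's term weights and looking the top term ids up in the dictionary. The Python sketch below shows this on in-memory toy data. It does not read Mahout's output files, and `top_words` is a hypothetical helper.

```python
def top_words(topic_term, id_to_term, n):
    """For each topic, return the n (term, weight) pairs with the highest weight."""
    result = []
    for weights in topic_term:
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([(id_to_term[i], weights[i]) for i in ranked[:n]])
    return result


# Toy topic-term distributions (one row per topic) and a toy dictionary.
topic_term = [
    [0.1, 0.6, 0.3],
    [0.5, 0.1, 0.4],
]
id_to_term = {0: "sports", 1: "baseball", 2: "drugs"}

for topic_id, words in enumerate(top_words(topic_term, id_to_term, 2)):
    print(topic_id, words)
```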

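Example:

A detailed example script is available at mahout/examples/bin/build-reuters.sh. The script automatically downloads the dataset, builds a Lucene index, and converts the Lucene index into vectors. Uncomment the last two lines to have it run LDA on the vectors and print the results.

To adapt the example to your own data, you need to build your own Lucene index, which requires an adapter. The rest should be much the same.

Parameter estimation:

The EM algorithm is used.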


2013-11-18 13:07

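Comment #1, chenbaiyang12csdn, 2016-04-12

Hello, could you describe in more detail the part about printing the results with the utility after running LDA? I have computed the document->topic probabilities and the topic->word probabilities on my own news corpus. How do I get the representative words of each topic?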
