Topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. For a general introduction to topic modeling, see, for example, Probabilistic Topic Models by Steyvers and Griffiths (2007).
For an example showing how to use the Java API to import data, train models, and infer topics for new documents, see the
topic model developer's guide.
The MALLET topic model package includes an extremely fast and highly scalable implementation of Gibbs sampling, efficient methods for document-topic hyperparameter optimization, and tools for inferring topics for new documents given trained models.
Importing Documents
Once MALLET has been downloaded and installed, the next step is to import text files into MALLET's internal format. The following instructions assume that the documents to be used as input to the topic model are in separate files, in a directory that contains no other files. See the introduction to
importing data in MALLET for more information and other import methods.
Change to the MALLET directory and run the command
bin/mallet import-dir --input /data/topic-input --output topic-input.mallet \
--keep-sequence --remove-stopwords
To learn more about options for the import-dir command, use the --help argument.
Building Topic Models
Once you have imported documents into MALLET format, you can use the train-topics command to build a topic model, for example:
bin/mallet train-topics --input topic-input.mallet \
--num-topics 100 --output-state topic-state.gz
Use the option --help to get a complete list of options for the train-topics command. Commonly used options include:
--input [FILE] Use this option to specify the MALLET collection file you created in the previous step.
--num-topics [NUMBER] The number of topics to use. The best number depends on what you are looking for in the model. The default (10) will provide a broad overview of the contents of the corpus. The number of topics should depend to some degree on the size of the collection, but 200 to 400 will produce reasonably fine-grained results.
--num-iterations [NUMBER] The number of sampling iterations is a trade-off between the time taken to complete sampling and the quality of the topic model.
Hyperparameter Optimization
--optimize-interval [NUMBER] This option turns on hyperparameter optimization, which allows the model to better fit the data by allowing some topics to be more prominent than others. Optimization every 10 iterations is reasonable.
--optimize-burn-in [NUMBER] The number of iterations before hyperparameter optimization begins. Default is twice the optimize interval.
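Putting these two options together, a training run with hyperparameter optimization might look like the following; the iteration counts here are illustrative choices, not recommendations:

```shell
# Train 100 topics, re-optimizing the hyperparameters every 10
# iterations after a 200-iteration burn-in period.
bin/mallet train-topics --input topic-input.mallet \
  --num-topics 100 --num-iterations 1000 \
  --optimize-interval 10 --optimize-burn-in 200 \
  --output-state topic-state.gz
```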
Model Output
--output-model [FILENAME] This option specifies a file to write a serialized MALLET topic trainer object. This type of output is appropriate for pausing and restarting training, but does not produce data that can easily be analyzed.
--output-state [FILENAME] Similar to output-model, this option outputs a compressed text file containing the words in the corpus with their topic assignments. This file format can easily be parsed and used by non-Java-based software. Note that the state file will be GZipped, so it is helpful to provide a filename that ends in .gz.
--output-doc-topics [FILENAME] This option specifies a file to write the topic composition of documents. See the --help options for parameters related to this file.
--output-topic-keys [FILENAME] This file contains a "key" consisting of the top k words for each topic (where k is defined by the --num-top-words option). This output is useful both for checking that the model is working and for displaying results of the model. In addition, this file reports the Dirichlet parameter of each topic. If hyperparameter optimization is turned on, this number will be roughly proportional to the overall portion of the collection assigned to a given topic.
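A single training run can write all of these outputs at once. As a sketch, with illustrative filenames:

```shell
bin/mallet train-topics --input topic-input.mallet \
  --num-topics 100 --optimize-interval 10 \
  --output-state topic-state.gz \
  --output-model topic-model.mallet \
  --output-doc-topics doc-topics.txt \
  --output-topic-keys topic-keys.txt --num-top-words 20
```

The topic-keys file will then contain one line per topic: the topic number, its Dirichlet parameter, and the 20 highest-probability words for that topic.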
Topic Inference
--inferencer-filename [FILENAME] Create a topic inference tool based on the current, trained model. Use the MALLET command bin/mallet infer-topics --help to get information on using topic inference.
Note that you must make sure that the new data is compatible with your training data. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.
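A minimal end-to-end sketch of this workflow, assuming the new documents live in a hypothetical directory /data/new-docs and that the model was trained on topic-input.mallet (confirm the exact flag names with bin/mallet infer-topics --help):

```shell
# 1. Train a model and save an inferencer alongside it.
bin/mallet train-topics --input topic-input.mallet \
  --num-topics 100 --inferencer-filename topic-inferencer.mallet

# 2. Import the new documents through the SAME pipe as the
#    training data, so the vocabularies are compatible.
bin/mallet import-dir --input /data/new-docs \
  --output new-docs.mallet --keep-sequence \
  --use-pipe-from topic-input.mallet

# 3. Infer topic proportions for the new documents.
bin/mallet infer-topics --inferencer topic-inferencer.mallet \
  --input new-docs.mallet --output-doc-topics new-doc-topics.txt
```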
Topic Held-out Probability
--evaluator-filename [FILENAME] The previous section describes how to get topic proportions for new documents. We often want to estimate the log probability of new documents, marginalized over all topic configurations. Use the MALLET command bin/mallet evaluate-topics --help to get information on using held-out probability estimation.
As with topic inference, you must make sure that the new data is compatible with your training data. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.
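The same pattern works for held-out evaluation. A sketch, with a hypothetical held-out directory /data/heldout-docs (confirm the exact flag names with bin/mallet evaluate-topics --help):

```shell
# 1. Save an evaluator object while training.
bin/mallet train-topics --input topic-input.mallet \
  --num-topics 100 --evaluator-filename topic-evaluator.mallet

# 2. Import held-out documents with the training pipe.
bin/mallet import-dir --input /data/heldout-docs \
  --output heldout.mallet --keep-sequence \
  --use-pipe-from topic-input.mallet

# 3. Estimate marginal log probabilities of the held-out documents.
bin/mallet evaluate-topics --evaluator topic-evaluator.mallet \
  --input heldout.mallet --output-doc-probs doc-probs.txt \
  --output-prob total-prob.txt
```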