Dirichlet clustering starts with a data set of points and a ModelDistribution. Think of ModelDistribution as a class that generates different models. You create an empty model and try to assign points to it. Each time a point is assigned, the model crudely grows or shrinks its parameters to try to fit the data. Once this has been done for all points, the parameters of the model are re-estimated precisely, using all the points and the partial probability of each point belonging to the model.
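The pass described above can be sketched in a few lines. This is a simplified, one-dimensional illustration and not Mahout's implementation: the function names (`one_pass`, `sample_dirichlet`, `gaussian_pdf`), the Gaussian model, and the `alpha0` smoothing are assumptions made for the example, and the re-estimation step uses only the points sampled into each model rather than partial probabilities.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sample_dirichlet(alphas, rng):
    """Draw mixture weights from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def one_pass(points, models, weights, alpha0, rng):
    """One pass: sample an assignment per point, then re-estimate.

    models is a list of hypothetical (mean, std) pairs; weights are
    the current mixture probabilities of the models.
    """
    k = len(models)
    assigned = [[] for _ in range(k)]
    assignments = []
    for x in points:
        # Score each model by mixture weight times likelihood of x.
        scores = [w * gaussian_pdf(x, m, s)
                  for w, (m, s) in zip(weights, models)]
        if sum(scores) <= 0.0:
            scores = [1.0] * k  # numeric underflow: fall back to uniform
        j = rng.choices(range(k), weights=scores)[0]
        assigned[j].append(x)
        assignments.append(j)

    # Re-estimate each model from the points sampled into it.
    new_models = []
    for (mean, std), xs in zip(models, assigned):
        if len(xs) >= 2:
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            std = max(math.sqrt(var), 1e-3)
        elif len(xs) == 1:
            mean = xs[0]
        new_models.append((mean, std))

    # New mixture weights: Dirichlet over counts plus a small prior.
    alphas = [alpha0 / k + len(xs) for xs in assigned]
    return new_models, sample_dirichlet(alphas, rng), assignments


rng = random.Random(0)
points = ([rng.gauss(0.0, 1.0) for _ in range(50)]
          + [rng.gauss(10.0, 1.0) for _ in range(50)])
models = [(rng.uniform(-5.0, 15.0), 5.0) for _ in range(4)]
weights = [0.25] * 4
samples = []
for _ in range(10):
    models, weights, assignments = one_pass(points, models, weights, 1.0, rng)
    samples.append(assignments)
```

Each entry of `samples` records which model every point was assigned to in that pass. Starting with more models than the data needs is deliberate: models that attract no points keep a small weight and tend to stay empty.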
At the end of each pass, you get a sample that contains the probabilities, the models, and the assignment of points to models. These samples can be regarded as clusters, and they provide information about the models and their parameters, such as their shape and size. Moreover, by counting the models in each sample that have some points assigned to them, you can find out how many models (clusters) the data supports. Also, by examining how often two points are assigned to the same model, you can get an approximate measure of how likely these points are to be explained by the same model. Such soft-membership information is a side product of model-based clustering: Dirichlet clustering captures the partial probabilities of points belonging to various models.
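Both statistics mentioned above are easy to compute once the per-pass assignments are available. The sketch below assumes each sample is simply a list giving the model index of every point; that representation, the function names, and the toy data are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def occupied_model_counts(samples):
    """Number of models with at least one point, per sample."""
    return [len(set(assignment)) for assignment in samples]


def co_assignment_frequency(samples):
    """Fraction of samples in which each pair of points shares a model."""
    n_points = len(samples[0])
    freq = {}
    for i, j in combinations(range(n_points), 2):
        same = sum(1 for a in samples if a[i] == a[j])
        freq[(i, j)] = same / len(samples)
    return freq


# Three toy samples over four points; values are model indices.
samples = [
    [0, 0, 1, 1],
    [0, 0, 2, 1],
    [3, 3, 1, 1],
]

print(occupied_model_counts(samples))            # [2, 3, 2]
print(co_assignment_frequency(samples)[(0, 1)])  # 1.0
```

Here points 0 and 1 share a model in every sample, while points 2 and 3 share one in two of the three samples, so the first pair is the more strongly supported grouping.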
Dirichlet clustering is a powerful way of getting quality clusters using known data distribution models. In Mahout, the algorithm is a pluggable framework, so different models can be created and tested. As the models become more complex, there's a chance of things slowing down on huge data sets, and at that point you'll have to fall back on other clustering algorithms. But after seeing the output of Dirichlet clustering, you can clearly decide whether the algorithm you choose should be fuzzy or rigid, overlapping or hierarchical, whether the distance measure should be Manhattan or cosine, and what the threshold for convergence should be. Dirichlet clustering is both a data-understanding tool and a great data clustering tool.
bin/mahout dirichlet -i mahout/reuters-vectors/tfidf-vectors -o mahout/reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0 -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution -mp org.apache.mahout.math.SequentialAccessSparseVector