A `sequential` option decides whether the test runs locally or as a MapReduce job; here we only cover the MapReduce path. The source code is as follows:
```java
private boolean runMapReduce(Map<String, List<String>> parsedArgs)
    throws IOException, InterruptedException, ClassNotFoundException {
  Path model = new Path(getOption("model"));
  HadoopUtil.cacheFiles(model, getConf());
  // the output key is the expected value, the output value are the scores for all the labels
  Job testJob = prepareJob(getInputPath(), getOutputPath(),
      SequenceFileInputFormat.class, BayesTestMapper.class,
      Text.class, VectorWritable.class, SequenceFileOutputFormat.class);
  boolean complementary = parsedArgs.containsKey("testComplementary");
  testJob.getConfiguration().set(COMPLEMENTARY, String.valueOf(complementary));
  boolean succeeded = testJob.waitForCompletion(true);
  return succeeded;
}
```
First the trained model is read from the "model" path and pushed to the distributed cache; instantiating it simply reads back the vectors that were written out during training. testJob uses only a map phase, shown below:
```java
@Override
protected void map(Text key, VectorWritable value, Context context)
    throws IOException, InterruptedException {
  Vector result = classifier.classifyFull(value.get());
  // the key is the expected value
  context.write(new Text(key.toString().split("/")[1]), new VectorWritable(result));
}
```
The output key is the category label as text, and the value is the input vector's score for every class. classifier.classifyFull() computes the input vector's score for each label:
getScoreForLabelInstance, which it calls, sums the per-feature scores for the given label.
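The original snippets for classifyFull and getScoreForLabelInstance did not survive extraction. As a sketch of their logic (not Mahout's exact source, which varies by version): classifyFull loops over every label, and getScoreForLabelInstance accumulates featureValue × getScoreForLabelFeature(label, feature) over the instance's non-zero features. A self-contained illustration with plain arrays and made-up training weights (class name and numbers are ours):

```java
import java.util.Arrays;

public class NaiveBayesScoringSketch {
    // weight[label][feature]: per-label feature weights accumulated in training (made up)
    static final double[][] weight = {{2.0, 1.0}, {1.0, 3.0}};
    static final double[] labelWeight = {3.0, 4.0};  // row sums of weight
    static final double alphaI = 1.0;                // smoothing parameter
    static final int numFeatures = 2;

    // standard-Bayes per-feature score: log((Wi + alphaI) / (sumWi + alphaI * N))
    static double getScoreForLabelFeature(int label, int feature) {
        return Math.log((weight[label][feature] + alphaI)
                / (labelWeight[label] + alphaI * numFeatures));
    }

    // sum of per-feature scores for one label, weighted by the feature values
    static double getScoreForLabelInstance(int label, double[] instance) {
        double result = 0.0;
        for (int f = 0; f < instance.length; f++) {
            if (instance[f] != 0.0) {
                result += instance[f] * getScoreForLabelFeature(label, f);
            }
        }
        return result;
    }

    // one score per label, as in classifyFull
    static double[] classifyFull(double[] instance) {
        double[] score = new double[weight.length];
        for (int label = 0; label < weight.length; label++) {
            score[label] = getScoreForLabelInstance(label, instance);
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(classifyFull(new double[]{1.0, 1.0})));
    }
}
```

With these toy weights, label 0 scores log(3/5) + log(2/5) and label 1 scores log(2/6) + log(4/6), so label 0 wins.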
getScoreForLabelFeature can be computed in two ways.

1. Standard Bayes: log[(Wi + alphaI) / (ΣWi + alphaI·N)], where Wi is the weight of feature i under this label, ΣWi is the label's total weight, alphaI is the smoothing parameter, and N is the number of features.
```java
public double getScoreForLabelFeature(int label, int feature) {
  NaiveBayesModel model = getModel();
  return computeWeight(model.weight(label, feature), model.labelWeight(label),
      model.alphaI(), model.numFeatures());
}

public static double computeWeight(double featureLabelWeight, double labelWeight,
    double alphaI, double numFeatures) {
  double numerator = featureLabelWeight + alphaI;
  double denominator = labelWeight + alphaI * numFeatures;
  return Math.log(numerator / denominator);
}
```
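To make the standard-Bayes formula concrete, here is a self-contained copy of computeWeight run on made-up numbers: feature weight Wi = 2 under this label, label total ΣWi = 3, alphaI = 1, N = 2 features, giving log((2+1)/(3+2)) = log(3/5).

```java
public class StandardWeightExample {
    // same arithmetic as the computeWeight shown above
    static double computeWeight(double featureLabelWeight, double labelWeight,
                                double alphaI, double numFeatures) {
        double numerator = featureLabelWeight + alphaI;
        double denominator = labelWeight + alphaI * numFeatures;
        return Math.log(numerator / denominator);
    }

    public static void main(String[] args) {
        // Wi = 2, sumWi = 3, alphaI = 1, N = 2  ->  log(3/5) ≈ -0.5108
        System.out.println(computeWeight(2.0, 3.0, 1.0, 2.0));
    }
}
```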
2. Complementary Bayes, which instead scores the feature by the weight it carries in every class other than this one: -log[(Fi - Wi + alphaI) / (ΣW - ΣWi + alphaI·N)], where Fi is the feature's total weight across all labels and ΣW is the total weight sum.
```java
// complementary bayes
public double getScoreForLabelFeature(int label, int feature) {
  NaiveBayesModel model = getModel();
  return computeWeight(model.featureWeight(feature), model.weight(label, feature),
      model.totalWeightSum(), model.labelWeight(label),
      model.alphaI(), model.numFeatures());
}

public static double computeWeight(double featureWeight, double featureLabelWeight,
    double totalWeight, double labelWeight, double alphaI, double numFeatures) {
  double numerator = featureWeight - featureLabelWeight + alphaI;
  double denominator = totalWeight - labelWeight + alphaI * numFeatures;
  return -Math.log(numerator / denominator);
}
```
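And a worked example of the complementary weight, again with made-up numbers: the feature has total weight 3 across all labels, of which 2 came from this label; the total weight sum is 7 and this label's weight is 3. Removing this label leaves (3-2+1)/(7-3+2) = 2/6, so the score is -log(1/3) = log(3).

```java
public class ComplementaryWeightExample {
    // same arithmetic as the complementary computeWeight shown above
    static double computeWeight(double featureWeight, double featureLabelWeight,
                                double totalWeight, double labelWeight,
                                double alphaI, double numFeatures) {
        double numerator = featureWeight - featureLabelWeight + alphaI;
        double denominator = totalWeight - labelWeight + alphaI * numFeatures;
        return -Math.log(numerator / denominator);
    }

    public static void main(String[] args) {
        // Fi = 3, Wi = 2, totalWeight = 7, labelWeight = 3, alphaI = 1, N = 2
        // -> -log(2/6) = log(3) ≈ 1.0986
        System.out.println(computeWeight(3.0, 2.0, 7.0, 3.0, 1.0, 2.0));
    }
}
```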
Finally comes the analyze step: for each key, the index of the maximum entry in the score vector is taken as the predicted label and compared with the true label's index, which produces the confusion matrix.
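The analyze code itself is not shown in this post; a minimal sketch of the idea (names are ours, not Mahout's ConfusionMatrix API): take the argmax of each score vector as the prediction and increment the matching cell of an [actual][predicted] matrix.

```java
import java.util.Arrays;

public class ConfusionMatrixSketch {
    // index of the largest score = predicted label
    static int argmax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        int numLabels = 2;
        int[][] confusion = new int[numLabels][numLabels]; // [actual][predicted]
        int[] actual = {0, 0, 1, 1};                       // true labels (made up)
        double[][] scores = {                              // score vectors (made up)
            {-1.0, -2.0}, {-3.0, -1.5}, {-2.0, -0.5}, {-0.9, -1.0}};
        for (int i = 0; i < actual.length; i++) {
            confusion[actual[i]][argmax(scores[i])]++;
        }
        System.out.println(Arrays.deepToString(confusion)); // prints [[1, 1], [1, 1]]
    }
}
```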
http://hnote.org/big-data/mahout/mahout-testnaivebayesdriver-testnb