Reposted from:
http://issues.apache.org/jira/browse/MAHOUT-180
1. A Hadoop version of the Lanczos algorithm for performing SVD on sparse matrices. It performs well on sparse input.
2. The primary work in parallelized Lanczos is the parallelized multiplication of your input matrix (or its square) by vectors.
The input matrix lives in HDFS, and the Lanczos SVD method simply leaves it there (that is, the matrix stays in distributed storage with no additional data transfer) and sends one vector at a time to perform the parallelized matrix*vector product.
In short, the main work is the matrix*vector product, which is sometimes (the square of the matrix)*vector: M^T M * vector.
The implementation also avoids squaring the input matrix when it is symmetric: a symmetric matrix is used as is, and a non-symmetric one is squared first.
3. The author's unit tests show that Lanczos is working well.
4. Generate sparse vectors from the sequence files with SparseVectorsFromSequenceFiles:
$HADOOP_HOME/bin/hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.text.SparseVectorsFromSequenceFiles -i text_path -o corpus_as_vectors_path -seq true -w tfidf -chunk 1000 --minSupport 1 --minDF 5 --maxDFPercent 50 --norm 2
Then run the distributed Lanczos solver to compute the singular values:
$HADOOP_HOME/bin/hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver -i corpus_as_vectors_path -o corpus_svd_path -nr 1 -nc <numFeatures> --rank 100
Read the post that contains this carefully, especially the later part on what desiredRank means.
5. EigenVerificationJob can remove the bad eigenvalues.
6. Multiplication of a matrix (or the square of a matrix) by a vector is the primary operation of Lanczos, and each multiplication is done in one M/R iteration.
If you want the top-k singular vectors, you make k passes over the data.
7. The code seems to be working fine and indeed produces the right number of dense (eigen?) vectors.
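
To make notes 2 and 6 concrete, here is a minimal single-machine sketch in Python. It is not Mahout's code: the names times, times_squared and lanczos are hypothetical, the sparse-row layout (one dict per row) is an assumption, and the full reorthogonalization step is a choice made for this sketch. It shows that (M^T M)*vector can be accumulated row by row without ever forming M^T M, and that each Lanczos step needs exactly one such product, that is, one pass over the matrix. The eigenvalues of M^T M are the squares of the singular values of M.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def times(rows, v):
    """M v for a square symmetric M stored as sparse rows {column: value}."""
    return [sum(val * v[j] for j, val in row.items()) for row in rows]


def times_squared(rows, v):
    """(M^T M) v computed row by row, without ever forming M^T M."""
    out = [0.0] * len(v)
    for row in rows:                    # rows are independent: the "map" side
        r_dot_v = sum(val * v[j] for j, val in row.items())
        for j, val in row.items():      # summing partial vectors: the "reduce" side
            out[j] += r_dot_v * val
    return out


def lanczos(matvec, n, k, seed=0):
    """Run up to k Lanczos steps on a symmetric operator given only as matvec.

    Returns the diagonal (alphas) and off-diagonal (betas) of the
    tridiagonal matrix, plus the orthonormal basis vectors built so far.
    """
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(dot(q, q))
    basis = [[x / norm for x in q]]
    alphas, betas = [], []
    for _ in range(k):
        w = matvec(basis[-1])           # one full pass over the matrix
        alphas.append(dot(w, basis[-1]))
        for b in basis:                 # reorthogonalize against all earlier vectors
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        beta = math.sqrt(dot(w, w))
        if beta < 1e-10:                # Krylov subspace exhausted
            break
        betas.append(beta)
        basis.append([wi / beta for wi in w])
    return alphas, betas, basis


# Toy 4x3 non-symmetric matrix, so the squared product is used.
rows = [{0: 1.0, 2: 2.0}, {1: 3.0}, {0: 4.0, 1: 5.0}, {2: 6.0}]
alphas, betas, basis = lanczos(lambda v: times_squared(rows, v), n=3, k=3)
```

For a symmetric input, the same lanczos function can be called with times instead of times_squared, which matches the note that a symmetric matrix is not squared.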