[Mahout] A First Small Experiment: Evaluating a Recommender Model with GroupLens Data

RangerWolf

Blog categories: Data Mining, Mahout, Java

Note: this post is based on the book Mahout in Action.

Following Mahout in Action, this post uses the MovieLens 1M (ml-1m) data set provided by GroupLens to build and evaluate a recommender.

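The examples bundled with Mahout already include code that reads the .dat file. That class extends FileDataModel, so it can be used almost as is. To keep the computation manageable on my machine, however, I drop part of the data to reduce the amount of work.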

The only real change is a new removeRatio parameter: while the file is read, each record is dropped at random with a probability of removeRatio.
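The random drop can be tried in isolation. The following is a minimal sketch, not part of Mahout: the class name RandomDropSketch, the method keepRandomSubset, and the fixed seed are assumptions made for illustration (the modified data model in this post uses an unseeded Random).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: keeps each line with probability (1 - removeRatio).
final class RandomDropSketch {

  static List<String> keepRandomSubset(List<String> lines, double removeRatio, long seed) {
    Random rand = new Random(seed); // fixed seed (an assumption) so that runs are repeatable
    List<String> kept = new ArrayList<String>();
    for (String line : lines) {
      // nextDouble() is uniform in [0, 1), so this test passes with probability 1 - removeRatio
      if (rand.nextDouble() > removeRatio) {
        kept.add(line);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> lines = new ArrayList<String>();
    for (int i = 0; i < 10000; i++) {
      lines.add("record-" + i);
    }
    // With removeRatio = 0.5, roughly half of the 10000 records should survive
    System.out.println(keepRandomSubset(lines, 0.5, 42L).size());
  }
}
```

Because each record is decided independently, the number of surviving records is only approximately (1 - removeRatio) times the input size, and it changes between runs unless the seed is fixed.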

Here is my slightly modified GroupLensDataModel.java:

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.URL;
import java.util.Random;
import java.util.regex.Pattern;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.common.iterator.FileLineIterable;

import com.google.common.base.Charsets;
import com.google.common.io.Closeables;
import com.google.common.io.Files;
import com.google.common.io.InputSupplier;
import com.google.common.io.Resources;

public final class GroupLensDataModel extends FileDataModel {
  
  private static final String COLON_DELIMITER = "::";
  private static final Pattern COLON_DELIMITER_PATTERN = Pattern.compile(COLON_DELIMITER);

  /**
   * @param ratingsFile GroupLens ratings.dat file in its native format
   * @param removeRatio probability with which each record is dropped at random, to shrink the data set
   * @throws IOException if an error occurs while reading or writing files
   */
  public GroupLensDataModel(File ratingsFile, double removeRatio) throws IOException {
    super(convertGLFile(ratingsFile, removeRatio));
  }

  /**
   * Converts ratings.dat into the comma-separated format that FileDataModel reads.
   *
   * @param originalFile GroupLens ratings.dat file in its native format
   * @param ratio probability with which each record is dropped
   * @return the converted file, written to the system temp directory
   * @throws IOException if an error occurs while reading or writing files
   */
  private static File convertGLFile(File originalFile, double ratio) throws IOException {
    // Translate the file: drop the trailing field, then convert the "::" delimiters to commas
    File resultFile = new File(new File(System.getProperty("java.io.tmpdir")), "ratings.txt");

    if (resultFile.exists()) {
      resultFile.delete();
    }
    Writer writer = null;
    try {
      writer = new OutputStreamWriter(new FileOutputStream(resultFile), Charsets.UTF_8);
      Random rand = new Random();
      for (String line : new FileLineIterable(originalFile, false)) {
        // Keep a record only when the random draw exceeds the removal ratio
        if (rand.nextDouble() > ratio) {
          int lastDelimiterStart = line.lastIndexOf(COLON_DELIMITER);
          if (lastDelimiterStart < 0) {
            throw new IOException("Unexpected input format on line: " + line);
          }
          String subLine = line.substring(0, lastDelimiterStart);
          String convertedLine = COLON_DELIMITER_PATTERN.matcher(subLine).replaceAll(",");
          writer.write(convertedLine);
          writer.write('\n');
        }
      }
    } catch (IOException ioe) {
      resultFile.delete();
      throw ioe;
    } finally {
      Closeables.close(writer, false);
    }
    return resultFile;
  }

  public static File readResourceToTempFile(String resourceName) throws IOException {
    InputSupplier<? extends InputStream> inSupplier;
    try {
      URL resourceURL = Resources.getResource(GroupLensDataModel.class, resourceName);
      inSupplier = Resources.newInputStreamSupplier(resourceURL);
    } catch (IllegalArgumentException iae) {
      File resourceFile = new File("src/main/java" + resourceName);
      inSupplier = Files.newInputStreamSupplier(resourceFile);
    }
    File tempFile = File.createTempFile("taste", null);
    tempFile.deleteOnExit();
    Files.copy(inSupplier, tempFile);
    return tempFile;
  }

  @Override
  public String toString() {
    return "GroupLensDataModel";
  }
  
}
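To make the conversion step concrete, here is a small sketch of what happens to a single line. It mirrors the logic of convertGLFile but is a standalone illustration: the class name RatingLineSketch and the method toCsv are hypothetical, and the sample line assumes the UserID::MovieID::Rating::Timestamp layout of ratings.dat.

```java
import java.util.regex.Pattern;

// Illustrative sketch of the per-line conversion; class and method names are hypothetical.
final class RatingLineSketch {

  private static final String DELIMITER = "::";
  private static final Pattern DELIMITER_PATTERN = Pattern.compile(DELIMITER);

  static String toCsv(String line) {
    int lastDelimiterStart = line.lastIndexOf(DELIMITER);
    if (lastDelimiterStart < 0) {
      throw new IllegalArgumentException("Unexpected input format on line: " + line);
    }
    // Drop the last field (the timestamp), then replace the remaining "::" with ","
    String withoutTimestamp = line.substring(0, lastDelimiterStart);
    return DELIMITER_PATTERN.matcher(withoutTimestamp).replaceAll(",");
  }

  public static void main(String[] args) {
    // Sample line in the assumed UserID::MovieID::Rating::Timestamp layout
    System.out.println(toCsv("1::1193::5::978300760")); // prints 1,1193,5
  }
}
```

The result is the userID,itemID,preference form that FileDataModel can read directly.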

And here is the main program:

import java.io.File;
import java.io.IOException;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;


public class TestGroupLens {

	public static void main(String[] args) {
		// load data set
		try {
			DataModel model = new GroupLensDataModel(new File("E:\\DataSet\\ml-1m\\ratings.dat"), 0.5);
			RecommenderEvaluator evaluator = 
					new AverageAbsoluteDifferenceRecommenderEvaluator();
			RecommenderBuilder builder = new RecommenderBuilder() {
				@Override
				public Recommender buildRecommender(DataModel dataModel)
						throws TasteException {
					UserSimilarity sim = new PearsonCorrelationSimilarity(dataModel);
					UserNeighborhood nbh = new NearestNUserNeighborhood(30, sim, dataModel);
					// Build the recommender engine
					Recommender rec = new GenericUserBasedRecommender(dataModel, nbh, sim);
					return rec;
				}
			}; 
			double score = evaluator.evaluate(builder, null, model, 0.7, 0.3);
			System.out.println(score);
		} catch (IOException e) {
			e.printStackTrace();
		} catch (TasteException e) {
			e.printStackTrace();
		} 
		
	}
	
}

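The resulting score is around 0.85.

That is slightly off the 0.89 given in the book. Part of the gap probably comes from randomness: half of the ratings are dropped at random, and the evaluator also splits the data into training and test sets at random, so the score varies from run to run.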
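For context, the score comes from AverageAbsoluteDifferenceRecommenderEvaluator, which reports the average absolute difference between predicted and actual ratings, so lower is better. The sketch below shows that metric on made-up numbers. It is independent of Mahout's own implementation, and the class name, method name, and rating values are all assumptions for illustration.

```java
// Illustrative sketch of the average absolute difference metric; not Mahout's implementation.
final class AverageAbsoluteDifferenceSketch {

  static double averageAbsoluteDifference(double[] actual, double[] predicted) {
    if (actual.length != predicted.length || actual.length == 0) {
      throw new IllegalArgumentException("Arrays must be non-empty and of equal length");
    }
    double total = 0.0;
    for (int i = 0; i < actual.length; i++) {
      total += Math.abs(actual[i] - predicted[i]);
    }
    return total / actual.length;
  }

  public static void main(String[] args) {
    // Made-up ratings on a 1 to 5 scale
    double[] actual = {4.0, 3.0, 5.0, 2.0};
    double[] predicted = {3.5, 4.0, 4.0, 2.0};
    // (0.5 + 1.0 + 1.0 + 0.0) / 4 = 0.625
    System.out.println(averageAbsoluteDifference(actual, predicted)); // prints 0.625
  }
}
```

Read this way, a score of 0.85 means the predicted ratings were off by about 0.85 points on average on the held-out data.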

2014-07-06 15:29