wbj0110

浏览: 1556693 次
性别:
来自: 上海

最近访客更多访客>>

一往无前bhz

ninja2006

loginboot

u012363178

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

hive日志分析实战（二）

博客分类：

Hive

hive

需求

统计某游戏平台新用户渠道来源

日志格式如下：

Text代码  
Jul 23 0:00:47  [info] {SPR}gjzq{SPR}20130723000047{SPR}85493108{SPR}S1{SPR}{SPR}360wan-2j-reg{SPR}58.240.209.78{SPR}  

分析

问题的关键在于先找出新用户

新用户：仅在7月份登陆过平台的用户为新用户

依据map/reduce思想，可以按照如下方式找出新用户：

假如某用户在某月份出现过，则（qid，year，month）=1
按qid汇总该用户出现过的月数，即构建(qid,count(year,month))对
新用户的count（year，month）=1，且(year,month)=(2013,07).

找出新用户的来源渠道

来源渠道：新用户在201307可能多次登录平台，需要找出最早登陆平台所属渠道

分两步来做：

找出新用户所有登陆记录（qid,logintime,src）
针对同一qid找出logintime最小时的src

实现

数据准备

1）建表

Sql代码  
create table if not exists glogin_daily (year int,month int,day int,hour int,logintime string,qid int,gkey string,skey string,loginip string,registsrc string,loginfrom string) partitioned by (dt string);  

依据日志内容及所关心的信息创建表格，按天组织分区

2 ) 数据导入

因日志文件存在于多处，我们先将日志汇总到一临时目录，创建临时外部表将数据加载进hive，然后通过正则匹配的方式分隔出各字段。（内部表只能load单文件，通过这种方式可以load文件夹）

Shell代码  
echo "==== load data into tmp table $TMP_TABLE ==="  
    ${SOFTWARE_BASE}/hive/bin/hive -e "create external table $TMP_TABLE (info string) location '${TMP_DIR}';"  
    echo "==== M/R ==="  
    CURR_YEAR=`echo $CURR_DOING|cut -b 1-4`  
    CURR_MONTH=`echo $CURR_DOING|cut -b 5-6`  
    CURR_DAY=`echo $CURR_DOING|cut -b 7-8`  
    dt="${CURR_YEAR}-${CURR_MONTH}-${CURR_DAY}"  
    ${SOFTWARE_BASE}/hive/bin/hive -e "add file ${SCRIPT_PATH}/${MAP_SCRIPT_FILE};set hive.exec.dynamic.partition=true;insert overwrite table glogin_daily partition (dt='${dt}') select transform (t.i) using '$MAP_SCRIPT_PARSER ./${MAP_SCRIPT_FILE}' as (y,m,d,h,t,q,g,s,ip,src,f) from (select info as i from ${TMP_TABLE}) t;"  

其中filter_login.php:

Php代码  
$fr=fopen("php://stdin","r");  
$month_dict = array(  
    'Jan' => 1,  
    'Feb' => 2,  
    'Mar' => 3,  
    'Apr' => 4,  
    'May' => 5,  
    'Jun' => 6,  
    'Jul' => 7,  
    'Aug' => 8,  
    'Sep' => 9,  
    'Oct' => 10,  
    'Nov' => 11,  
    'Dec' => 12,  
);  
while(!feof($fr))  
{  
    $input = fgets($fr,256);  
    $input = rtrim($input);  
    //Jul 23 0:00:00  [info] {SPR}xxj{SPR}20130723000000{SPR}245396389{SPR}S9{SPR}iwan-ng-mnsg{SPR}cl-reg-xxj0if{SPR}221.5.67.136{SPR}  
    if(preg_match("/([^ ]+) +(\d+) (\d+):.*\{SPR\}([^\{]*)\{SPR\}(\d+)\{SPR\}(\d+)\{SPR\}([^\{]*)\{SPR\}([^\{]*)\{SPR\}(([^\{]*)\{SPR\}([^\{]*)\{SPR\})?/",$input,$matches))  
    {  
       $year = substr($matches[5],0,4);  
       echo $year."\t".$month_dict[$matches[1]]."\t".$matches[2]."\t".$matches[3]."\t".$matches[5]."\t".$matches[6]."\t".$matches[4]."\t".$matches[7]."\t".$matches[11]."\t".$matches[8]."\t".$matches[10]."\n";  
    }  
}  
fclose ($fr);  

2.找出新用户

1)用户登陆平台记录按月消重汇总

Sql代码  
create table distinct_login_monthly_tmp_07 as select qid,year,month from glogin_daily group by qid,year,month;  

2)用户登陆平台月数

Sql代码  
create table login_stat_monthly_tmp_07 as select qid,count(1) as c from distinct_login_monthly_tmp_07 where year<2013 or (year=2013 and month<=7) group by qid;   

平台级新用户：
1)找出登陆月数为1的用户;

2.判断这些用户是否在7月份出现，如果有出现，找出登陆所有src

Sql代码  
create table new_player_monthly_07 as select distinct a.qid,b.src,b.logintime from (select qid from login_stat_monthly_tmp_07 where c=1) a join (select qid,loginfrom as src,logintime from glogin_daily where month=7 and year=2013) b on a.qid=b.qid;  

找出最早登陆的src:

Sql代码  
add file /home/game/lvbenwei/load_login/get_player_src.php;  
create table new_player_src_07 as select transform (t.qid,t.src,t.logintime) using 'php ./get_player_src.php' as (qid,src,logintime) from (select * from new_player_monthly_07 order by qid,logintime) t;  

其中get_player_src.php:

Php代码  
$fr=fopen("php://stdin","r");  
$curr_qid = null;  
$curr_src = null;  
$curr_logintime=null;  
while(!feof($fr))  
{  
    $input = fgets($fr,1024);  
    $input = rtrim($input);  
    $arr   = explode("\t", $input);  
    $qid   = trim($arr[0]);  
        if(emptyempty($curr_qid)||$curr_qid != $qid)  
        {  
                $curr_qid = $qid;  
                echo $input."\n";  
        }  
}  
fclose ($fr);  

平台级新用户数：

Sql代码  
select count(*) from new_player_src_07;  

平台级各渠道新用户汇总：

Sql代码  
create table new_player_src_stat_07 as select src,count(*) from new_player_monthly_07 group by src;  

http://godlovesdog.iteye.com/blog/1926259

分享到：

hive日志分析实战（一） | hadoop学习--基于Hive的Hadoop日志分析

2014-09-08 14:19
浏览 734
评论(0)
分类:编程语言
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

hive日志分析实战（二）

需求

分析

实现

评论

发表评论

相关推荐

最近访客 更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

hive日志分析实战（二）

需求

分析

实现

评论

发表评论

相关推荐

基于hive的日志分析系统

HIVE 处理日志，自定义inputformat 完整版

Using Hive for Data Analysis

hadoop学习--基于Hive的Hadoop日志分析

hive日志分析实战（一）

hive支持sql大全

hive支持sql大全

hive导入nginx日志

分别使用Hadoop MapReduce、hive统计手机流量

Hive getstarted

Hive如何加载和导入HBase的数据

数据仓库工具

FACEBOOK架构

HBase入门篇（转）

HBase/Hadoop学习笔记 (转)

运行MapReduce作业做集成测试

Hadoop分布式文件系统：架构和设计要点(转)

hadoop中的数据序列化及数据类型

GitHub项目Storm-HBase介绍

HBase Thrift 接口的一些使用问题及相关注意事项

最近访客更多访客>>