Steps for analyzing logs with hive+hdfs+sqoop

akingde

Blog category: Big data technology (hadoop)

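Part of my current work is log analysis. Each day's logs come to a bit over 80 GB before compression and about 10 GB after compressing with lzop. Running the statistics directly in shell takes a long time to finish, and the request URLs also have to be converted with a Java function, so I adopted a hive+hdfs+sqoop setup for the log statistics and analysis.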

I won't go into the hadoop+hive+hdfs+sqoop architecture in detail; everything can be installed directly from cloudera's repo.

Log analysis steps

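1. Download the logs from the servers. The application runs on several servers, so the logs have to be merged and tidied up first, then compressed with lzop.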
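
The merge-and-compress step could look roughly like the sketch below. The directory layout (one subdirectory per server, each holding an access.log) and the function names are my assumptions for illustration, not something the original setup prescribes.

```shell
#!/bin/sh
# Sketch: merge per-server logs into one file, then compress it with lzop.
# Assumed (hypothetical) layout: "$src_dir/<server>/access.log".

merge_logs() {
    src_dir=$1      # directory with one subdirectory per server
    out_file=$2     # merged output file
    : > "$out_file" # start from an empty file so reruns do not append twice
    # The glob expands in sorted order, so the merge order is stable.
    for f in "$src_dir"/*/access.log; do
        [ -f "$f" ] || continue
        cat "$f" >> "$out_file"
    done
}

compress_log() {
    # lzop writes "$1.lzo" next to the input and keeps the original file;
    # -f overwrites an existing .lzo from an earlier run.
    lzop -f "$1"
}

# Usage sketch:
#   merge_logs /home/log/raw /home/log/1
#   compress_log /home/log/1      # produces /home/log/1.lzo
```

Plain concatenation is enough here because the later query only counts requests per IP, so line order does not matter.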

2. Create the table in hive

hive>CREATE TABLE maptile (ipaddress STRING,identity STRING,user STRING,time STRING,method STRING,request STRING,protocol STRING,status STRING,size STRING,referer STRING,agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) (\"[^ ]*) ([^ ]*) ([^ ]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?","output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s")STORED AS TEXTFILE;

3. Load the log data

hive>load data local inpath '/home/log/1.lzo' overwrite into table maptile;

4. Create a table in hive to hold the aggregated results

hive>create table result (ip string,num int) partitioned by (dt string);

5. Aggregate the logs and insert the results into the new table

hive>insert overwrite table result partition (dt='2011-09-22') select ipaddress,count(1) as numrequest from maptile group by ipaddress sort by numrequest desc;
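
To sanity-check the numbers on a small uncompressed sample, the same per-IP count can be reproduced in plain shell. This is only a cross-check sketch; the function name is made up, and it assumes the client IP is the first space-separated field of each line, as in the table definition above.

```shell
#!/bin/sh
# Sketch: count requests per client IP in a plain-text access log.
# Prints "<ip> <count>" per line, busiest IP first, mirroring (ip, num) in result.

count_by_ip() {
    awk '{ count[$1]++ } END { for (ip in count) print ip, count[ip] }' "$1" |
        sort -k2,2nr -k1,1   # by count descending, ties broken by IP
}

# Usage sketch:
#   head -n 100000 /home/log/1 > /tmp/sample.log
#   count_by_ip /tmp/sample.log | head
```

Note that in Hive `sort by` only orders rows within each reducer; a total ordering across the whole output needs `order by`.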

6. Export the results to mysql

sqoop export --connect jdbc:mysql://localhost:3306/result --username root --password admin --table ip_info --export-dir /user/hive/warehouse/result/dt=2011-09-22 --input-fields-terminated-by '\001'
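
The '\001' in the export command is there because the result table was created without a ROW FORMAT clause, so Hive writes its default field delimiter, Ctrl-A (byte 0x01), between columns. A small helper like the following (the name is made up) makes that delimiter visible when inspecting the files before the export. Also keep in mind that sqoop export expects the target table (ip_info here) to exist in mysql already.

```shell
#!/bin/sh
# Sketch: turn Hive's default field delimiter (Ctrl-A, octal 001) into a tab
# so the exported columns can be read on a terminal.

show_fields() {
    tr '\001' '\t'
}

# Usage sketch (assumes the hadoop CLI is on PATH):
#   hadoop fs -cat '/user/hive/warehouse/result/dt=2011-09-22/*' | head -n 5 | show_fields
```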

The steps above can be written into a shell script and scheduled as a recurring job so that they run automatically.
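
One possible shape for such a script is sketched below. The /home/log/<date>.lzo naming, the DRY_RUN switch, and the script path in the crontab line are assumptions of mine; the hive and sqoop commands are the ones from steps 3, 5 and 6 with the date parameterized.

```shell
#!/bin/sh
# Sketch of a daily pipeline. With DRY_RUN=1 commands are printed, not executed.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

daily_pipeline() {
    dt=$1   # partition date, e.g. 2011-09-22
    # Assumed naming: the merged, compressed log for a day is /home/log/<date>.lzo
    run hive -e "load data local inpath '/home/log/$dt.lzo' overwrite into table maptile;" || return 1
    run hive -e "insert overwrite table result partition (dt='$dt') select ipaddress,count(1) as numrequest from maptile group by ipaddress sort by numrequest desc;" || return 1
    run sqoop export --connect jdbc:mysql://localhost:3306/result \
        --username root --password admin --table ip_info \
        --export-dir "/user/hive/warehouse/result/dt=$dt" \
        --input-fields-terminated-by '\001' || return 1
}

# Usage sketch:
#   DRY_RUN=1 daily_pipeline 2011-09-22    # print the three commands only
# Example crontab entry (hypothetical script path), running at 02:00 every day:
#   0 2 * * * /path/to/daily_pipeline.sh
```

Each stage returns early on failure so that a broken load does not get exported to mysql.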

2013-01-19 23:28