
Hive on Tez: installing and testing Hive running on top of Tez


Hive on Tez: detailed configuration and a test run

tez hadoop hive hdfs yarn


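Environment: hadoop-2.5.2, hive-0.14, tez-0.5.3.
Hive on Tez can be installed and configured in two ways:

  1. Configure it in Hadoop
  2. Configure it in Hive

Comparison:
Approach 2: if you already have a stable Hadoop cluster and do not want to touch it, consider the second approach. With it, only Hive programs can switch execution engines dynamically with set hive.execution.engine=mr; (the value is tez or mr). All other MapReduce programs keep running as plain MapReduce on YARN.
Approach 1: this is more intrusive and affects the existing Hadoop cluster. You set mapreduce.framework.name to yarn-tez in Hadoop's mapred-site.xml, which means that every MR job submitted through this cluster goes through Tez. Once that is configured, Hive also runs on Tez by default without any further configuration.
Because of this I wanted the second approach from the start, and I took a lot of detours before finding out how to configure it.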

Before starting you need to build Tez from source yourself. The build is not covered in detail here; the commands were:

 
    root@localhost:/opt/work# wget http://www.eu.apache.org/dist/tez/0.5.3/apache-tez-0.5.3-src.tar.gz
    root@localhost:/opt/work# tar zxvf apache-tez-0.5.3-src.tar.gz
    root@localhost:/opt/work# cd apache-tez-0.5.3-src
    # The build takes a long time. If it fails midway, abort it and rerun the mvn
    # command, several times if necessary. After a successful build the directory
    # looks like the listing below.
    root@localhost:/opt/work/apache-tez-0.5.3-src# mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true
    root@localhost:/opt/work/apache-tez-0.5.3-src# ll
    total 204
    drwxrwxr-x 15  500  500  4096 May 26 16:29 ./
    drwxr-xr-x 38 root root  4096 May 26 16:34 ../
    -rw-rw-r--  1  500  500  5753 Dec  5 06:25 BUILDING.txt
    -rw-rw-r--  1  500  500 60199 Dec  5 06:25 CHANGES.txt
    drwxrwxr-x  4  500  500  4096 May 26 16:30 docs/
    -rw-rw-r--  1  500  500    66 Dec  5 06:25 .gitignore
    lrwxrwxrwx  1  500  500    33 Dec  5 06:25 INSTALL.md -> docs/src/site/markdown/install.md
    -rw-rw-r--  1  500  500 14470 Dec  5 06:25 KEYS
    -rw-rw-r--  1  500  500 11358 Dec  5 06:25 LICENSE.txt
    -rw-rw-r--  1  500  500   164 Dec  5 06:25 NOTICE.txt
    -rw-rw-r--  1  500  500 34203 Dec  5 06:25 pom.xml
    -rw-rw-r--  1  500  500  1433 Dec  5 06:25 README.md
    drwxr-xr-x  3 root root  4096 May 26 16:29 target/
    drwxrwxr-x  4  500  500  4096 May 26 16:29 tez-api/
    drwxrwxr-x  4  500  500  4096 May 26 16:29 tez-common/
    drwxrwxr-x  4  500  500  4096 May 26 16:29 tez-dag/
    drwxrwxr-x  4  500  500  4096 May 26 16:30 tez-dist/
    drwxrwxr-x  4  500  500  4096 May 26 16:29 tez-examples/
    drwxrwxr-x  4  500  500  4096 May 26 16:29 tez-mapreduce/
    drwxrwxr-x  5  500  500  4096 May 26 16:29 tez-plugins/
    drwxrwxr-x  4  500  500  4096 May 26 16:29 tez-runtime-internals/
    drwxrwxr-x  4  500  500  4096 May 26 16:29 tez-runtime-library/
    drwxrwxr-x  4  500  500  4096 May 26 16:29 tez-tests/
    drwxrwxr-x  3  500  500  4096 Dec  5 06:25 tez-tools/
    root@localhost:/opt/work/apache-tez-0.5.3-src# ll tez-dist/target/
    total 40444
    drwxr-xr-x 5 root root     4096 May 26 16:30 ./
    drwxrwxr-x 4  500  500     4096 May 26 16:30 ../
    drwxr-xr-x 2 root root     4096 May 26 16:30 archive-tmp/
    drwxr-xr-x 2 root root     4096 May 26 16:30 maven-archiver/
    drwxr-xr-x 3 root root     4096 May 26 16:30 tez-0.5.3/
    -rw-r--r-- 1 root root 10625995 May 26 16:30 tez-0.5.3-minimal.tar.gz
    -rw-r--r-- 1 root root 30757128 May 26 16:30 tez-0.5.3.tar.gz
    -rw-r--r-- 1 root root     2791 May 26 16:30 tez-dist-0.5.3-tests.jar
    root@localhost:/opt/work/apache-tez-0.5.3-src#

The resulting tez-dist/target/tez-0.5.3.tar.gz is the Tez binary package we need. Upload tez-0.5.3.tar.gz to a directory on HDFS:

 
    [hadoop@mymaster local]$ hadoop fs -mkdir /apps
    [hadoop@mymaster local]$ hadoop fs -copyFromLocal tez-0.5.3.tar.gz /apps/
    [hadoop@mymaster local]$ hadoop fs -ls /apps
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/oneapm/local/hadoop-2.5.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/oneapm/local/tez-0.5.3/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    Found 1 items
    -rw-r--r--   2 hadoop supergroup   30757128 2015-05-26 16:53 /apps/tez-0.5.3.tar.gz
    [hadoop@mymaster local]$

Next, create a tez-site.xml file in the $HADOOP_HOME/etc/hadoop/ directory on the Hadoop master node with the following content:

 
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>tez.lib.uris</name>
        <value>${fs.defaultFS}/apps/tez-0.5.3.tar.gz</value>
      </property>
    </configuration>
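
If the file is produced by a setup script rather than by hand, it could be generated along these lines. This is only a sketch: write_tez_site is a hypothetical helper name, the demo writes into a scratch directory rather than a real $HADOOP_HOME, and the xml-stylesheet line is left out for brevity.

```shell
#!/usr/bin/env bash
# Sketch: write a tez-site.xml whose tez.lib.uris points at the uploaded tarball.
# write_tez_site is a hypothetical helper, not part of Tez or Hadoop.
write_tez_site() {
    local conf_dir=$1      # e.g. "$HADOOP_HOME/etc/hadoop"
    local tez_tarball=$2   # HDFS path of the uploaded Tez tarball
    # \${fs.defaultFS} is escaped so that Hadoop, not the shell, resolves it.
    cat > "$conf_dir/tez-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>\${fs.defaultFS}${tez_tarball}</value>
  </property>
</configuration>
EOF
}

# Demo against a scratch directory so no real configuration is touched.
demo_dir=$(mktemp -d)
write_tez_site "$demo_dir" /apps/tez-0.5.3.tar.gz
grep '<value>' "$demo_dir/tez-site.xml"
```

On the master node the first argument would be "$HADOOP_HOME/etc/hadoop".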

Everything up to this point is required for both approaches. The two ways of configuring Hive on Tez are described next.

 

Approach 1: configure it in Hadoop

The Tez jars have to be added to $HADOOP_CLASSPATH. Append the following to the end of hadoop-env.sh:

 
    export TEZ_HOME=/oneapm/local/tez-0.5.3   # the directory where you unpacked Tez
    # jars in the top-level Tez directory
    for jar in `ls $TEZ_HOME | grep jar`; do
        export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$TEZ_HOME/$jar
    done
    # dependency jars under lib/
    for jar in `ls $TEZ_HOME/lib`; do
        export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$TEZ_HOME/lib/$jar
    done
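
The same classpath can be built with shell globs instead of parsing ls output, which is a little more robust against odd file names. The sketch below is an alternative, not the configuration used in this article: tez_classpath is a hypothetical function name, the jar names in the demo are placeholders, and unlike the loop above it picks up only *.jar files from lib/.

```shell
#!/usr/bin/env bash
# Sketch: build a colon-separated classpath from the jars under a Tez directory.
# tez_classpath is a hypothetical helper name.
tez_classpath() {
    local tez_home=$1 cp="" jar
    for jar in "$tez_home"/*.jar "$tez_home"/lib/*.jar; do
        [ -e "$jar" ] || continue      # skip a glob pattern that matched nothing
        cp="${cp:+$cp:}$jar"           # add ':' only between entries
    done
    printf '%s\n' "$cp"
}

# Demo with placeholder jar files in a scratch directory.
demo_home=$(mktemp -d)
mkdir -p "$demo_home/lib"
touch "$demo_home/tez-api-0.5.3.jar" "$demo_home/lib/guava-11.0.2.jar"
tez_classpath "$demo_home"
```

In hadoop-env.sh this could be used as export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$(tez_classpath "$TEZ_HOME").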

Edit mapred-site.xml:

 
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn-tez</value>
    </property>

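After these changes, sync mapred-site.xml, hadoop-env.sh and tez-site.xml to every node in the cluster. This affects the whole cluster, which is why I did not want to do it this way.
Run the Tez example MR program to verify that the installation works: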

 
    [hadoop@mymaster tez-0.5.3]$ hadoop jar $TEZ_HOME/tez-examples-0.5.3.jar orderedwordcount /license.txt /out

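You have to prepare license.txt yourself and upload it to HDFS. If the job runs without problems, the YARN web UI on port 8088 shows the following:
[Screenshot: application list in the YARN web UI]
The application type marked by the arrow is TEZ, which means the installation succeeded.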

 

Approach 2: configure it in Hive

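Before starting with the second approach, undo the steps of approach 1 so that the Hadoop configuration files are back in their original state. The tez-site.xml file only needs to be present on the master node.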

Copy the jars in the Tez directory and the jars in its lib subdirectory into $HIVE_HOME/lib. That is all.
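
As a rough sketch, the copy step could look like this. copy_tez_jars is a hypothetical helper name, and the demo uses scratch directories with placeholder jar files instead of a real Tez or Hive installation.

```shell
#!/usr/bin/env bash
# Sketch: copy the Tez jars (top level and lib/) into Hive's lib directory.
# copy_tez_jars is a hypothetical helper name.
copy_tez_jars() {
    local tez_home=$1 hive_home=$2
    cp "$tez_home"/*.jar "$tez_home"/lib/*.jar "$hive_home/lib/"
}

# Demo on scratch directories with placeholder jar files.
tez_demo=$(mktemp -d)
hive_demo=$(mktemp -d)
mkdir -p "$tez_demo/lib" "$hive_demo/lib"
touch "$tez_demo/tez-dag-0.5.3.jar" "$tez_demo/lib/commons-io-2.4.jar"
copy_tez_jars "$tez_demo" "$hive_demo"
ls "$hive_demo/lib"
```

On a real host the call would be copy_tez_jars "$TEZ_HOME" "$HIVE_HOME".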
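While I was setting this up, Hive and the Hadoop master were on the same node and Hive was started as the hadoop user; both tez and mr worked without problems. Because the master node does not have enough physical resources to host Hive as well, I moved the same Hive configuration to a clean host, hiveclient. There, Hive on MR jobs ran fine, but Hive on Tez failed with this error: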

 
    hive (default)> set hive.execution.engine=tez;
    hive (default)> select json_udtf(data) from tpm.tps_dc_metricdata where pt=2015060200 limit 1;
    Query ID = blueadmin_20150603130202_621abba7-850e-4683-8331-aee8482f2ebe
    Total jobs = 1
    Launching Job 1 out of 1
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
    hive (default)>

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask 
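Many things can cause this error, and the message alone does not show where the problem is, so the only option is to read Hive's job log. The log location is set in hive-log4j.properties under $HIVE_HOME/conf as hive.log.dir=${java.io.tmpdir}/${user.name}; with this setting Hive writes its job log and run log under /tmp/${user}/. The log contained the following: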

 
    2015-06-03 13:03:01,071 INFO  [main]: tez.DagUtils (DagUtils.java:createLocalResource(718)) - Resource modification time: 1433307781075
    2015-06-03 13:03:01,126 ERROR [main]: exec.Task (TezTask.java:execute(184)) - Failed to execute tez graph.
    java.io.FileNotFoundException: File does not exist: hdfs:/user/hivetest
        at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1072)
        at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1064)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1064)
        at org.apache.hadoop.hive.ql.exec.tez.DagUtils.getDefaultDestDir(DagUtils.java:774)
        at org.apache.hadoop.hive.ql.exec.tez.DagUtils.getHiveJarDirectory(DagUtils.java:870)
        at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.createJarLocalResource(TezSessionState.java:337)
        at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:158)
        at org.apache.hadoop.hive.ql.exec.tez.TezTask.updateSession(TezTask.java:234)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:410)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:783)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
    2015-06-03 13:03:01,127 ERROR [main]: ql.Driver (SessionState.java:printError(833)) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
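
In a long log the relevant entry is easy to miss. One way to pull it out is to print the first ERROR line together with the line after it, which normally carries the exception message. This is a sketch only: first_error is a hypothetical helper, and the demo runs on a three-line excerpt of the log above written to a scratch file. On a real host the file would be under the hive.log.dir directory; the name hive.log is an assumption about the default hive.log.file setting.

```shell
#!/usr/bin/env bash
# Sketch: print the first ERROR line of a Hive log and the line that follows it.
# first_error is a hypothetical helper name.
first_error() {
    local log_file=$1
    # -m 1 stops at the first match, -A 1 adds one line of trailing context
    grep -m 1 -A 1 'ERROR' "$log_file"
}

# Demo on an excerpt of the log shown above.
demo_log=$(mktemp)
cat > "$demo_log" <<'EOF'
2015-06-03 13:03:01,071 INFO  [main]: tez.DagUtils (DagUtils.java:createLocalResource(718)) - Resource modification time: 1433307781075
2015-06-03 13:03:01,126 ERROR [main]: exec.Task (TezTask.java:execute(184)) - Failed to execute tez graph.
java.io.FileNotFoundException: File does not exist: hdfs:/user/hivetest
EOF
first_error "$demo_log"
```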

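The log says that HDFS has no /user/hivetest directory. That is true: Hive on the hiveclient host runs as the user hivetest, which has no home directory of its own on HDFS. Since the directory is missing, create it: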

 
    [hivetest@mymaster tez-0.5.3]$ hadoop fs -mkdir /user/hivetest

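With that the problem is solved; just start hive again. What follows are the initial test results for Hive on Tez compared with Hive on YARN (MapReduce).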
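
To avoid hitting the same error for other users, the check and the mkdir can be combined. The following is a sketch under assumptions: ensure_hdfs_home is a hypothetical helper, and the HADOOP_CMD variable is introduced here only so that the hadoop binary can be overridden. It relies on hadoop fs -test -d and hadoop fs -mkdir -p.

```shell
#!/usr/bin/env bash
# Sketch: create /user/<name> on HDFS only if it does not exist yet.
# ensure_hdfs_home and HADOOP_CMD are hypothetical names used for illustration.
ensure_hdfs_home() {
    local user=$1
    local hadoop_cmd=${HADOOP_CMD:-hadoop}
    if "$hadoop_cmd" fs -test -d "/user/$user"; then
        echo "/user/$user already exists"
    else
        "$hadoop_cmd" fs -mkdir -p "/user/$user"
    fi
}

# On a host with a configured Hadoop client:
#   ensure_hdfs_home hivetest
```

If /user is not writable by that user, the directory would probably have to be created by the HDFS superuser and handed over with hadoop fs -chown.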

 

Start Hive and run the test

 
    hive (default)> set hive.execution.engine=tez;
    hive (default)> select t.a,count(1) from (select split(data,'\t')[1] a,split(data,'\t')[2] b from tpm.tps_dc_metricdata limit 1000) t group by t.a ;
    Query ID = hadoop_20150526184141_556cf5d8-edf3-430a-b21a-513c35679567
    Total jobs = 1
    Launching Job 1 out of 1
    Tez session was closed. Reopening...
    Session re-established.
    Status: Running (Executing on YARN cluster with App id application_1432632452478_0005)
    --------------------------------------------------------------------------------
            VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
    --------------------------------------------------------------------------------
    Map 1 ..........   SUCCEEDED     26         26        0        0       0       0
    Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
    Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
    --------------------------------------------------------------------------------
    VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 24.60 s
    --------------------------------------------------------------------------------
    OK
    t.a _c1
    117
    107
    1003
    1016
    1051
    1172
    1182
    11911
    123
    1203
    1211
    1234
    1244
    12516
    1264
    1279
    12910
    1426
    2211
    (n more rows of output omitted)
    Time taken: 30.637 seconds, Fetched: 207 row(s)

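set hive.execution.engine=tez; selects Tez as the execution engine. To go back to MapReduce on YARN, use set hive.execution.engine=mr; instead.
While a query runs on Tez, the CLI shows a rather nice progress bar, as in the output above. The query processes 1000 records.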
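
For repeated comparisons the engine can also be chosen from the command line instead of inside the Hive shell. The wrapper below is a sketch: run_with_engine and the HIVE_CMD variable are hypothetical names introduced here, and only the two engines discussed in this article are accepted. It uses the Hive CLI options --hiveconf and -e.

```shell
#!/usr/bin/env bash
# Sketch: run one query with a chosen execution engine.
# run_with_engine and HIVE_CMD are hypothetical names used for illustration.
run_with_engine() {
    local engine=$1 query=$2
    local hive_cmd=${HIVE_CMD:-hive}
    case $engine in
        tez|mr) ;;                                              # the two engines used in this article
        *) echo "unsupported engine: $engine" >&2; return 2 ;;
    esac
    "$hive_cmd" --hiveconf hive.execution.engine="$engine" -e "$query"
}

# On a host with Hive installed, for example:
#   run_with_engine tez 'select count(1) from tpm.tps_dc_metricdata'
#   run_with_engine mr  'select count(1) from tpm.tps_dc_metricdata'
```

Hive prints its own "Time taken" line for each run, so the two results can be compared directly.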

Hive on YARN (MapReduce)

 
    hive (tpm)> set hive.execution.engine=mr;
    hive (tpm)> select t.a,count(1) from (select split(data,'\t')[1] a,split(data,'\t')[2] b from tpm.tps_dc_metricdata limit 1000) t group by t.a ;
    Query ID = hadoop_20150526140606_d73156e0-c81c-4b2a-bfb6-fd1d48fa8325
    Total jobs = 2
    Launching Job 1 out of 2
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1432521221608_0008, Tracking URL = http://mymaster:8088/proxy/application_1432521221608_0008/
    Kill Command = /oneapm/local/hadoop-2.5.2/bin/hadoop job -kill job_1432521221608_0008
    Hadoop job information for Stage-1: number of mappers: 70; number of reducers: 1
    2015-05-26 14:06:53,584 Stage-1 map = 0%, reduce = 0%
    2015-05-26 14:07:13,931 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 3.46 sec
    2015-05-26 14:07:15,004 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 21.37 sec
    2015-05-26 14:07:18,198 Stage-1 map = 12%, reduce = 0%, Cumulative CPU 43.02 sec
    2015-05-26 14:07:19,260 Stage-1 map = 19%, reduce = 0%, Cumulative CPU 47.7 sec
    2015-05-26 14:07:20,322 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 48.52 sec
    (output omitted)
    OK
    t.a _c1
    115
    108
    1001
    1013
    1022
    1045
    1051
    1063
    10710
    1092
    (output omitted)
    Time taken: 152.971 seconds, Fetched: 207 row(s)

Result of this test: Hive on Tez was roughly 5 times faster than Hive on YARN (MapReduce), 30.637 seconds against 152.971 seconds.
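
The factor follows directly from the two "Time taken" values in the outputs above. A small sketch of the arithmetic, where speedup is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Ratio of the two "Time taken" values reported above (MapReduce time / Tez time).
# speedup is a hypothetical helper name.
speedup() {
    awk -v mr="$1" -v tez="$2" 'BEGIN { printf "%.2f\n", mr / tez }'
}

speedup 152.971 30.637   # prints 4.99
```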

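The improvement is most noticeable for SQL that compiles into several Hive stages. Results may differ to some degree from one platform to another.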

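Summary: with the second configuration described above, the cluster default remains MapReduce on YARN. Hive can switch freely between mr and tez, and existing Hadoop MR jobs are not affected and keep running on YARN as before. Job status can be checked on port 8088, and the progress bar that the Hive command line shows when running on Tez is quite nice.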

 
