
How to get a map-side join.

 
In Hive, override the table's rawDataSize statistic so the optimizer treats the table as tiny (rawDataSize = 1):
14: jdbc:hive2://ark3:9994> alter table app_uuid_info_test set tblproperties('rawDataSize'='1');
The statistics of this HBase-backed table are never refreshed automatically, so the value has to be set by hand.
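
To double-check that the property actually stuck, something like the following can be run from a Hive-enabled spark-shell (a sketch; the session object spark is assumed):

// Sketch (Hive-enabled spark-shell): confirm the hand-set statistic stuck.
spark.sql("SHOW TBLPROPERTIES app_uuid_info_test").show(false)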


The broadcast threshold (in bytes) is controlled by:
set spark.sql.autoBroadcastJoinThreshold=100000000;
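
The same knob can be set per session through the runtime conf; a minimal spark-shell sketch (a value of -1 disables automatic broadcast joins entirely):

// Sketch (spark-shell): set and read the threshold via the runtime conf.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100000000L)
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))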

In the query itself, the dimension table is joined with a plain left outer join app_uuid_info_test b:

select a.tmp_id, a.uuid, a.eguan_id, a.device_id, b.row_key, b.value.int_20 from tmp.ods_app_hour a left outer join app_uuid_info_test b on a.tmp_id = b.row_key where a.app_id = '6069' and a.day = '20180303' and a.tmp_id = '1a16d393e7042213384f994394b763d37121d7' limit 100;


During execution this now runs as a map join (broadcast join).
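
The plan dumps further below (parsed, analyzed, optimized, physical) can be reproduced from a Hive-enabled spark-shell with explain(true); a sketch, assuming the same tables:

spark.sql(
  """select a.tmp_id, a.uuid, a.eguan_id, a.device_id, b.row_key, b.value.int_20
    |from tmp.ods_app_hour a
    |left outer join app_uuid_info_test b on a.tmp_id = b.row_key
    |where a.app_id = '6069' and a.day = '20180303'
    |  and a.tmp_id = '1a16d393e7042213384f994394b763d37121d7'
    |limit 100""".stripMargin
).explain(true)  // prints the parsed, analyzed, optimized and physical plans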


The decision to use a broadcast join is made in org.apache.spark.sql.execution.SparkStrategies; the relevant extractor is CanBroadcast:


object CanBroadcast {
  def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
    case BroadcastHint(p) => Some(p)
    case p if sqlContext.conf.autoBroadcastJoinThreshold > 0 &&
      p.statistics.sizeInBytes <= sqlContext.conf.autoBroadcastJoinThreshold => Some(p)
    case _ => None
  }
}
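
The same size-versus-threshold comparison can be replayed by hand; a rough sketch against the Spark 2.0-2.2 API (LogicalPlan.statistics was renamed to stats in 2.3), assuming a Hive-enabled session named spark:

// Replay the comparison CanBroadcast makes for the dimension table.
val dim = spark.table("app_uuid_info_test")
val sizeInBytes = dim.queryExecution.optimizedPlan.statistics.sizeInBytes
val threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold").toLong
println(s"size=$sizeInBytes, threshold=$threshold, broadcastable=${threshold > 0 && sizeInBytes <= threshold}")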




== Parsed Logical Plan ==
'GlobalLimit 100
+- 'LocalLimit 100
   +- 'Project ['a.tmp_id, 'a.uuid, 'a.eguan_id, 'a.device_id, 'b.row_key, 'b.value.int_20]
      +- 'Filter ((('a.app_id = 6069) && ('a.day = 20180303)) && ('a.tmp_id = 1a16d393e7042213384f994394b763d37121d7))
         +- 'Join LeftOuter, ('a.tmp_id = 'b.row_key)
            :- 'UnresolvedRelation `tmp`.`ods_app_hour`, a
            +- 'UnresolvedRelation `app_uuid_info_test`, b

== Analyzed Logical Plan ==
tmp_id: string, uuid: bigint, eguan_id: string, device_id: string, row_key: string, int_20: string
GlobalLimit 100
+- LocalLimit 100
   +- Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020, row_key#22195, value#22196[int_20] AS int_20#22197]
      +- Filter (((app_id#22014 = 6069) && (day#22015 = 20180303)) && (tmp_id#22017 = 1a16d393e7042213384f994394b763d37121d7))
         +- Join LeftOuter, (tmp_id#22017 = row_key#22195)
            :- SubqueryAlias a
            :  +- SubqueryAlias ods_app_hour
            :     +- Relation[tmp_id#22017,uuid#22018L,eguan_id#22019,device_id#22020,device_mac#22021,device_imsi#22022,device_aid#22023,device_snr#22024,device_udid#22025,debug_state#22026,hjk_state#22027,sir_state#22028,ij_state#22029,sdk_api_level#22030,standard_brand_id#22031,standard_model_id#22032,os_name_id#22033,os_version_id#22034,standard_smo_id#22035,app_key#22036,app_channel#22037,sdk_version#22038,app_version#22039,imeis_code#22040,... 157 more fields] parquet
            +- SubqueryAlias b
               +- MetastoreRelation default, app_uuid_info_test

== Optimized Logical Plan ==
GlobalLimit 100
+- LocalLimit 100
   +- Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020, row_key#22195, value#22196[int_20] AS int_20#22197]
      +- Join LeftOuter, (tmp_id#22017 = row_key#22195)
         :- Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020]
         :  +- Filter (((((isnotnull(app_id#22014) && isnotnull(day#22015)) && isnotnull(tmp_id#22017)) && (app_id#22014 = 6069)) && (day#22015 = 20180303)) && (tmp_id#22017 = 1a16d393e7042213384f994394b763d37121d7))
         :     +- Relation[tmp_id#22017,uuid#22018L,eguan_id#22019,device_id#22020,device_mac#22021,device_imsi#22022,device_aid#22023,device_snr#22024,device_udid#22025,debug_state#22026,hjk_state#22027,sir_state#22028,ij_state#22029,sdk_api_level#22030,standard_brand_id#22031,standard_model_id#22032,os_name_id#22033,os_version_id#22034,standard_smo_id#22035,app_key#22036,app_channel#22037,sdk_version#22038,app_version#22039,imeis_code#22040,... 157 more fields] parquet
         +- MetastoreRelation default, app_uuid_info_test

== Physical Plan ==
CollectLimit 100
+- *Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020, row_key#22195, value#22196[int_20] AS int_20#22197]
   +- *BroadcastHashJoin [tmp_id#22017], [row_key#22195], LeftOuter, BuildRight
      :- *Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020]
      :  +- *Filter (isnotnull(tmp_id#22017) && (tmp_id#22017 = 1a16d393e7042213384f994394b763d37121d7))
      :     +- *FileScan parquet tmp.ods_app_hour[tmp_id#22017,uuid#22018L,eguan_id#22019,device_id#22020,app_id#22014,day#22015,hour#22016] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://mycluster/user/hive/warehouse/tmp.db/ods_app_hour/app_id=6069/day=..., PartitionCount: 24, PartitionFilters: [isnotnull(app_id#22014), isnotnull(day#22015), (app_id#22014 = 6069), (day#22015 = 20180303)], PushedFilters: [IsNotNull(tmp_id), EqualTo(tmp_id,1a16d393e7042213384f994394b763d37121d7)], ReadSchema: struct<tmp_id:string,uuid:bigint,eguan_id:string,device_id:string>
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
         +- HiveTableScan [row_key#22195, value#22196], MetastoreRelation default, app_uuid_info_test
2018-03-04 19:45:47,554 INFO  [dispatcher-event-loop-10] storage.BlockManagerInfo: Removed broadcast_173_piece0 on 192.168.220.171:38696 in memory (size: 33.7 KB, free: 334.5 MB)


Writing the dimension table as a filtered subquery:

select a.tmp_id, a.uuid, a.eguan_id, a.device_id, b.row_key, b.int_20 from tmp.ods_app_hour a left outer join (select row_key, value.int_20 as int_20 from app_uuid_info_test where row_key in (01,02)) b on a.tmp_id = b.row_key where a.app_id = '6069' and a.day = '20180303' and a.tmp_id = '1a16d393e7042213384f994394b763d37121d7' limit 100



== Physical Plan ==
CollectLimit 100
+- *BroadcastHashJoin [tmp_id#23309], [row_key#23487], LeftOuter, BuildRight
   :- *LocalLimit 100
   :  +- *Project [tmp_id#23309, uuid#23310L, eguan_id#23311, device_id#23312]
   :     +- *Filter (isnotnull(tmp_id#23309) && (tmp_id#23309 = 1a16d393e7042213384f994394b763d37121d7))
   :        +- *FileScan parquet tmp.ods_app_hour[tmp_id#23309,uuid#23310L,eguan_id#23311,device_id#23312,app_id#23306,day#23307,hour#23308] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://mycluster/user/hive/warehouse/tmp.db/ods_app_hour/app_id=6069/day=..., PartitionCount: 24, PartitionFilters: [isnotnull(app_id#23306), isnotnull(day#23307), (app_id#23306 = 6069), (day#23307 = 20180303)], PushedFilters: [IsNotNull(tmp_id), EqualTo(tmp_id,1a16d393e7042213384f994394b763d37121d7)], ReadSchema: struct<tmp_id:string,uuid:bigint,eguan_id:string,device_id:string>
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
      +- *Project [row_key#23487, value#23488[int_20] AS int_20#22932]
         +- *Filter row_key#23487 IN (1,2)
            +- HiveTableScan [row_key#23487, value#23488], MetastoreRelation default, app_uuid_info_test
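
For completeness, the DataFrame API offers an explicit hint that sidesteps the statistics check entirely: functions.broadcast wraps the plan in BroadcastHint, the first case CanBroadcast matches. A sketch with the same tables assumed:

// Sketch: the same join through the DataFrame API with an explicit hint.
// broadcast() forces the dimension side to be broadcast regardless of
// its statistics or the autoBroadcastJoinThreshold setting.
import org.apache.spark.sql.functions.broadcast

val fact = spark.table("tmp.ods_app_hour")
val dim  = spark.table("app_uuid_info_test")
fact.join(broadcast(dim), fact("tmp_id") === dim("row_key"), "left_outer")
  .select("tmp_id", "uuid", "eguan_id", "device_id", "row_key")
  .show(100, false)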

