Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Duplicate
-
0.14.0
-
None
-
None
-
RHEL 6.4
Hortonworks Hadoop 2.2
Description
When trying to run a simple aggregation query with hive.map.aggr=true, map tasks take a lot of time in Hive 0.14 as against with hive.map.aggr=false.
e.g.
Consider the query:
INSERT OVERWRITE TABLE lineitem_tgt_agg select alias.a0 as a0, alias.a2 as a1, alias.a1 as a2, alias.a3 as a3, alias.a4 as a4 from ( select alias.a0 as a0, SUM(alias.a1) as a1, SUM(alias.a2) as a2, SUM(alias.a3) as a3, SUM(alias.a4) as a4 from ( select lineitem_sf500.l_orderkey as a0, CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * (1 - lineitem_sf500.l_discount) * (1 + lineitem_sf500.l_tax) as double) as a1, lineitem_sf500.l_quantity as a2, CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * lineitem_sf500.l_discount as double) as a3, CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * lineitem_sf500.l_tax as double) as a4 from lineitem_sf500 ) alias group by alias.a0 ) alias;
The above query was run with ~376GB of data / ~3billion records in the source.
It takes ~10 minutes with hive.map.aggr=false.
With map side aggregation set to true, the map tasks don't complete even after an hour.
Attachments
Attachments
Issue Links
- duplicates
-
HIVE-11502 Map side aggregation is extremely slow
- Closed