Hive
  1. Hive
  2. HIVE-219

Map-side aggregates output one row per reducer when not grouping

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 0.3.0
    • Component/s: Query Processor
    • Labels:
      None

      Description

      Example: SELECT count(1) FROM table;

        Issue Links

          Activity

          Hide
          David Phillips added a comment -

          With map-side:

          hive> set hive.map.aggr=true;
          hive> select count(1) from search_result;
          Total MapReduce jobs = 1
          Number of reducers = 7
          ...
          OK
          4012275
          4011646
          4011059
          4008870
          4011555
          4014719
          4013710
          Time taken: 169.683 seconds
          

          Without:

          hive> set hive.map.aggr=false;
          hive> select count(1) from search_result;
          Total MapReduce jobs = 2
          Number of reducers = 7
          ...
          OK
          28083834
          Time taken: 204.804 seconds
          
          Show
          David Phillips added a comment - With map-side: hive> set hive.map.aggr=true; hive> select count(1) from search_result; Total MapReduce jobs = 1 Number of reducers = 7 ... OK 4012275 4011646 4011059 4008870 4011555 4014719 4013710 Time taken: 169.683 seconds Without: hive> set hive.map.aggr=false; hive> select count(1) from search_result; Total MapReduce jobs = 2 Number of reducers = 7 ... OK 28083834 Time taken: 204.804 seconds
          Hide
          David Phillips added a comment -

          This issue cannot be caught by the current unit tests as they use the local runner. Should we have a test driver like TestCliDriver that uses MiniMRCluster?

          Show
          David Phillips added a comment - This issue cannot be caught by the current unit tests as they use the local runner. Should we have a test driver like TestCliDriver that uses MiniMRCluster?
          Hide
          Joydeep Sen Sarma added a comment -

          yeah - definitely need to use miniMR - there's one (or more) Jiras on this already ..

          Show
          Joydeep Sen Sarma added a comment - yeah - definitely need to use miniMR - there's one (or more) Jiras on this already ..
          Hide
          Ashish Thusoo added a comment -

          yes we should move to miniMR. The nfs related problem also seem to be related to what we do in MapRedTask in the local mode.

          Show
          Ashish Thusoo added a comment - yes we should move to miniMR. The nfs related problem also seem to be related to what we do in MapRedTask in the local mode.
          Hide
          Joydeep Sen Sarma added a comment -

          this is absolutely broken.

          i am trying count(1) with hive.map.aggr = true - and there is no map side aggregation happening (even though the explain has a map-side group by operator):

          Alias -> Map Operator Tree:
          mm_users_goodip_count
          Select Operator
          Group By Operator
          aggregations:
          expr: count(1)
          mode: hash
          Reduce Output Operator
          sort order:
          Map-reduce partition columns:
          expr: rand()
          type: double
          tag: -1
          value expressions:
          expr: 0
          type: bigint

          it seems that the groupbyDesc doe not have a 'keys' field specified (in other map side aggregates - i can see the keys specified).

          At any rate - the mapper emits one output row for each input row in this case. This is completely broken ..

          Show
          Joydeep Sen Sarma added a comment - this is absolutely broken. i am trying count(1) with hive.map.aggr = true - and there is no map side aggregation happening (even though the explain has a map-side group by operator): Alias -> Map Operator Tree: mm_users_goodip_count Select Operator Group By Operator aggregations: expr: count(1) mode: hash Reduce Output Operator sort order: Map-reduce partition columns: expr: rand() type: double tag: -1 value expressions: expr: 0 type: bigint it seems that the groupbyDesc doe not have a 'keys' field specified (in other map side aggregates - i can see the keys specified). At any rate - the mapper emits one output row for each input row in this case. This is completely broken ..
          Hide
          Namit Jain added a comment -

          marking duplicate of 256

          Show
          Namit Jain added a comment - marking duplicate of 256

            People

            • Assignee:
              Namit Jain
              Reporter:
              David Phillips
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development