HIVE-4827

Merge a Map-only task to its child task

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.0
    • Fix Version/s: 0.12.0
    • Component/s: Query Processor
    • Labels:
      None
    • Release Note:
      Before this change was applied to trunk, CommonJoinTaskDispatcher had two methods, mergeMapJoinTaskWithChildMapJoinTask and mergeMapJoinTaskWithMapReduceTask. The first tried to merge a map-only task (for a MapJoin) into its child map-only task. The second tried to merge a map-only task into its child MapReduce task (a task that has a reducer). The flag "hive.optimize.mapjoin.mapreduce" determined whether mergeMapJoinTaskWithMapReduceTask was called.

      This work combines mergeMapJoinTaskWithChildMapJoinTask and mergeMapJoinTaskWithMapReduceTask, so a map-only task is merged into its child task regardless of whether the child is a map-only task or a MapReduce task. As a result, hive.optimize.mapjoin.mapreduce is no longer needed.

      To disable merging a map-only task into its child task, use either
      set hive.auto.convert.join.noconditionaltask=false;
      or
      set hive.auto.convert.join.noconditionaltask=true;
      set hive.auto.convert.join.noconditionaltask.size=0;
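
      One way to observe the effect of these settings is to compare the plans that EXPLAIN prints with merging enabled and disabled. The following is only a sketch: the tables bigTable1 and smallTable1 are hypothetical, smallTable1 is assumed to be small enough to qualify for MapJoin conversion, and the stages actually reported depend on the Hive version and table statistics, so no output is shown.

```sql
-- Hypothetical tables: bigTable1 (large) and smallTable1 (assumed small enough for a MapJoin).
set hive.auto.convert.join=true;

-- Merging enabled: the map-only MapJoin task may be merged into the
-- Map side of the MapReduce task that performs the GROUP BY.
set hive.auto.convert.join.noconditionaltask=true;
EXPLAIN
SELECT x1.key1 AS key, count(*)
FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
GROUP BY x1.key1;

-- Merging disabled: the same query, planned without merging the
-- map-only task into its child task.
set hive.auto.convert.join.noconditionaltask=false;
EXPLAIN
SELECT x1.key1 AS key, count(*)
FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
GROUP BY x1.key1;
```

      Comparing the STAGE DEPENDENCIES sections of the two outputs should show whether the MapJoin runs as a separate stage.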

      Description

      When hive.optimize.mapjoin.mapreduce is on, CommonJoinResolver can attach a Map-only job (MapJoin) to the MapReduce job that follows it. However, this merge only happens when the MapReduce job has a single input. With the Correlation Optimizer (HIVE-2206), the MapReduce job can have multiple inputs (for multiple operation paths). CommonJoinResolver should be improved to merge each Map-only job into the corresponding Map task of the MapReduce job.

      Example:

      set hive.optimize.correlation=true;
      set hive.auto.convert.join=true;
      set hive.optimize.mapjoin.mapreduce=true;
      SELECT tmp1.key, count(*)
      FROM (SELECT x1.key1 AS key
            FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
            GROUP BY x1.key1) tmp1
      JOIN (SELECT x2.key2 AS key
            FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key2 = y2.key2)
            GROUP BY x2.key2) tmp2
      ON (tmp1.key = tmp2.key)
      GROUP BY tmp1.key;
      

      In this query, the join operations inside tmp1 and tmp2 will be converted to two MapJoins. With the Correlation Optimizer, the aggregations in tmp1 and tmp2, the join of tmp1 and tmp2, and the final aggregation will all be executed in the same MapReduce job (on the Reduce side). Because this MapReduce job has two inputs, CommonJoinResolver currently cannot attach the two MapJoins to the Map side of the MapReduce job.

      Another example:

      SELECT tmp1.key
      FROM (SELECT x1.key2 AS key
            FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
            UNION ALL
            SELECT x2.key2 AS key
            FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key1 = y2.key1)) tmp1;
      

      In this case, we will have three Map-only jobs (two for the MapJoins and one for the Union). It would be better to execute this query with a single Map-only job.
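
      To check how many jobs Hive plans for the UNION ALL query, the statement can be prefixed with EXPLAIN and the STAGE DEPENDENCIES section of the output inspected. This is a sketch under the assumption that the four tables exist and that smallTable1 and smallTable2 qualify for MapJoin conversion; the number of stages reported will vary with the Hive version and whether this improvement is present.

```sql
-- Assumes bigTable1, smallTable1, bigTable2 and smallTable2 exist,
-- and that both small tables are small enough for MapJoin conversion.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;

-- Print the plan instead of running the query; each MapReduce or
-- map-only job appears as a separate stage in the output.
EXPLAIN
SELECT tmp1.key
FROM (SELECT x1.key2 AS key
      FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
      UNION ALL
      SELECT x2.key2 AS key
      FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key1 = y2.key1)) tmp1;
```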

      1. HIVE-4827.8.patch
        505 kB
        Yin Huai
      2. HIVE-4827.7.patch
        505 kB
        Yin Huai
      3. HIVE-4827.6.patch
        333 kB
        Yin Huai
      4. HIVE-4827.5.patch
        333 kB
        Yin Huai
      5. HIVE-4827.4.patch
        336 kB
        Yin Huai
      6. HIVE-4827.3.patch
        214 kB
        Yin Huai
      7. HIVE-4827.2.patch
        167 kB
        Yin Huai
      8. HIVE-4827.1.patch
        109 kB
        Yin Huai


            People

            • Assignee:
              Yin Huai
              Reporter:
              Yin Huai
            • Votes:
              0
              Watchers:
              7
