HIVE-1772

optimize join followed by a groupby

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not a Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels:
      None

      Description

      explain SELECT x.key, count(1) FROM src1 x JOIN src y ON (x.key = y.key) group by x.key;

      STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-2 depends on stages: Stage-1
      Stage-0 is a root stage

      The above query issues 2 map-reduce jobs.
      The first MR job performs the join, whereas the second MR job performs the group by.
      Since the data reaching the join's reducer is already sorted on x.key, which is also the group-by key, the group by can be performed in the reducer of the join itself.
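
To make the idea concrete, here is a minimal sketch in plain Python (not Hive code) of a single map-reduce pass that joins and counts in the same reducer. The function names, the "x"/"y" tags, and the in-memory sort standing in for the shuffle are all assumptions made for illustration; Hive's actual operators work differently in detail.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(src1_rows, src_rows):
    """Tag each row with its source table and emit (join_key, tag)."""
    for key, _value in src1_rows:
        yield (key, "x")
    for key, _value in src_rows:
        yield (key, "y")


def reduce_phase(tagged_rows):
    """Join and aggregate in one pass over key-sorted input.

    The join key and the group-by key are the same column, so all rows
    of a group arrive in one contiguous run. count(1) for a key is then
    (number of x rows) * (number of y rows).
    """
    # The sort stands in for the shuffle that precedes the reducer.
    shuffled = sorted(tagged_rows, key=itemgetter(0))
    for key, run in groupby(shuffled, key=itemgetter(0)):
        tags = [tag for _key, tag in run]
        joined = tags.count("x") * tags.count("y")
        if joined:  # inner join: one-sided keys produce no output rows
            yield (key, joined)


def join_then_count(src1_rows, src_rows):
    """Hypothetical driver: one map phase, one reduce phase, no second job."""
    return list(reduce_phase(map_phase(src1_rows, src_rows)))
```

The sketch only works because the group-by column equals the join column; a group by on any other column would need the data to be re-partitioned, which is what the second job does.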


          Activity

          Yin Huai added a comment -

          With HIVE-2206, the cases mentioned in this JIRA will be optimized by the Correlation Optimizer. Please check correlationoptimizer1.q through correlationoptimizer14.q for the unit test queries of HIVE-2206. correlationoptimizer1.q is one of the most relevant files.

          I am closing this JIRA as "Not A Problem".

          Navis added a comment -

          @Radhika Malik
          I thought YSMART (HIVE-2206) would be merged shortly, so I abandoned this. But you can continue if you want.

          Radhika Malik added a comment -

          A group of us is trying to do this for a class project. We want to parallelize the process of JOIN followed by GROUP BY as follows:
          The Map job is the same: it takes in two TableScanOperators (plus any FilterOperators) and two ReduceSinkOperators.
          The Reduce job, while computing the join in the JoinOperator, also groups the results and performs any aggregates. It then pushes the results directly to a FileSinkOperator without a separate GroupByOperator.

          Does anyone have suggestions on where we can get started in the code? Looking at Hive's architecture overview, it seems we want to make changes to the Query Plan Generator in the compiler to generate different map-reduce tasks for queries that include Join followed by Group By. We are thinking of beginning with trying to modify src/ql/src/java/org/apache/hadoop/hive/ql/QueryPlan.java but weren't sure if this was the right approach. Any input on how you think we should approach this would be great!

          Navis added a comment -

          Initial patch; depends on HIVE-2339.


            People

            • Assignee: Unassigned
            • Reporter: Namit Jain
            • Votes: 0
            • Watchers: 6
