Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-1018

pushing down group-by before joins

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      Queries with both Group-by and Joins are very common and they are expensive operations. By default Hive evalutes Joins first then group-by. Sometimes it is possible to rewrite queries to apply group-by (or map-side partial group by) first before join. This will remove a lot of duplicated keys in joins and alleviate skewness in join keys for this case. This rewrite should be cost-based. Before we have the stats and the CB framework, we can give users hints to do the rewrite.

      A particular case is where the join keys are the same as the grouping keys. Or the group keys is a superset of the join keys (so that grouping won't affect the result of joins).

      Examples:

      – Q1
      select A.key, B.key
      from A join B on (A.key=B.key)
      group by A.key, B.key;

      --Q2
      select distinct A.key, B.key
      from A join B on (A.key=B.key);

      --Q3, aggregation function is sum, count, min, max, (avg and median cannot be handled).
      selec A.key, sum(A.value), count(1), min(value), max(value)
      from A left semi join B on (A.key=B.key)
      group by A.key;

      – Q4. grouping keys is a superset of join keys
      select distinct A.key, A.value
      from A join B on (A.key=B.key)

      In the case of join keys are not a subset of grouping keys, we can introduce a map-side partial grouping operator with the keys of the UNION of the join and grouping keys, to remove unnecessary duplications. This should be cost-based though.

      Any thoughts and suggestions?

      Attachments

        Activity

          People

            Unassigned Unassigned
            nzhang Ning Zhang
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: