Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-2206

add a new optimizer for query correlation discovery and optimization

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.0
    • Fix Version/s: 0.12.0
    • Component/s: Query Processor
    • Labels:
      None
    • Release Note:
      This optimizer exploits the intra-query correlations and merge multiple correlated MapReduce jobs into one jobs.

      Description

      This issue proposes a new logical optimizer called Correlation Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.

      Since Hive translates queries in a sentence by sentence fashion, for every operation which may need to shuffle the data (e.g. join and aggregation operations), Hive will generate a MapReduce job for that operation. However, for those operations which may need to shuffle the data, they may involve correlations explained below and thus can be executed in a single MR job.

      1. Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
      2. Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
      3. Job Flow Correlation: An MR has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.

      The current implementation of correlation optimizer only detect correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map only aggregation). A query will be optimized if it satisfies following conditions.

      1. There exists a MR job for reduce-side join operator or reduce side aggregation operator which have JFC with all of its parents MR jobs (TCs will be also exploited if JFC exists);
      2. All input tables of those correlated MR job are original input tables (not intermediate tables generated by sub-queries); and
      3. No self join is involved in those correlated MR jobs.

      Correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and it can leverage the existing component on generating MR jobs.

      Current implementation can serve as a framework for correlation related optimizations. I think that it is better than adding individual optimizers.

      There are several work that can be done in future to improve this optimizer. Here are three examples.

      1. Support queries only involve TC;
      2. Support queries in which input tables of correlated MR jobs involves intermediate tables; and
      3. Optimize queries involving self join.

      References:
      Paper and presentation of YSmart.
      Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
      Slides: http://sdrv.ms/UpwJJc

        Attachments

        1. HIVE-2206.D11097.22.patch
          5 kB
          Phabricator
        2. HIVE-2206.D11097.21.patch
          5 kB
          Phabricator
        3. HIVE-2206.D11097.20.patch
          46 kB
          Phabricator
        4. HIVE-2206.patch
          1.20 MB
          Yin Huai
        5. HIVE-2206.D11097.19.patch
          1.20 MB
          Phabricator
        6. HIVE-2206.D11097.18.patch
          1.20 MB
          Phabricator
        7. HIVE-2206.D11097.17.patch
          1.19 MB
          Phabricator
        8. HIVE-2206.D11097.16.patch
          1.20 MB
          Phabricator
        9. HIVE-2206.D11097.15.patch
          1.10 MB
          Phabricator
        10. HIVE-2206.D11097.14.patch
          1.09 MB
          Phabricator
        11. HIVE-2206.D11097.13.patch
          1.03 MB
          Phabricator
        12. HIVE-2206.D11097.12.patch
          1003 kB
          Phabricator
        13. HIVE-2206.D11097.11.patch
          7 kB
          Phabricator
        14. HIVE-2206.D11097.10.patch
          980 kB
          Phabricator
        15. HIVE-2206.D11097.9.patch
          923 kB
          Phabricator
        16. HIVE-2206.D11097.8.patch
          868 kB
          Phabricator
        17. HIVE-2206.D11097.7.patch
          725 kB
          Phabricator
        18. HIVE-2206.D11097.6.patch
          698 kB
          Phabricator
        19. HIVE-2206.D11097.5.patch
          582 kB
          Phabricator
        20. HIVE-2206.D11097.4.patch
          581 kB
          Phabricator
        21. HIVE-2206.D11097.3.patch
          581 kB
          Phabricator
        22. HIVE-2206.D11097.2.patch
          586 kB
          Phabricator
        23. HIVE-2206.D11097.1.patch
          756 kB
          Phabricator
        24. HIVE-2206.20-r1434012.patch.txt
          512 kB
          Yin Huai
        25. HIVE-2206.19-r1410581.patch.txt
          508 kB
          Yin Huai
        26. HIVE-2206.18-r1407720.patch.txt
          491 kB
          Yin Huai
        27. HIVE-2206.17-r1404933.patch.txt
          491 kB
          Yin Huai
        28. HIVE-2206.16-r1399936.patch.txt
          492 kB
          Yin Huai
        29. HIVE-2206.15-r1392491.patch.txt
          492 kB
          Yin Huai
        30. HIVE-2206.14-r1389704.patch.txt
          499 kB
          Yin Huai
        31. HIVE-2206.13-r1389072.patch.txt
          500 kB
          Yin Huai
        32. HIVE-2206.12-r1386996.patch.txt
          308 kB
          Yin Huai
        33. HIVE-2206.11-r1385084.patch.txt
          250 kB
          Yin Huai
        34. HIVE-2206.10-r1384442.patch.txt
          341 kB
          Yin Huai
        35. HIVE-2206.8-r1237253.patch.txt
          225 kB
          Yin Huai
        36. testQueries.2.q
          5 kB
          Yin Huai
        37. HIVE-2206.8.r1224646.patch.txt
          219 kB
          Yin Huai
        38. HIVE-2206.7.patch.txt
          221 kB
          Yin Huai
        39. HIVE-2206.6.patch.txt
          156 kB
          Yin Huai
        40. HIVE-2206.5-1.patch.txt
          209 kB
          Yin Huai
        41. HIVE-2206.5.patch.txt
          255 kB
          Yin Huai
        42. HIVE-2206.4.patch.txt
          190 kB
          Yin Huai
        43. HIVE-2206.3.patch.txt
          190 kB
          Yin Huai
        44. HIVE-2206.2.patch.txt
          190 kB
          Yin Huai
        45. HIVE-2206.1.patch.txt
          190 kB
          Yin Huai
        46. YSmartPatchForHive.patch
          251 kB
          He Yongqiang

          Issue Links

            Activity

              People

              • Assignee:
                yhuai Yin Huai
                Reporter:
                he yongqiang He Yongqiang
              • Votes:
                0 Vote for this issue
                Watchers:
                41 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: