Hadoop Common / HADOOP-4139

[Hive] multi group by statement is not optimized

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 0.19.0
    • None
    • None
    • Reviewed

    Description

      A simple multi-group-by statement is not optimized. For example, a statement like:

      FROM SRC
      INSERT OVERWRITE TABLE DEST1 SELECT SRC.key, count(distinct SUBSTR(SRC.value,4)) GROUP BY SRC.key
      INSERT OVERWRITE TABLE DEST2 SELECT SRC.key, count(distinct SUBSTR(SRC.value,4)) GROUP BY SRC.key;

      results in two copies of the data (SRC) being made. Instead, the data can first be partially aggregated on the distinct value, and that result can then be aggregated for each group by.
      The first step can be shared by all the group bys.
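
      The proposed two-step plan can be illustrated with a small Python sketch. This is not the Hive implementation or the attached patch; the function names (partial_distinct, count_per_key) and the sample rows are hypothetical, and the sketch only assumes that SUBSTR(value, 4) is 1-based, so it maps to value[3:]. The first step removes duplicate (key, substring) pairs once, and each group by then counts over that shared result instead of re-reading SRC.

```python
from collections import defaultdict


def partial_distinct(rows):
    """Step 1 (shared): reduce SRC to unique (key, SUBSTR(value, 4)) pairs."""
    # Hive's SUBSTR is 1-based, so SUBSTR(value, 4) corresponds to value[3:].
    return {(key, value[3:]) for key, value in rows}


def count_per_key(pairs):
    """Step 2 (one per group by): count the distinct values for each key."""
    counts = defaultdict(int)
    for key, _ in pairs:
        counts[key] += 1
    return dict(counts)


# Hypothetical sample rows standing in for the SRC table.
src = [("a", "val_1"), ("a", "val_1"), ("a", "val_2"), ("b", "val_3")]

pairs = partial_distinct(src)   # SRC is scanned once here
dest1 = count_per_key(pairs)    # feeds DEST1
dest2 = count_per_key(pairs)    # feeds DEST2, reusing the same partial result
```

      In this sketch both outputs are computed from the same deduplicated set, which is the sharing the description asks for.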

      Attachments

        1. patch4.txt
          30 kB
          Namit Jain
        2. patch3
          29 kB
          Namit Jain
        3. patch1
          28 kB
          Namit Jain

          People

            namit Namit Jain