Hadoop Common / HADOOP-4139

[Hive] multi group by statement is not optimized

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      A multi-group-by statement is not optimized. For example, a simple statement like:

      FROM SRC
      INSERT OVERWRITE TABLE DEST1 SELECT SRC.key, count(distinct SUBSTR(SRC.value,4)) GROUP BY SRC.key
      INSERT OVERWRITE TABLE DEST2 SELECT SRC.key, count(distinct SUBSTR(SRC.value,4)) GROUP BY SRC.key;

      results in two copies of the data (SRC) being made. Instead, the data can first be partially aggregated on the distinct value and then aggregated again.
      The first step can be shared by all of the group-bys.
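
The two-step idea above can be modeled outside Hive. The following is only an illustrative Python sketch of the dataflow, not the code in the attached patches: the helper names (`substr_from`, `partial_aggregate`, `count_per_key`) and the sample rows are hypothetical, and it assumes `SUBSTR` uses a 1-based start position.

```python
from collections import defaultdict


def substr_from(value, start):
    """Mimic SUBSTR(value, start), assuming a 1-based start position."""
    return value[start - 1:]


def partial_aggregate(rows):
    """Step 1 (shared): reduce SRC to distinct (key, SUBSTR(value, 4)) pairs."""
    return {(key, substr_from(value, 4)) for key, value in rows}


def count_per_key(distinct_pairs):
    """Step 2 (one per destination): count the distinct values seen per key."""
    counts = defaultdict(int)
    for key, _ in distinct_pairs:
        counts[key] += 1
    return dict(counts)


# Hypothetical sample rows standing in for SRC(key, value).
src = [("a", "val_1"), ("a", "val_1"), ("a", "val_2"), ("b", "val_3")]

shared = partial_aggregate(src)   # SRC is scanned once
dest1 = count_per_key(shared)     # feeds DEST1
dest2 = count_per_key(shared)     # feeds DEST2, reusing the same partial result
```

Because step 1 already removes duplicate (key, value) pairs, step 2 is a plain count per key, and both destinations reuse the same intermediate result instead of each reading its own copy of SRC.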

        Attachments

        1. patch1
          28 kB
          Namit Jain
        2. patch3
          29 kB
          Namit Jain
        3. patch4.txt
          30 kB
          Namit Jain

            People

            • Assignee: Namit Jain (namit)
            • Reporter: Namit Jain (namit)
