Hive / HIVE-522

GenericUDAF: Extend UDAF to deal with complex types

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.4.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We can pass arbitrary arguments into GenericUDFs. We should do the same for GenericUDAF so that UDAFs can also take arbitrary arguments.

      1. HIVE-522.9.patch
        609 kB
        Zheng Shao
      2. HIVE-522.8.patch
        608 kB
        Zheng Shao
      3. HIVE-522.6.patch
        322 kB
        Zheng Shao
      4. HIVE-522.5.patch
        599 kB
        Zheng Shao
      5. HIVE-522.4.patch
        397 kB
        Zheng Shao
      6. HIVE-522.3.patch
        424 kB
        Zheng Shao
      7. HIVE-522.2.patch
        379 kB
        Zheng Shao
      8. HIVE-522.10.patch
        602 kB
        Zheng Shao
      9. HIVE-522.1.patch
        37 kB
        Zheng Shao

        Activity

        Zheng Shao created issue -
        Zheng Shao made changes -
        Field Original Value New Value
        Status Open [ 1 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Release Note HIVE-523. Fix PartitionPruner not to fetch all partitions at once. (Prasad Chakka via zshao)
        Fix Version/s 0.4.0 [ 12313714 ]
        Resolution Fixed [ 1 ]
        Zheng Shao made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Zheng Shao made changes -
        Hadoop Flags [Reviewed]
        Release Note HIVE-523. Fix PartitionPruner not to fetch all partitions at once. (Prasad Chakka via zshao)
        Zheng Shao added a comment -

        A preliminary patch that includes all the new classes. They are not integrated with GroupByOperator yet, but the integration work is pretty straightforward.

        GenericUDAF is more complex than I thought at first, so I've created several classes for it:

        1. GenericUDAFResolver: takes a function name and the list of parameter TypeInfos and returns a GenericUDAFEvaluator.
        2. GenericUDAFEvaluator: allows 2 things:
        2.1 Create a new aggregation result buffer
        2.2 Update an aggregation result buffer, or terminate the aggregation and get the results.
        3. The aggregation result buffer in step 2 is an interface. Each GenericUDAFEvaluator should have its own aggregation result buffer class to store the data (for example, a count for count(), a count and a sum for average()).

        1 is used at compile time. 2 and 3 are used at runtime.

        The reasons that I split 2 and 3 are:
        A. It shrinks the aggregation result buffer: only a "long" is needed for count. (The input's ObjectInspector and the output writable Object (e.g. Long or LongWritable for count()) are both stored in GenericUDAFEvaluator.)
        B. It makes it easier to move to HIVE-535: A3 in the future.
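
The split described in this comment can be sketched roughly as follows. This is only an illustration, not the code in the attached patch: every name here (`UdafSplitSketch`, `Resolver`, `Evaluator`, `AggregationBuffer`, `CountBuffer`, `SimpleResolver`, `countNonNull`) is hypothetical, parameter types are plain strings instead of TypeInfo, and ObjectInspectors are left out entirely. It shows the resolver picking an evaluator once, while the per-group buffer for count() holds nothing but a long.

```java
import java.util.List;

/** Hypothetical sketch of the resolver / evaluator / buffer split; not Hive's actual API. */
final class UdafSplitSketch {

    /** Marker interface for per-group state (item 3 in the comment). */
    interface AggregationBuffer {}

    /** Runtime object (item 2): creates buffers, updates them, and produces results. */
    interface Evaluator {
        AggregationBuffer newBuffer();
        void iterate(AggregationBuffer buffer, Object[] row);
        Object terminate(AggregationBuffer buffer);
    }

    /** Compile-time lookup (item 1): function name plus parameter types to evaluator. */
    interface Resolver {
        Evaluator resolve(String functionName, List<String> parameterTypes);
    }

    /** count() only needs a single long per group. */
    static final class CountBuffer implements AggregationBuffer {
        long count;
    }

    static final class CountEvaluator implements Evaluator {
        @Override public AggregationBuffer newBuffer() { return new CountBuffer(); }

        @Override public void iterate(AggregationBuffer buffer, Object[] row) {
            // Count the row only when its single argument is non-null.
            if (row[0] != null) {
                ((CountBuffer) buffer).count++;
            }
        }

        @Override public Object terminate(AggregationBuffer buffer) {
            return ((CountBuffer) buffer).count;
        }
    }

    static final class SimpleResolver implements Resolver {
        @Override public Evaluator resolve(String functionName, List<String> parameterTypes) {
            if ("count".equals(functionName) && parameterTypes.size() == 1) {
                return new CountEvaluator();
            }
            throw new IllegalArgumentException("No evaluator for " + functionName + parameterTypes);
        }
    }

    /** Runs count() over a few values, using one evaluator and one small buffer. */
    static long countNonNull(Object... values) {
        Evaluator evaluator = new SimpleResolver().resolve("count", List.of("string"));
        AggregationBuffer buffer = evaluator.newBuffer();
        for (Object value : values) {
            evaluator.iterate(buffer, new Object[] {value});
        }
        return (Long) evaluator.terminate(buffer);
    }

    public static void main(String[] args) {
        System.out.println(countNonNull("a", null, "b"));
    }
}
```

Because the evaluator holds no per-group state, a single evaluator instance could serve many groups, each with its own buffer, which is the size saving point A refers to.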

        Zheng Shao made changes -
        Attachment HIVE-522.1.patch [ 12409731 ]
        Zheng Shao made changes -
        Comment [ Interface for GenericUDAF.

        {code:java}
        import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

        interface GenericUDAF {

          enum Mode {
            /** Partial: from original data to partial aggregation data: iterate() and terminatePartial() will be called */
            PARTIAL,
            /** Merge: from partial aggregation to full aggregation: merge() and terminate() will be called */
            MERGE,
            /** Full: from original data directly to full aggregation: iterate() and terminate() will be called */
            FULL
          };

          /** Initialize the aggregation.
           * @param m The mode of aggregation.
           * @param parameters The ObjectInspector for the parameters:
           * In PARTIAL and FULL mode, the parameters are original data;
           * In MERGE mode, the parameters are just partial aggregations (in that case, the array will always have a single element).
           * @return The ObjectInspector for the return value.
           * In PARTIAL mode, the ObjectInspector for the return value of terminatePartial() call;
           * In MERGE and FULL mode, the ObjectInspector for the return value of terminate() call.
           */
          ObjectInspector init(Mode m, ObjectInspector[] parameters);

          /** Iterate through raw data.
           * @param parameters The objects of parameters.
           */
          void iterate(Object[] parameters);

          /** Get partial aggregation result.
           * @return partial aggregation result.
           */
          Object terminatePartial();

          /** Merge with partial aggregation result.
           * @param partial The partial aggregation result.
           */
          void merge(Object partial);

          /** Get final aggregation result.
           * @return final aggregation result.
           */
          Object terminate();

        }

        {code}

        ]
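
To make the PARTIAL and MERGE modes in the draft interface concrete, here is a rough sketch of how an average() aggregation might pass through them. It is a simplification under stated assumptions: the names `AverageModesSketch`, `Partial`, `Average`, and `averageInTwoStages` are hypothetical, values are plain `Double` objects, and the `init(Mode, ObjectInspector[])` step is omitted.

```java
/** Hypothetical sketch of the PARTIAL then MERGE flow for average(); not Hive's actual classes. */
final class AverageModesSketch {

    /** Partial result passed from the first stage to the second stage. */
    static final class Partial {
        final long count;
        final double sum;

        Partial(long count, double sum) {
            this.count = count;
            this.sum = sum;
        }
    }

    /** One aggregation instance; which methods get called depends on the mode. */
    static final class Average {
        private long count;
        private double sum;

        /** PARTIAL mode: consume one raw value, skipping nulls. */
        void iterate(Double value) {
            if (value != null) {
                count++;
                sum += value;
            }
        }

        /** PARTIAL mode: emit the intermediate (count, sum) pair. */
        Partial terminatePartial() {
            return new Partial(count, sum);
        }

        /** MERGE mode: fold in a partial result produced elsewhere. */
        void merge(Partial partial) {
            count += partial.count;
            sum += partial.sum;
        }

        /** MERGE mode: produce the final value, or null for an empty group. */
        Double terminate() {
            if (count == 0) {
                return null;
            }
            return sum / count;
        }
    }

    /** Simulates two partial aggregations followed by a single merge. */
    static Double averageInTwoStages(Double[] firstSplit, Double[] secondSplit) {
        Average merger = new Average();
        for (Double[] split : new Double[][] {firstSplit, secondSplit}) {
            Average partial = new Average();
            for (Double value : split) {
                partial.iterate(value);
            }
            merger.merge(partial.terminatePartial());
        }
        return merger.terminate();
    }

    public static void main(String[] args) {
        System.out.println(averageInTwoStages(new Double[] {1.0, 2.0}, new Double[] {3.0, null}));
    }
}
```

The partial result has to carry both the count and the sum, since averaging the per-split averages would give the wrong answer when splits differ in size.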
        Zheng Shao made changes -
        Comment [ closed the wrong jira. ]
        Zheng Shao made changes -
        Comment [ Committed. Thanks Prasad. ]
        Zheng Shao added a comment -

        This patch passes all tests.

        Note that there was a bug in the "count" UDAF: it treated empty strings as nulls. This patch also fixes that problem.
        Also, the SerDe for temporary files between different jobs has been changed to LazySimpleSerDe, because DynamicSerDe is not capable of understanding sub-structs (GenericUDAFAverage's partial aggregation result is a struct of a "long" count and a "double" sum).

        Zheng Shao made changes -
        Attachment HIVE-522.2.patch [ 12410108 ]
        Zheng Shao made changes -
        Status Reopened [ 4 ] Patch Available [ 10002 ]
        Zheng Shao added a comment -

        This patch merged all recent trunk changes and fixed all test errors.

        Zheng Shao made changes -
        Attachment HIVE-522.3.patch [ 12412571 ]
        Zheng Shao added a comment -

        Resolved conflicts with recent trunk changes.

        Zheng Shao made changes -
        Attachment HIVE-522.4.patch [ 12412660 ]
        Zheng Shao added a comment -

        Overwrote the test results for all failing tests.

        Zheng Shao made changes -
        Attachment HIVE-522.5.patch [ 12412674 ]
        Zheng Shao added a comment -

        Merged with trunk again and fixed all tests.

        Zheng Shao made changes -
        Attachment HIVE-522.6.patch [ 12413091 ]
        Zheng Shao added a comment -

        Merged with trunk again.

        Zheng Shao made changes -
        Attachment HIVE-522.8.patch [ 12413329 ]
        Zheng Shao added a comment -

        Merged with trunk again.

        Some other changes after reviewing with Namit:
        1. Removed the debug message in Driver.java.
        2. Used LazySimpleSerDe for the value part of the map/reduce boundary and for intermediate files.
        3. Removed the PrimitiveConverter classes, which were already added to the code base in another file.

        Zheng Shao made changes -
        Attachment HIVE-522.9.patch [ 12413501 ]
        Zheng Shao added a comment -

        Forgot to overwrite some of the test results.

        Zheng Shao made changes -
        Attachment HIVE-522.10.patch [ 12413506 ]
        Namit Jain added a comment -

        Committed. Thanks Zheng

        Namit Jain made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Resolution Fixed [ 1 ]
        Carl Steinbach made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Zheng Shao
            Reporter:
            Zheng Shao
          • Votes:
            1
            Watchers:
            2
