Hive / HIVE-522

GenericUDAF: Extend UDAF to deal with complex types

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.4.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We can pass arbitrary arguments into GenericUDFs. We should do the same for GenericUDAF so that UDAFs can also take arbitrary arguments, including complex types.

      1. HIVE-522.10.patch
        602 kB
        Zheng Shao
      2. HIVE-522.9.patch
        609 kB
        Zheng Shao
      3. HIVE-522.8.patch
        608 kB
        Zheng Shao
      4. HIVE-522.6.patch
        322 kB
        Zheng Shao
      5. HIVE-522.5.patch
        599 kB
        Zheng Shao
      6. HIVE-522.4.patch
        397 kB
        Zheng Shao
      7. HIVE-522.3.patch
        424 kB
        Zheng Shao
      8. HIVE-522.2.patch
        379 kB
        Zheng Shao
      9. HIVE-522.1.patch
        37 kB
        Zheng Shao

        Activity

        Namit Jain added a comment -

        Committed. Thanks, Zheng.

        Zheng Shao added a comment -

        Forgot to overwrite some of the test results.

        Zheng Shao added a comment -

        Merged with trunk again.

        Some other changes after reviewing with Namit:
        1. Removed the debug message in Driver.java.
        2. Used LazySimpleSerDe for the value part at the map/reduce boundary and for intermediate files.
        3. Removed the PrimitiveConverter classes, which had already been added to the code base in another file.

        Zheng Shao added a comment -

        Merged with trunk again.

        Zheng Shao added a comment -

        Merged with trunk again and fixed all tests.

        Zheng Shao added a comment -

        Overwrote the test results for all failing tests.

        Zheng Shao added a comment -

        Resolved conflicts with recent trunk changes.

        Zheng Shao added a comment -

        This patch merged all recent trunk changes and fixed all test errors.

        Zheng Shao added a comment -

        This patch passes all tests.

        Note that there was a bug in the "count" UDAF that treated empty strings as nulls. This patch also fixes that problem.
        Also, the SerDe for temporary files between different jobs is changed to LazySimpleSerDe, because DynamicSerDe is not capable of understanding sub-structs (GenericUDAFAverage's partial aggregation result is a struct of a "long" count and a "double" sum).
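
        To illustrate why the partial result of an average has to be a struct, here is a minimal Java sketch. It is not the code of GenericUDAFAverage; the names AvgPartial and AvgSketchDemo are hypothetical, and each inner array stands in for the rows seen by one map task. Each task emits a (count, sum) pair, and the pairs are merged before the final division.

```java
// Hypothetical sketch: not Hive's GenericUDAFAverage, only the shape of its partial result.

/** Partial result of average(): a struct of a long count and a double sum. */
final class AvgPartial {
    long count;
    double sum;

    /** Map side: fold one input value into this partial; nulls are skipped. */
    void add(Double value) {
        if (value != null) {
            count++;
            sum += value;
        }
    }

    /** Reduce side: combine another partial into this one. */
    void merge(AvgPartial other) {
        count += other.count;
        sum += other.sum;
    }

    /** Final result; null when no non-null value was seen. */
    Double result() {
        if (count == 0) {
            return null;
        }
        return sum / count;
    }
}

final class AvgSketchDemo {
    /** Each inner array plays the role of the rows seen by one map task. */
    static Double averageOfPartitions(double[][] partitions) {
        AvgPartial total = new AvgPartial();
        for (double[] partition : partitions) {
            AvgPartial partial = new AvgPartial();
            for (double v : partition) {
                partial.add(v);
            }
            total.merge(partial);  // this pair is what crosses the job boundary
        }
        return total.result();
    }

    public static void main(String[] args) {
        System.out.println(averageOfPartitions(new double[][] { {1.0, 2.0}, {3.0} }));
    }
}
```

        Averaging the per-task averages instead (1.5 and 3.0) would give 2.25 rather than 2.0, which is why both fields have to survive serialization between jobs.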

        Zheng Shao added a comment -

        A preliminary patch that includes all new classes. They are not integrated with GroupByOperator yet, but the integration work is pretty straightforward.

        GenericUDAF is more complex than I thought at first. So I've created a bunch of classes for it:

        1. GenericUDAFResolver: takes a function name and the list of parameter TypeInfo and returns a GenericUDAFEvaluator.
        2. GenericUDAFEvaluator: allows 2 things:
        2.1 Create a new aggregation result buffer
        2.2 Update an aggregation result buffer, or terminate the aggregation and get the results.
        3. The aggregation result buffer in step 2 is an interface. Each GenericUDAFEvaluator should have its own aggregation result buffer class to store the data (for example, a count for count(), a count and a sum for average()).

        1 is used at compile time. 2 and 3 are at runtime.

        The reasons I split 2 and 3 are:
        A. It shrinks the aggregation result buffer: only a "long" is needed for count. (The input's ObjectInspector and the output writable Object (e.g. the Long or LongWritable of count()) are both stored in the GenericUDAFEvaluator.)
        B. It makes it easier to move to HIVE-535: A3 in the future.
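
        The split between the three roles above can be sketched in Java as follows. This is a simplified illustration, not the classes in the patch: the names Resolver, Evaluator, AggBuffer, CountEvaluator, SimpleResolver and UdafSketchDemo are hypothetical, parameter types are plain strings instead of TypeInfo, and counting an empty string as a value is an assumption of the sketch.

```java
// Hypothetical, simplified stand-ins for the three roles described above.
// These are NOT the Hive classes; names and signatures are illustrative only.

/** Role 3: marker interface for per-group aggregation state. */
interface AggBuffer {}

/** Role 2: runtime object that creates, updates, and finalizes buffers. */
interface Evaluator {
    AggBuffer newBuffer();
    void update(AggBuffer buffer, Object[] args);
    Object terminate(AggBuffer buffer);
}

/** Role 1: compile-time lookup from (function name, parameter types) to an evaluator. */
interface Resolver {
    Evaluator resolve(String functionName, java.util.List<String> parameterTypes);
}

/** count(): the per-group buffer holds nothing but a single long. */
final class CountEvaluator implements Evaluator {
    static final class CountBuffer implements AggBuffer {
        long count;
    }

    public AggBuffer newBuffer() {
        return new CountBuffer();
    }

    public void update(AggBuffer buffer, Object[] args) {
        // Count the row only when the argument is non-null.
        // An empty string is a value here, not a null, so it is counted.
        if (args[0] != null) {
            ((CountBuffer) buffer).count++;
        }
    }

    public Object terminate(AggBuffer buffer) {
        return ((CountBuffer) buffer).count;  // autoboxed to Long
    }
}

final class SimpleResolver implements Resolver {
    public Evaluator resolve(String functionName, java.util.List<String> parameterTypes) {
        if ("count".equals(functionName) && parameterTypes.size() == 1) {
            return new CountEvaluator();
        }
        throw new IllegalArgumentException("No evaluator for " + functionName + parameterTypes);
    }
}

final class UdafSketchDemo {
    static long countValues(Object... values) {
        // Compile time: pick the evaluator once.
        Evaluator eval = new SimpleResolver().resolve("count", java.util.Arrays.asList("string"));
        // Runtime: one small buffer per group, updated row by row.
        AggBuffer buf = eval.newBuffer();
        for (Object v : values) {
            eval.update(buf, new Object[] { v });
        }
        return (Long) eval.terminate(buf);
    }

    public static void main(String[] args) {
        System.out.println(countValues("a", "", null, "b"));  // prints 3
    }
}
```

        Because the evaluator is shared across groups, anything that does not vary per group can live there, and each group only pays for its own buffer.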


          People

          • Assignee:
            Zheng Shao
            Reporter:
            Zheng Shao
          • Votes:
            1
            Watchers:
            2
