Hive
  1. Hive
  2. HIVE-1738

Optimize Key Comparison in GroupByOperator

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      GroupByOperator uses ObjectInspectorUtils.compare() to compare keys, which is written for generalized object comparisons, which is not optimized for group-by operator. By optimizing this logic, we expect to see obvious improvements in GroupByOperator.

      1. HIVE.1738.3.patch
        12 kB
        Siying Dong
      2. HIVE.1738.2.patch
        11 kB
        Siying Dong
      3. HIVE.1738.1.patch
        11 kB
        Siying Dong

        Activity

        Hide
        Siying Dong added a comment -

        We handle the case for comparing different size because, (for some new codes checked in by Ning, as far as I remember), if map doesn't produce anything for group by, it created a dummy record whose length is 0. Just handle lengh=0 sounds strange, so I made a little bit more general to this way.

        Comparing from last field is definitely a good idea. It might help a lot in reduce side for multiple key group-by when there are lots of records to reduce. I'll open a new JIRA for it.

        Show
        Siying Dong added a comment - We handle the case for comparing different size because, (for some new codes checked in by Ning, as far as I remember), if map doesn't produce anything for group by, it created a dummy record whose length is 0. Just handle lengh=0 sounds strange, so I made a little bit more general to this way. Comparing from last field is definitely a good idea. It might help a lot in reduce side for multiple key group-by when there are lots of records to reduce. I'll open a new JIRA for it.
        Hide
        Namit Jain added a comment -

        Siying, can you file a follow-up jira for the above ?

        Show
        Namit Jain added a comment - Siying, can you file a follow-up jira for the above ?
        Hide
        Zheng Shao added a comment -

        +1. This is smart!

        > public boolean areEqual(ArrayList<Object> ol0, ArrayList<Object> ol1) ...
        Why do we need to care about the case that the 2 array lists are different in size / shorter than numFields?

        > for (int i = 0; i < numFields; i++) {

        We might want to try comparing the last field first. The reason is that in sort-based aggregation, the last key is more likely to be different than the first key. Not sure the effect is big enough to be noticeable though.

        Show
        Zheng Shao added a comment - +1. This is smart! > public boolean areEqual(ArrayList<Object> ol0, ArrayList<Object> ol1) ... Why do we need to care about the case that the 2 array lists are different in size / shorter than numFields? > for (int i = 0; i < numFields; i++) { We might want to try comparing the last field first. The reason is that in sort-based aggregation, the last key is more likely to be different than the first key. Not sure the effect is big enough to be noticeable though.
        Hide
        Namit Jain added a comment -

        Committed. Thanks Siying

        Show
        Namit Jain added a comment - Committed. Thanks Siying
        Hide
        Namit Jain added a comment -

        +1

        Show
        Namit Jain added a comment - +1
        Hide
        Siying Dong added a comment -

        Modify according to Namit's comments.

        Show
        Siying Dong added a comment - Modify according to Namit's comments.
        Hide
        Namit Jain added a comment - - edited
        • Also, for string and test elements, it performs slightly better than

        spelling: (should be Text)

        public ListObjectsEqualComparer(ObjectInspector[] oi0, ObjectInspector[] oi1) {
        assert(oi0.length == oi1.length);

        Instead of asserting, can you throw an error ?

        ListObjectsEqualComparer

        else {
        assert(type0.equals(type1));
        compareType = CompareType.SAME_TYPE;

        Dont assert same type ?
        types can be different - it wont happen for GroupBy

        Otherwise, it looks great

        Show
        Namit Jain added a comment - - edited Also, for string and test elements, it performs slightly better than spelling: (should be Text) public ListObjectsEqualComparer(ObjectInspector[] oi0, ObjectInspector[] oi1) { assert(oi0.length == oi1.length); Instead of asserting, can you throw an error ? ListObjectsEqualComparer else { assert(type0.equals(type1)); compareType = CompareType.SAME_TYPE; Dont assert same type ? types can be different - it wont happen for GroupBy Otherwise, it looks great
        Hide
        Siying Dong added a comment -

        Resolver conflicts to previous check-in

        Show
        Siying Dong added a comment - Resolver conflicts to previous check-in
        Hide
        Namit Jain added a comment -

        Can you regenerate the patch ? I am getting some conflicts

        Show
        Namit Jain added a comment - Can you regenerate the patch ? I am getting some conflicts
        Hide
        Siying Dong added a comment -

        One note: for the query above, input format is SequenceFile, which is not friendly to this kind of query. I convert the input to RCFile and do the same comparison against it, I can see Map's CPU_MILLISECONDS are improved from about 1,050,000 to about 965,000.

        Show
        Siying Dong added a comment - One note: for the query above, input format is SequenceFile, which is not friendly to this kind of query. I convert the input to RCFile and do the same comparison against it, I can see Map's CPU_MILLISECONDS are improved from about 1,050,000 to about 965,000.
        Hide
        Siying Dong added a comment -

        I compare performance for a simple group-by query:

        SELECT col1,col2,col3,col4,count(1)
        FROM source_table
        GROUP BY col1, col2, col3, col4

        which returns about 1000 rows.
        The input has about 736M rows in about 19GB compressed files.The query started 67 Mappers.

        Map's CPU MILLISECONDS are down from about 2,950,000 to about 2,800,000. I ran each query at least 5 times, and the improvement is consistent in every run.

        Show
        Siying Dong added a comment - I compare performance for a simple group-by query: SELECT col1,col2,col3,col4,count(1) FROM source_table GROUP BY col1, col2, col3, col4 which returns about 1000 rows. The input has about 736M rows in about 19GB compressed files.The query started 67 Mappers. Map's CPU MILLISECONDS are down from about 2,950,000 to about 2,800,000. I ran each query at least 5 times, and the improvement is consistent in every run.

          People

          • Assignee:
            Siying Dong
            Reporter:
            Siying Dong
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development