Hive / HIVE-384

problem in union if the first subquery is a map-only job

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.3.0, 0.4.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      HIVE-384. Fixing UNION ALL when the first job is a map-only job. (Namit Jain via zshao)

      Description

      Union needs special handling when the first subquery is a map-only job.

      explain
      select unionsrc.key, count(1) FROM (select s2.key as key, s2.value as value from src1 s2
      UNION ALL
      select 'tst1' as key, cast(count(1) as string) as value from src s1)
      unionsrc group by unionsrc.key;

      This query results in a null pointer exception.

      1. hive.384.6.patch
        72 kB
        Namit Jain
      2. hive.384.5.patch
        67 kB
        Namit Jain
      3. hive.384.4.patch
        74 kB
        Namit Jain
      4. hive.384.3.patch
        74 kB
        Namit Jain
      5. hive.384.2.patch
        1 kB
        Namit Jain
      6. hive.384.1.patch
        70 kB
        Namit Jain

        Activity

        Namit Jain added a comment -

        Also added a fix for the number of jobs.

        Namit Jain added a comment -

        The patch has a few bug fixes:

        1. The total number of jobs was counted incorrectly, which made debugging very difficult (especially for big union queries).
        2. The job name for a query was incorrect, which also made debugging very difficult (especially for big union queries).

        These two are fixed in Driver.java.

        3. Union plans were very complex and inefficient because the unions were not getting merged. The simple fix is to merge
        them in the SemanticAnalyzer; no check is needed, since the union schema should be the same.

        4. There was no special case for a ReduceSink followed by a Union. The fix is in GenMRRedSink3.

        Added a bunch of new tests.
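
To illustrate item 3, here is a minimal sketch of what merging nested unions could look like. It is not the code from the patch: `UnionMergeSketch`, `Node`, and `mergeUnions` are hypothetical names for a toy plan tree, and Hive's actual SemanticAnalyzer and operator classes are not used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plan node types for illustration only; these are NOT Hive's operator classes.
public class UnionMergeSketch {
    /** A node in a toy query plan: either a union of inputs or a leaf subquery. */
    static final class Node {
        final String name;        // label of a leaf subquery, or "UNION"
        final List<Node> inputs;  // empty for leaves

        Node(String name, List<Node> inputs) {
            this.name = name;
            this.inputs = inputs;
        }

        static Node leaf(String name) {
            return new Node(name, new ArrayList<Node>());
        }

        static Node union(Node... inputs) {
            return new Node("UNION", new ArrayList<Node>(Arrays.asList(inputs)));
        }

        boolean isUnion() {
            return "UNION".equals(name);
        }
    }

    /** Collapses nested unions into a single union over all leaf inputs, preserving order. */
    static Node mergeUnions(Node node) {
        if (!node.isUnion()) {
            return node;
        }
        List<Node> flat = new ArrayList<Node>();
        for (Node in : node.inputs) {
            Node merged = mergeUnions(in);
            if (merged.isUnion()) {
                // No schema check here: the sketch assumes all union branches share one schema.
                flat.addAll(merged.inputs);
            } else {
                flat.add(merged);
            }
        }
        return new Node("UNION", flat);
    }

    public static void main(String[] args) {
        // (a UNION ALL b) UNION ALL c
        Node plan = Node.union(Node.union(Node.leaf("a"), Node.leaf("b")), Node.leaf("c"));
        Node merged = mergeUnions(plan);
        StringBuilder names = new StringBuilder();
        for (Node n : merged.inputs) {
            names.append(n.name);
        }
        System.out.println(merged.inputs.size() + " inputs: " + names); // prints "3 inputs: abc"
    }
}
```

In this toy model a three-way union written as two nested unions becomes one union node with three inputs, which is the kind of simplification the comment describes.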

        Zheng Shao added a comment -

        It seems that the new test results print out the GroupByOperator after ReduceOutputOperator in the "Map Operator Tree", while it is already printed out in the "Reduce Operator Tree:" section.

        +              Reduce Output Operator
        +                key expressions:
        +                      expr: 0
        +                      type: string
        +                sort order: +
        +                Map-reduce partition columns:
        +                      expr: 0
        +                      type: string
        +                tag: -1
        +                value expressions:
        +                      expr: 1
        +                      type: bigint
        +                Group By Operator
        +                  aggregations:
        +                        expr: count(VALUE.0)
        +                  keys:
        +                        expr: KEY.0
        +                        type: string
        +                  mode: mergepartial
        +                  Select Operator
        +                    expressions:
        +                          expr: 0
        +                          type: string
        +                          expr: 1
        +                          type: bigint
        +                    File Output Operator
        +                      compressed: false
        +                      GlobalTableId: 0
        +                      table:
        +                          input format: org.apache.hadoop.mapred.TextInputFormat
        +                          output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
        +        /data/users/njain/hive1/hive/build/ql/tmp/58586108/201879798.10003 
        
        Namit Jain added a comment -

        That is because of map-side aggregation.

        Namit Jain added a comment -

        Zheng pointed out another, unrelated bug in the code for explain. The children of the reduce sink were not set to null for all dependent tasks, so only the explain plans
        of top-level tasks were correct.

        However, the reduce sink operator ignores its children, so there is no runtime bug.
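
The explain issue comes down to visiting every dependent task rather than only the top-level ones. Here is a minimal sketch of such a walk, under stated assumptions: `TaskWalkSketch`, `Task`, `dependents`, and the `reduceSinkHasChildren` flag are hypothetical stand-ins, not Hive's task or operator classes, and this is not the code from the patch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for illustration only; these are NOT Hive's Task or Operator classes.
public class TaskWalkSketch {
    static final class Task {
        final String id;
        final List<Task> dependents = new ArrayList<Task>();
        // Toy flag standing in for "the reduce sink still has child operators attached".
        boolean reduceSinkHasChildren = true;

        Task(String id) {
            this.id = id;
        }
    }

    /** Visits every task reachable from the roots exactly once, clears the flag, and returns the count. */
    static int clearReduceSinkChildren(List<Task> roots) {
        Set<Task> seen = Collections.newSetFromMap(new IdentityHashMap<Task, Boolean>());
        Deque<Task> work = new ArrayDeque<Task>(roots);
        int visited = 0;
        while (!work.isEmpty()) {
            Task t = work.pop();
            if (!seen.add(t)) {
                continue; // a task shared by two parents is handled once
            }
            t.reduceSinkHasChildren = false;
            visited++;
            work.addAll(t.dependents);
        }
        return visited;
    }

    public static void main(String[] args) {
        Task root1 = new Task("Stage-1");
        Task root2 = new Task("Stage-2");
        Task child = new Task("Stage-3");
        root1.dependents.add(child);
        root2.dependents.add(child);
        int n = clearReduceSinkChildren(Arrays.asList(root1, root2));
        // prints "3 tasks visited; child cleared: true"
        System.out.println(n + " tasks visited; child cleared: " + !child.reduceSinkHasChildren);
    }
}
```

Processing only the two roots would leave the shared dependent task untouched, which matches the symptom described: correct plans for top-level tasks only.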

        Namit Jain added a comment -

        Fixed the explain plan bug and updated a bunch of log files.

        Zheng Shao added a comment -

        Thanks Namit.

        Johan Oskarsson added a comment -

        Looks like this patch causes two unit tests to fail:
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union10
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union12

        http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.19/lastBuild/testReport/

        Namit Jain added a comment -

        I will look into this.

        Johan Oskarsson added a comment -

        I created HIVE-397 to track fixing the tests.


          People

          • Assignee: Namit Jain
          • Reporter: Namit Jain