Hive / HIVE-4809

ReduceSinkOperator of PTFOperator can have redundant key columns

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.0, 0.12.0, 0.13.0, 0.14.0
    • Fix Version/s: 1.1.0
    • Component/s: PTF-Windowing
    • Labels:
      None

      Description

      For example, take a simple query like this:

      SELECT x.a, x.b, count(x.b) OVER (PARTITION BY x.a) FROM src x;
      

      Its plan is:

      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 is a root stage
      
      STAGE PLANS:
        Stage: Stage-1
          Map Reduce
            Alias -> Map Operator Tree:
              x 
                TableScan
                  alias: x
                  Reduce Output Operator
                    key expressions:
                          expr: a
                          type: int
                          expr: a
                          type: int
                    sort order: ++
                    Map-reduce partition columns:
                          expr: a
                          type: int
                    tag: -1
                    value expressions:
                          expr: a
                          type: int
                          expr: b
                          type: string
            Reduce Operator Tree:
              Extract
                PTF Operator
                  Select Operator
                    expressions:
                          expr: _col0
                          type: int
                          expr: _col1
                          type: string
                          expr: _wcol0
                          type: bigint
                    outputColumnNames: _col0, _col1, _col2
                    File Output Operator
                      compressed: false
                      GlobalTableId: 0
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
      

      The ReduceSinkOperator lists column "a" twice in its key expressions. This redundancy can increase the size of the map output, since the key of each map output record then carries the value of "a" twice.
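      To make the redundancy concrete, the following Python sketch models the key list from the plan above as plain strings and drops repeated expressions while keeping the sort-order string aligned. It is an illustration only, not Hive code: `dedup_keys` is a hypothetical helper, and it assumes key expressions can be compared as text.

```python
def dedup_keys(key_exprs, sort_order):
    """Drop repeated key expressions, keeping the first occurrence.

    key_exprs  -- key expressions in plan order, e.g. ["a", "a"]
    sort_order -- one '+' or '-' per key expression, e.g. "++"
    Hypothetical helper for illustration; not part of Hive.
    """
    if len(key_exprs) != len(sort_order):
        raise ValueError("expected one sort-order character per key expression")
    seen = set()
    kept_exprs = []
    kept_order = []
    for expr, direction in zip(key_exprs, sort_order):
        if expr in seen:
            # A repeat cannot change the ordering: the first occurrence
            # already decides how rows compare on this expression.
            continue
        seen.add(expr)
        kept_exprs.append(expr)
        kept_order.append(direction)
    return kept_exprs, "".join(kept_order)


# Key expressions and sort order taken from the Reduce Output Operator above.
keys, order = dedup_keys(["a", "a"], "++")
print(keys, order)  # ['a'] +
```

      Only the first occurrence of an expression matters for sorting, so dropping the later copy leaves the row order unchanged while shrinking the key.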

          Activity

          Transition                    Time In Source Status   Execution Times   Last Executer      Last Execution Date
          Open -> Patch Available       563d 1h 49m             1                 Navis              17/Jan/15 03:08
          Patch Available -> Resolved   2d 14h 31m              1                 Ashutosh Chauhan   19/Jan/15 17:40
          Brock Noland made changes -
          Fix Version/s 0.15.0 [ 12328723 ] -> 1.1.0 [ 12329363 ]
          Ashutosh Chauhan made changes -
          Affects Version/s 0.14.0 [ 12326450 ]
          Affects Version/s 0.13.0 [ 12324986 ]
          Affects Version/s 0.12.0 [ 12324312 ]
          Ashutosh Chauhan made changes -
          Status Patch Available [ 10002 ] -> Resolved [ 5 ]
          Fix Version/s 0.15.0 [ 12328723 ]
          Resolution Fixed [ 1 ]
          Ashutosh Chauhan added a comment -

          Committed to trunk. Thanks, Navis!

          Ashutosh Chauhan added a comment -

          +1

          Navis added a comment -

          Added RB link

          Navis made changes -
          Remote Link This issue links to "review board (Web Link)" [ 21993 ]
          Ashutosh Chauhan added a comment -

          Can you create an RB for this?

          Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12692892/HIVE-4809.1.patch.txt

          ERROR: -1 due to 13 failed/errored test(s), 7231 tests executed
          Failed tests:

          TestMiniTezCliDriver-script_pipe.q-insert_values_non_partitioned.q-insert_update_delete.q-and-12-more - did not produce a TEST-*.xml file
          TestMiniTezCliDriver-scriptfile1.q-union2.q-vectorized_bucketmapjoin1.q-and-12-more - did not produce a TEST-*.xml file
          TestMiniTezCliDriver-vector_decimal_10_0.q-vector_decimal_trailing.q-lvj_mapjoin.q-and-12-more - did not produce a TEST-*.xml file
          TestMiniTezCliDriver-vector_partitioned_date_time.q-vector_non_string_partition.q-tez_dml.q-and-12-more - did not produce a TEST-*.xml file
          TestMinimrCliDriver-infer_bucket_sort_map_operators.q-join1.q-bucketmapjoin7.q-and-1-more - did not produce a TEST-*.xml file
          TestMinimrCliDriver-infer_bucket_sort_num_buckets.q-disable_merge_for_bucketing.q-uber_reduce.q-and-1-more - did not produce a TEST-*.xml file
          TestMinimrCliDriver-leftsemijoin_mr.q-bucket5.q-root_dir_external_table.q-and-1-more - did not produce a TEST-*.xml file
          TestNegativeMinimrCliDriver-mapreduce_stack_trace_hadoop20.q - did not produce a TEST-*.xml file
          TestNegativeMinimrCliDriver-udf_local_resource.q-mapreduce_stack_trace_turnoff_hadoop20.q-mapreduce_stack_trace.q-and-5-more - did not produce a TEST-*.xml file
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric
          org.apache.hadoop.hive.ql.TestMTQueries.testMTQueries1
          org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler
          org.apache.hive.hcatalog.streaming.TestStreaming.testTransactionBatchCommit_Json
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2409/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2409/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2409/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 13 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12692892 - PreCommit-HIVE-TRUNK-Build

          Navis made changes -
          Status Open [ 1 ] -> Patch Available [ 10002 ]
          Assignee Yin Huai [ yhuai ] -> Navis [ navis ]
          Navis made changes -
          Attachment HIVE-4809.1.patch.txt [ 12692892 ]
          Navis added a comment -

          Mostly done by HIVE-4867.

          Ashutosh Chauhan made changes -
          Affects Version/s 0.11.0 [ 12323587 ]
          Ashutosh Chauhan made changes -
          Component/s PTF-Windowing [ 12320378 ]
          Yin Huai added a comment -

          For an OVER clause, we can have partitioning columns (specified by PARTITION BY) and ordering columns (specified by ORDER BY). In the current implementation, the key columns of the ReduceSinkOperator (RS) take care of both grouping (for the partitioning columns) and ordering (for the ordering columns). So we first add all partitioning columns and then all ordering columns to the key columns of the RS. If no ordering columns are specified, the partitioning columns are used as the ordering columns. It seems we cannot completely remove these duplicate key columns right now, because the key columns of the RS need to take care of both grouping and ordering. But we can optimize certain cases. For example, if ordering columns are not specified, we need not assign the partitioning columns to the ordering columns.
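          The key-building rule described in this comment can be modelled in a few lines of Python. This is a sketch of the behaviour as described, not the Hive implementation; `build_rs_keys` and its `reuse_partition_as_order` flag are hypothetical names introduced for illustration.

```python
def build_rs_keys(partition_cols, order_cols, reuse_partition_as_order=True):
    """Return the modelled ReduceSink key columns for one OVER clause.

    Partitioning columns come first, followed by ordering columns.
    With reuse_partition_as_order=True (the behaviour described in the
    comment), a missing ORDER BY falls back to the partitioning columns,
    which duplicates them in the key list.
    Hypothetical model for illustration; not part of Hive.
    """
    if not order_cols and reuse_partition_as_order:
        order_cols = list(partition_cols)
    return list(partition_cols) + list(order_cols)


# count(x.b) OVER (PARTITION BY x.a): no ORDER BY is given.
print(build_rs_keys(["a"], []))                                  # ['a', 'a']

# Proposed optimization: do not copy partitioning columns into ordering.
print(build_rs_keys(["a"], [], reuse_partition_as_order=False))  # ['a']

# With an explicit ORDER BY there is nothing to fall back on.
print(build_rs_keys(["a"], ["b"]))                               # ['a', 'b']
```

          The first call reproduces the duplicated "a" seen in the plan from the description, and the second shows the case the comment proposes to optimize.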

          Yin Huai made changes -
          Field Original Value New Value
          Assignee Yin Huai [ yhuai ]
          Yin Huai created issue -

            People

            • Assignee: Navis
            • Reporter: Yin Huai
            • Votes: 0
            • Watchers: 3
