Hive / HIVE-4867

Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.14.0
    • Component/s: None
    • Labels:
      None

      Description

      A ReduceSinkOperator emits data as keys and values. Right now, a column may appear in both the key list and the value list, which results in unnecessary overhead during the shuffle.
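      The idea can be sketched in a few lines. This is a minimal illustration, not the code from the attached patches: the function names `dedup_reduce_sink` and `rebuild_row` and the `("KEY", i)` / `("VALUE", i)` source tags are hypothetical, and columns are modeled as plain strings rather than Hive expression descriptors.

```python
from typing import List, Tuple

Source = Tuple[str, int]  # ("KEY", index) or ("VALUE", index); hypothetical tagging


def dedup_reduce_sink(key_cols: List[str], value_cols: List[str]):
    """Drop value columns that already appear in the key list.

    Returns the value columns that still have to be shuffled, plus a mapping
    that tells the reduce side where to read each original value column from.
    """
    # Position of each key column (first occurrence wins).
    key_pos = {}
    for i, col in enumerate(key_cols):
        key_pos.setdefault(col, i)

    kept_values: List[str] = []
    sources: List[Source] = []
    for col in value_cols:
        if col in key_pos:
            # Already shuffled as part of the key: do not emit it again.
            sources.append(("KEY", key_pos[col]))
        else:
            sources.append(("VALUE", len(kept_values)))
            kept_values.append(col)
    return kept_values, sources


def rebuild_row(key_row: list, value_row: list, sources: List[Source]) -> list:
    """Reassemble the original value row on the reduce side."""
    return [key_row[i] if side == "KEY" else value_row[i] for side, i in sources]
```

      For the query in the example, `dedup_reduce_sink(["_col0"], ["_col0"])` leaves no value columns to shuffle, and the reduce side reads `_col0` back from the key.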

      Example:
      Consider the query shown below ...

      explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
      

      The resulting plan is ...

      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 is a root stage
      
      STAGE PLANS:
        Stage: Stage-1
          Map Reduce
            Alias -> Map Operator Tree:
              store_sales 
                TableScan
                  alias: store_sales
                  Select Operator
                    expressions:
                          expr: ss_ticket_number
                          type: int
                    outputColumnNames: _col0
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: int
                      sort order: +
                      Map-reduce partition columns:
                            expr: _col0
                            type: int
                      tag: -1
                      value expressions:
                            expr: _col0
                            type: int
            Reduce Operator Tree:
              Extract
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
      
      

      The column 'ss_ticket_number' appears in both the key list and the value list of the ReduceSinkOperator. Its type is int. In this case, BinarySortableSerDe adds 1 extra byte for every int in the key, and LazyBinarySerDe adds further overhead when recording the length of an int in the value. As a rough estimate, about 10 bytes are emitted from the Map phase for every int.
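      The rough 10-byte figure can be reproduced with a back-of-the-envelope model. The sketch below rests on assumptions that are not stated in this issue: that the key side costs 1 marker byte plus a 4-byte fixed-width int, and that the value side costs 1 null-indicator byte plus a Hadoop-style variable-length int (1 to 5 bytes). The function names are hypothetical and the model ignores any per-record framing.

```python
def key_int_size() -> int:
    # Assumption: 1 marker byte + 4-byte fixed-width int in the sortable key.
    return 1 + 4


def vint_size(value: int) -> int:
    # Assumption: Hadoop-style variable-length int encoding.
    # Small values fit in a single byte; otherwise one length byte
    # precedes the significant bytes of the magnitude.
    if -112 <= value <= 127:
        return 1
    if value < 0:
        value = ~value
    return (value.bit_length() + 7) // 8 + 1


def value_int_size(value: int) -> int:
    # Assumption: 1 null-indicator byte for the row + the variable-length int.
    return 1 + vint_size(value)


def bytes_per_int(value: int, deduplicated: bool = False) -> int:
    """Estimated bytes emitted by the Map phase for one int column."""
    if deduplicated:
        # The column travels only in the key.
        return key_int_size()
    return key_int_size() + value_int_size(value)
```

      Under these assumptions a ticket number such as 100000 costs 5 bytes in the key and 5 bytes in the value, and dropping the value copy would cut the estimate roughly in half.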

        Attachments

        1. HIVE-4867.1.patch.txt
          2.01 MB
          Navis
        2. HIVE-4867.2.patch.txt
          2.84 MB
          Navis
        3. HIVE-4867.3.patch.txt
          2.80 MB
          Navis
        4. HIVE-4867.4.patch.txt
          2.96 MB
          Navis
        5. HIVE-4867.5.patch.txt
          3.01 MB
          Navis
        6. source_only.txt
          74 kB
          Navis


              People

              • Assignee: Navis (navis)
              • Reporter: Yin Huai (yhuai)