Hive / HIVE-4867

Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.14.0
    • Component/s: None
    • Labels: None

    Description

      A ReduceSinkOperator emits data as keys and values. Right now, a column may appear in both the key list and the value list, which results in unnecessary overhead during the shuffle.
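
The idea can be pictured with a minimal sketch. This is not Hive's implementation or the attached patch; the function names `dedupe_reduce_sink` and `rebuild_values` are hypothetical, and columns are modeled as plain strings. It assumes that a value column which also appears in the key list can be dropped from the value list and read back from the key on the reduce side.

```python
def dedupe_reduce_sink(key_cols, value_cols):
    """Drop value columns that already appear in the key list.

    Returns (kept_values, sources). `sources` has one entry per original
    value column: ("KEY", i) if the column can be read from key position i,
    or ("VALUE", j) if it stays in the value list at position j.
    """
    # Remember the first key position of each column name.
    key_pos = {}
    for i, col in enumerate(key_cols):
        key_pos.setdefault(col, i)

    kept_values = []
    sources = []
    for col in value_cols:
        if col in key_pos:
            sources.append(("KEY", key_pos[col]))
        else:
            sources.append(("VALUE", len(kept_values)))
            kept_values.append(col)
    return kept_values, sources


def rebuild_values(key_row, value_row, sources):
    """Reconstruct the original value row on the reduce side."""
    return [key_row[i] if src == "KEY" else value_row[i] for src, i in sources]


# A reduce sink where _col0 is both the only key and the only value.
kept, sources = dedupe_reduce_sink(["_col0"], ["_col0"])
print(kept)                               # no value columns left to shuffle
print(rebuild_values([42], [], sources))  # value row rebuilt from the key
```

In this model the value list becomes empty when every value column is also a key, and the mapping in `sources` is what lets the reducer see the same row as before.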

      Example:
      Consider the query below:

      explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
      

      The plan is:

      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 is a root stage
      
      STAGE PLANS:
        Stage: Stage-1
          Map Reduce
            Alias -> Map Operator Tree:
              store_sales 
                TableScan
                  alias: store_sales
                  Select Operator
                    expressions:
                          expr: ss_ticket_number
                          type: int
                    outputColumnNames: _col0
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: int
                      sort order: +
                      Map-reduce partition columns:
                            expr: _col0
                            type: int
                      tag: -1
                      value expressions:
                            expr: _col0
                            type: int
            Reduce Operator Tree:
              Extract
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
      
      

      The column 'ss_ticket_number' appears in both the key list and the value list of the ReduceSinkOperator. Its type is int. In this case, BinarySortableSerDe adds 1 extra byte for every int in the key, and LazyBinarySerDe adds further overhead to record the length of an int in the value. As a rough estimate, about 10 bytes are emitted from the Map phase for every int.
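
The arithmetic behind that estimate can be sketched as follows. The numbers are taken from the description (4-byte int, 1 extra byte in the key, roughly 10 bytes in total); the function name `shuffle_estimate` is hypothetical, and the sketch assumes that a deduplicated value contributes no bytes at all, ignoring any record framing, so treat the result as a rough upper bound on the saving rather than a measurement.

```python
INT_PAYLOAD_BYTES = 4   # size of an int
KEY_MARKER_BYTES = 1    # extra byte per int key, as stated in the description
ROUGH_TOTAL_BYTES = 10  # rough per-int estimate from the description


def shuffle_estimate(rows, dedupe=False):
    """Rough number of bytes emitted by the Map phase for `rows` rows."""
    key_bytes = INT_PAYLOAD_BYTES + KEY_MARKER_BYTES
    # Assumption: whatever is not key bytes is value bytes, and the
    # value disappears entirely once the column is deduplicated.
    value_bytes = 0 if dedupe else ROUGH_TOTAL_BYTES - key_bytes
    return rows * (key_bytes + value_bytes)


rows = 1_000_000
print(shuffle_estimate(rows))               # estimate with the duplicated column
print(shuffle_estimate(rows, dedupe=True))  # estimate with the value dropped
```

Under these assumptions, dropping the duplicated value column would roughly halve the shuffled data for this query.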

      Attachments

        1. source_only.txt
          74 kB
          Navis Ryu
        2. HIVE-4867.5.patch.txt
          3.01 MB
          Navis Ryu
        3. HIVE-4867.4.patch.txt
          2.96 MB
          Navis Ryu
        4. HIVE-4867.3.patch.txt
          2.80 MB
          Navis Ryu
        5. HIVE-4867.2.patch.txt
          2.84 MB
          Navis Ryu
        6. HIVE-4867.1.patch.txt
          2.01 MB
          Navis Ryu

            People

              Assignee: Navis Ryu (navis)
              Reporter: Yin Huai (yhuai)