Hive / HIVE-7661

Observed severe performance issues when sorting with Hive's parallel ORDER BY clause on data that retains a pre-existing sort order.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.12.0
    • Fix Version/s: 0.12.1
    • Component/s: Logical Optimizer
    • Labels:
    • Environment:

      Cloudera 5.0
      hive-0.12.0-cdh5.0.0
      Red Hat Linux

    • Tags:
      Parallel Order By

      Description

      Improve Hive's sampling logic to accommodate use cases that require retaining the pre-existing sort order of the underlying source table.

      To support a parallel ORDER BY clause, Hive samples the source table based on the values of hive.optimize.sampling.orderby.number and hive.optimize.sampling.orderby.percent.
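      For context, a sampling-based parallel ORDER BY run is typically configured with settings along these lines (a sketch; the values here are illustrative, not recommendations):

```sql
-- Illustrative sketch: enable sampling-based parallel ORDER BY.
set hive.optimize.sampling.orderby=true;
set hive.optimize.sampling.orderby.number=10000;  -- max number of rows to sample
set hive.optimize.sampling.orderby.percent=0.1;   -- probability of picking each row
set mapred.reduce.tasks=100;                      -- reducers used for the sort
```

      Hive uses the sampled keys to compute reducer boundaries for a total-order partition of the sort keys, which is why the quality of the sample determines how evenly work is spread across reducers.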

      This works with reasonable performance when sorting on columns with a random distribution of data, but it suffers severe performance issues when the data retains a pre-existing sort order.

      Let us try to understand this with an example.

      insert overwrite table lineitem_temp_report
      select
        l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment
      from lineitem
      order by l_orderkey, l_partkey, l_suppkey;

      Sample data set for the lineitem table. The first column is l_orderkey, and the table is sorted on it.

      l_orderkey|l_partkey|l_suppkey|l_linenumber|l_quantity|l_extendedprice|l_discount|l_tax|l_returnflag|l_linestatus|l_shipdate|l_commitdate|l_receiptdate|l_shipinstruct|l_shipmode|l_comment
      197|1771022|96040|2|8|8743.52|0.09|0.02|A|F|1995-04-17|1995-07-01|1995-04-27|DELIVER IN PERSON|SHIP|y blithely even deposits. blithely fina|
      197|1558290|83306|3|17|22919.74|0.06|0.02|N|O|1995-08-02|1995-06-23|1995-08-03|COLLECT COD|REG AIR|ts. careful|
      197|179355|29358|4|25|35858.75|0.04|0.01|N|F|1995-06-13|1995-05-23|1995-06-24|TAKE BACK RETURN|FOB|s-- quickly final accounts|
      197|414653|39658|5|14|21946.82|0.09|0.01|R|F|1995-05-08|1995-05-24|1995-05-12|TAKE BACK RETURN|RAIL|use slyly slyly silent depo|
      197|1058800|8821|6|1|1758.75|0.07|0.05|N|O|1995-07-15|1995-06-21|1995-08-11|COLLECT COD|RAIL| even, thin dependencies sno|
      198|560609|60610|1|33|55096.14|0.07|0.02|N|O|1998-01-05|1998-03-20|1998-01-10|TAKE BACK RETURN|TRUCK|carefully caref|
      198|152287|77289|2|20|26785.60|0.03|0.00|N|O|1998-01-15|1998-03-31|1998-01-25|DELIVER IN PERSON|FOB|carefully final escapades a|
      224|1899665|74720|3|41|68247.37|0.07|0.04|A|F|1994-09-01|1994-09-15|1994-09-02|TAKE BACK RETURN|SHIP|after the furiou|

      When we sort on a presorted column, or perform a multi-column sort that retains the sort order of the source table, the data is not distributed evenly across the reducers.

      Source table "lineitem" has 600 million rows.

      Out of 100 reducers, 99 complete in less than 40 seconds; the last reducer does the bulk of the work, processing nearly 570 million rows.

      So, let us understand what is going wrong here.

      On the table of 600 million records, sorted on the l_orderkey column, I created a temp table with 10% sampling:

      insert overwrite table sampTempTbl (select * from lineitem tablesample (10 percent) t);

      select min(l_orderkey), max(l_orderkey) from sampTempTbl ;
      12306309, 142321700

      whereas on the source table, the orderkey range (select min(l_orderkey), max(l_orderkey) from lineitem) is 1 to 600000000.

      So naturally the bulk of the records (everything above the sample's maximum key) is directed to a single reducer.
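      The skew can be sketched with a toy model (hypothetical numbers, not Hive's actual partitioner code: 600 keys stand in for the 600 million rows, and the sample covers only the low end of the sorted key range, as the min/max above shows):

```python
import bisect

# Toy model: 600 sorted keys stand in for 600M rows; the sample is a
# prefix of the key range, mimicking what sampling a sorted table returns.
keys = list(range(1, 601))
sample = keys[:150]          # sample sees only keys 1..150
num_reducers = 4

# Pick reducer cutpoints at sample quantiles, as a total-order
# partitioner would.
cuts = [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]

# Route every key to the reducer whose range contains it.
counts = [0] * num_reducers
for k in keys:
    counts[bisect.bisect_right(cuts, k)] += 1

# All cutpoints fall in the low key range, so the last reducer receives
# everything above the sample's maximum key.
```

      In this toy run the last reducer receives 488 of the 600 keys while the other three get 37-38 each, mirroring the one overloaded reducer observed above.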

      One way to work around this problem is to increase hive.optimize.sampling.orderby.number to a larger value (close to the number of rows in the source table). But then Hive must be given a larger heap (hive-env.sh), otherwise it fails while creating the sampling data. With larger data volumes, sampling the entire data set is not practical.
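      Another commonly used workaround, when a single globally sorted output is not strictly required, is to replace the global ORDER BY with DISTRIBUTE BY plus SORT BY (a sketch based on the query above; each reducer's output is sorted locally, but the outputs are not merged into one global order):

```sql
-- Sketch: spread rows across reducers by hash of l_orderkey and sort
-- within each reducer; avoids the single-reducer bottleneck at the cost
-- of a single global ordering.
set mapred.reduce.tasks=100;
insert overwrite table lineitem_temp_report
select
  l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice,
  l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate,
  l_receiptdate, l_shipinstruct, l_shipmode, l_comment
from lineitem
distribute by l_orderkey
sort by l_orderkey, l_partkey, l_suppkey;
```

      Because all rows with the same l_orderkey hash to the same reducer, each output file is internally sorted on the three keys, which is sufficient for many downstream consumers.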


          Activity

          Navis added a comment -

          Vishal Kamath Thinking of implementing something like an InputSampler.SplitSampler. Would it be helpful for this case?
          Vishal Kamath added a comment -

          I think this might help.

          InputSampler.SplitSampler<K,V> would sample the first n records from s splits; getSample() would take the first numSamples/numSplits records.

          Would it be possible for you to explain your approach with an example? Let us say that you have 1,000,000 rows in your file and, say, 10 splits.
          Navis added a comment -

          Vishal Kamath Booked an issue for the split sampling described above. Hope it's helpful.
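          The split-sampling idea discussed above can be illustrated with a toy sketch (hypothetical Python, not Hadoop's InputSampler code). With 1,000,000 rows in 10 splits and 1,000 samples requested, getSample() would take the first 1000/10 = 100 records of each split, so the sample spans the whole table even when it is globally sorted:

```python
# Toy model of InputSampler.SplitSampler: take the first
# numSamples // numSplits records from each split. Not Hadoop code;
# names and splitting scheme are illustrative.
def split_sample(records, num_splits, num_samples):
    split_size = len(records) // num_splits
    per_split = num_samples // num_splits
    sample = []
    for s in range(num_splits):
        start = s * split_size          # beginning of split s
        sample.extend(records[start:start + per_split])
    return sample

# 1,000,000 sorted keys in 10 splits, 1,000 samples total.
keys = list(range(1, 1_000_001))
sample = split_sample(keys, num_splits=10, num_samples=1000)
```

          Unlike the prefix sample seen earlier, this sample's minimum and maximum keys come from the first and last splits, so the derived reducer cutpoints cover (nearly) the full key range.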

            People

            • Assignee: Unassigned
            • Reporter: Vishal Kamath
            • Votes: 0
            • Watchers: 3
