Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-5170

Sorted Bucketed Partitioned Insert hard-codes the reducer count == bucket count

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Done
    • 0.12.0
    • None
    • Query Processor
    • None
    • Ubuntu LXC

    Description

      When performing a hive sorted-partitioned insert, the insert optimizer hard-codes the number of output files to the actual bucket count of the table.

      https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L4852

      We need at least that many reducers or if limited, switch to multi-spray (as implemented already), but more reducers is wasteful as long as the HiveKey only contains the partition columns.

      At this point, we're limited to reducers = n-bucket still, which is a problem for partitioning requests which need to insert nearly a terabyte of data into a single-digit bucket count and four-digit partition count.

      Since that is routed by the hasCode of the HiveKey, we can ensure that works by modifying the HiveKey to handle n-buckets internally.

      Basically it should only generate hashCode = (sort_cols.hashCode() % n) routing only to n reducers over-all, despite how many we spin up.

      So far so good with the hard-coded reducer count.

      But provided we fix the issues brought up by HIVE-5169, the insert becomes friendlier to a higher reducer count as well.

      At this juncture, we can modify the hashCode to be slightly more interesting.

      hashCode = (part_cols.hashCode()*31 + (sort_cols.hashCode() % n))

      This generates somewhere between n to partition_count * n unique hash-codes.

      Since the sort-order & bucketing has to be maintained per-partition dir, distributing this equally across any number of reducers will result in the scale-out of the reducer count.

      This will allow a reducer count that will allow for far faster inserts of ORC data into a partitioned/sorted table.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              gopalv Gopal Vijayaraghavan
              Votes:
              1 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: