  1. Hive
  2. HIVE-7074

The reducer parallelism should be a prime number for better stride protection

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Statistics
    • Labels: None

    Description

      The current Hive reducer parallelism estimate causes stride problems in key distribution.

      A JOIN that generates only even key values gets strided onto a subset of the reducers.

      The probability of distribution skew is controlled by the number of common factors shared by the hash code of the key and the number of buckets (here, the reducer count).
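      The stride effect can be reproduced with a small simulation. This is a minimal sketch, assuming hash-modulo partitioning and integer keys that hash to their own value; the function name reducers_hit and the key range are made up for illustration and are not taken from Hive.

```python
from collections import Counter
from math import gcd


def reducers_hit(keys, num_reducers):
    """Count rows per reducer under hash-modulo partitioning.

    Assumption: the partition index is (hash & 0x7FFFFFFF) % num_reducers,
    and an integer key hashes to its own value.
    """
    return Counter((k & 0x7FFFFFFF) % num_reducers for k in keys)


even_keys = range(0, 2000, 2)  # 1000 keys with a stride of 2

for n in (16, 17):
    used = reducers_hit(even_keys, n)
    # Keys with stride s can reach at most n / gcd(s, n) reducers.
    print(f"reducers={n} used={len(used)} bound={n // gcd(2, n)}")
```

      With 16 reducers the even keys land on only 8 of them, while with 17 reducers all 17 receive rows.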

      Using a prime number as the estimated reducer count reduces that probability significantly, because a prime shares no common factor with any hash code that is not a multiple of that prime.
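      One way to picture the proposal is to round the estimated reducer count up to the nearest prime. The sketch below is only an illustration of that idea and is not the content of the attached patch; is_prime and next_prime are hypothetical helper names.

```python
def is_prime(n):
    """Trial-division primality test, adequate for reducer-sized counts."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def next_prime(estimate):
    """Return the smallest prime that is >= estimate (hypothetical helper)."""
    candidate = max(estimate, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


# An estimate of 16 reducers would be bumped to 17.
print(next_prime(16))
```

      A prime count only protects against strides that are not multiples of that prime, so this lowers the chance of skew rather than eliminating it.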

      Attachments

        1. HIVE-7074.1.patch
          3 kB
          Gopal Vijayaraghavan

            People

              Assignee: Gopal Vijayaraghavan (gopalv)
              Reporter: Gopal Vijayaraghavan (gopalv)
              Votes: 0
              Watchers: 5