  Hive / HIVE-3640

Reducer allocation is incorrect if enforce bucketing and mapred.reduce.tasks are both set

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.10.0
    • Component/s: Query Processor
    • Labels: None

      Description

      When I enforce bucketing and fix the number of reducers via mapred.reduce.tasks, Hive ignores my input and instead takes the largest value <= hive.exec.reducers.max that is also an exact divisor of num_buckets. In other words, if I set 1024 buckets and set mapred.reduce.tasks=1024, I'll get... 256 reducers. If I set 1997 buckets and set mapred.reduce.tasks=1997, I'll get... 1 reducer.
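
      To make the rule above concrete, here is a rough Python sketch of the selection as I've described it. This is not Hive's actual code; the function name is made up for illustration, and the reducer cap of 500 is an assumption (I haven't stated the real hive.exec.reducers.max value here), picked only because it is consistent with the 256 figure.

```python
def largest_divisor_at_most(num_buckets: int, max_reducers: int) -> int:
    """Return the largest d <= max_reducers such that num_buckets % d == 0.

    Hypothetical helper: it models the behavior described in this report,
    not the real Hive implementation.
    """
    if num_buckets < 1 or max_reducers < 1:
        raise ValueError("num_buckets and max_reducers must be positive")
    # Walk down from the cap; 1 divides everything, so the loop always returns.
    for candidate in range(min(num_buckets, max_reducers), 0, -1):
        if num_buckets % candidate == 0:
            return candidate
    raise AssertionError("unreachable: 1 always divides num_buckets")


# ASSUMPTION: a cap of 500 stands in for hive.exec.reducers.max.
print(largest_divisor_at_most(1024, 500))  # 256
print(largest_divisor_at_most(1997, 500))  # 1, because 1997 is prime
```

      The prime case is the worst one: any cap below the bucket count collapses the job to a single reducer.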

      This is totally crazy, and it's far, far crazier when the data inputs get large. In the latter case the bucketing job will almost certainly fail because we'll most likely try to stuff several TB of input through a single reducer. We'll also drastically reduce the effectiveness of bucketing, since the buckets themselves will be larger.

      If the user sets mapred.reduce.tasks in a query that inserts into a bucketed table, we should either accept that value or raise an exception if it's invalid relative to the number of buckets. We should absolutely NOT override the user's direction and fall back on automatically allocating reducers based on some obscure logic dictated by a completely different setting.
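
      A minimal sketch of the behavior I'm asking for, again in Python and again not Hive code. Two things here are my assumptions: "invalid" is taken to mean the requested count does not divide the number of buckets evenly, and a missing or non-positive mapred.reduce.tasks is treated as "not set by the user". The function name is hypothetical.

```python
from typing import Optional


def resolve_reducer_count(num_buckets: int,
                          requested: Optional[int],
                          max_reducers: int) -> int:
    """Honor an explicit reducer count, or fail loudly if it cannot work."""
    if num_buckets < 1 or max_reducers < 1:
        raise ValueError("num_buckets and max_reducers must be positive")

    if requested is not None and requested > 0:
        # ASSUMPTION: a request is valid only if it divides num_buckets exactly.
        if num_buckets % requested != 0:
            raise ValueError(
                f"mapred.reduce.tasks={requested} does not evenly divide "
                f"{num_buckets} buckets"
            )
        return requested  # never silently replaced by another value

    # No explicit request: only now fall back to automatic allocation,
    # modeled as the largest divisor of num_buckets within the cap.
    upper = min(num_buckets, max_reducers)
    return max(d for d in range(1, upper + 1) if num_buckets % d == 0)
```

      With this shape, asking for 1997 reducers on 1997 buckets returns 1997 regardless of the cap, and asking for 1000 reducers on 1024 buckets raises an error instead of quietly picking something else.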

      I have yet to encounter a single person who expected this the first time, so it's clearly a bug.

            People

            • Assignee: Vighnesh Avadhani (vighnesh.avadhani)
            • Reporter: Vighnesh Avadhani (vighnesh.avadhani)
            • Votes: 0
            • Watchers: 5

            Time Tracking

            • Estimated: 48h
            • Remaining: 48h
            • Logged: Not Specified