Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-6131

Integer overflow in RMContainerAllocator results in starvation of applications

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Invalid
    • None
    • None
    • None
    • None

    Description

      When processing large datasets, Hadoop encounters a scenario where all
      containers run reduce tasks and no map tasks are scheduled. The
      application does not fail but rather remains in this state without making
      any forward progress. It then has to be manually terminated.

      This bug is due to integer overflow in scheduleReduces() of
      RMContainerAllocator. The variable netScheduledMapMem overflows for
      large data sizes, takes negative value, and results in a large
      finalReduceMemLimit and a large rampup value. In almost all cases, this
      large rampup value is greater than the total number of reduce tasks.
      Therefore, the AM tries to assign all reduce tasks. And if the total number
      of reduce tasks is greater than the total container slots, then all slots are
      taken up by reduce tasks, leaving none for maps.

      With 128MB block size and 2GB map container size, overflow occurs with 128 TB data size. An example scenario for the reproduction is:

      • Input data size of 32TB, block size 128MB, Map container size = 10GB,
        reduce container size = 10GB, #reducers = 50, cluster mem capacity = 7 x 40GB, slowstart=0.0

      Better resolution might be to change the variables used in
      RMContainerAllocator from int to long. A simpler fix instead would be to
      only change the local variables of scheduleReduces() to long data types.
      Patch is attached for 2.2.0.

      Attachments

        Activity

          People

            Unassigned Unassigned
            kapache Kamal Kc
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: