Apache Tez / TEZ-3293

Fetch failures can cause a shuffle hang waiting for memory merge that never starts

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.7.1, 0.8.3
    • Fix Version/s: 0.7.2, 0.9.0, 0.8.4
    • Component/s: None
    • Labels: None

    Description

      Tez jobs can hang in shuffle waiting for a memory merge that never starts. When a MapOutput is reserved, usedMemory is incremented, but when it is unreserved, both usedMemory and commitMemory are decremented. A fetch that fails after reserving memory therefore lowers commitMemory by an amount that was never added to it. If enough shuffle failures of sufficient size occur, commitMemory may never reach the merge threshold even after all outstanding transfers have committed, and the shuffle hangs.
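
      The accounting problem can be illustrated with a minimal sketch. This is not the Tez code: the class ShuffleMemorySketch, its methods, and the byte sizes are hypothetical, and only the counter names usedMemory and commitMemory come from the description. The "fixed" branch, where unreserve touches only usedMemory, is an assumed correction and is not taken from the attached patches.

```java
// Hypothetical model of the counters named in the description; not the Tez implementation.
final class ShuffleMemorySketch {
    long usedMemory;    // bytes reserved for in-memory map outputs
    long commitMemory;  // bytes of map outputs whose fetch completed
    final long mergeThreshold;
    final boolean buggyUnreserve;

    ShuffleMemorySketch(long mergeThreshold, boolean buggyUnreserve) {
        this.mergeThreshold = mergeThreshold;
        this.buggyUnreserve = buggyUnreserve;
    }

    // Reserving memory for a fetch only raises usedMemory.
    void reserve(long size) {
        usedMemory += size;
    }

    // Called when a fetch fails and its reservation is released.
    void unreserve(long size) {
        usedMemory -= size;
        if (buggyUnreserve) {
            // Behavior described in the report: commitMemory is lowered
            // even though this output was never committed.
            commitMemory -= size;
        }
    }

    // Called when a fetch completes successfully.
    void commit(long size) {
        commitMemory += size;
    }

    boolean mergeWouldStart() {
        return commitMemory >= mergeThreshold;
    }

    // Two failed fetches of 30 bytes, then four successful fetches of 25 bytes,
    // against an assumed merge threshold of 100 bytes.
    static boolean run(boolean buggy) {
        ShuffleMemorySketch m = new ShuffleMemorySketch(100, buggy);
        for (int i = 0; i < 2; i++) {
            m.reserve(30);
            m.unreserve(30);
        }
        for (int i = 0; i < 4; i++) {
            m.reserve(25);
            m.commit(25);
        }
        return m.mergeWouldStart();
    }

    public static void main(String[] args) {
        System.out.println("buggy unreserve, merge starts: " + run(true));
        System.out.println("fixed unreserve, merge starts: " + run(false));
    }
}
```

      With the buggy unreserve, the two failures leave commitMemory at -60, so the 100 committed bytes only bring it to 40 and the threshold is never reached. With the assumed correction, commitMemory reaches 100 and the merge can start.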

      Attachments

        1. TEZ-3293.001-branch-0.7.patch
          4 kB
          Siddharth Seth
        2. TEZ-3293.001.patch
          5 kB
          Jason Darrell Lowe

            People

              Assignee: Jason Darrell Lowe (jlowe)
              Reporter: Jason Darrell Lowe (jlowe)
              Votes: 0
              Watchers: 3
