Hadoop Common / HADOOP-2220

Reduce tasks fail too easily because of repeated fetch failures


Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.16.0
    • Fix Version/s: 0.16.0
    • Component/s: None
    • Labels: None

    Description

      Currently, a reduce task fails once it has more than MAX_FAILED_UNIQUE_FETCHES (hard-coded to 5) failures fetching output from distinct mappers (I believe this was introduced in HADOOP-1158).

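      For illustration only (this is not the actual ReduceTask code; class and method names below are hypothetical), the failure condition amounts to a hard-coded threshold on the number of distinct mappers whose output could not be fetched:

          import java.util.HashSet;
          import java.util.Set;

          // Hypothetical sketch of the current behavior: fail the reduce task as soon
          // as output from a fixed number of *distinct* mappers cannot be fetched.
          class FetchFailureTracker {
            private static final int MAX_FAILED_UNIQUE_FETCHES = 5; // hard-coded today
            private final Set<Integer> failedMaps = new HashSet<Integer>();

            // Record a failed fetch from the given map task; returns true once the
            // reduce task should be declared failed.
            boolean recordFailedFetch(int mapTaskId) {
              failedMaps.add(mapTaskId);
              return failedMaps.size() >= MAX_FAILED_UNIQUE_FETCHES;
            }
          }
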
      This causes problems for long-running jobs whose many mappers execute in multiple waves:
      reduce tasks that are otherwise healthy fail because resource contention produces too many fetch failures, and the replacement reduce tasks have to re-fetch all data from the mappers that already completed successfully, introducing a lot of additional IO overhead. Worse, the whole job fails once the same reducer exhausts its maximum number of attempts.

      The limit should be a function of the number of mappers and/or the number of mapper waves, and should be more conservative (e.g. there is no need to fail a reducer when speculative execution is enabled and there are enough free slots to launch speculative reducers). We might also consider not counting such a restart against the number of attempts.
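      As a rough sketch of the proposal (not a concrete patch; the constants and names are hypothetical), the threshold could scale with the number of map tasks instead of being an absolute constant:

          // Hypothetical sketch: derive the allowed number of unique fetch failures
          // from the total number of maps, keeping the old value as a floor so that
          // small jobs behave as before.
          class FetchFailureLimit {
            private static final int MIN_FAILED_UNIQUE_FETCHES = 5; // today's hard-coded value, kept as a floor
            private static final double FAILURE_FRACTION = 0.05;    // hypothetical: tolerate failures from 5% of maps

            // Allowed number of unique fetch failures, scaled with job size.
            static int maxFailedUniqueFetches(int numMaps) {
              return Math.max(MIN_FAILED_UNIQUE_FETCHES, (int) (numMaps * FAILURE_FRACTION));
            }
          }

          // Example: a job with 2,000 maps would tolerate up to 100 unique fetch
          // failures per reduce attempt instead of just 5.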

      Attachments

        Issue Links

        Activity


          People

            Assignee: Amar Kamat (amar_kamat)
            Reporter: Christian Kunz (ckunz)
            Votes: 0
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved:
