  1. Hadoop Map/Reduce
  2. MAPREDUCE-5903

If Kerberos Authentication is enabled, MapReduce job is failing on reducer phase

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Invalid
    • 2.4.0
    • None
    • None
    • hadoop: 2.4.0.2.1.2.0

    Description

      I have a 3-node cluster: 1 ResourceManager and 3 NodeManagers. Kerberos is enabled, with principals and keytabs for hdfs, yarn, and mapred. The ResourceManager and NodeManagers run as the yarn user, using the yarn Kerberos principal.
      Use case 1: WordCount, with the job submitted using the yarn UGI (i.e. the superuser, which has a Kerberos principal on all boxes). Result: the job completes successfully.
      Use case 2: WordCount, with the job submitted using LDAP user impersonation via the yarn UGI. Result: the map tasks complete successfully, but the reduce task fails with a ShuffleError caused by java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES (see the stack trace below).
      The impersonation use case used to work on earlier versions, without YARN (with the JobTracker and TaskTrackers).

      I found a similar issue involving Kerberos authentication here: https://groups.google.com/forum/#!topic/nosql-databases/tGDqs75ACqQ
      There is also https://issues.apache.org/jira/browse/MAPREDUCE-4030, which is marked as resolved, but the problem still occurs when Kerberos authentication is enabled.

      The exception trace from the YarnChild JVM:
      2014-05-21 12:49:35,687 FATAL fetcher#3 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Shuffle failed with too many fetch failures and insufficient progress!
      2014-05-21 12:49:35,688 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#3
      at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
      at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:416)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
      at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
      Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
      at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:323)
      at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:245)
      at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:347)
      at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
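
When triaging a trace like the one above, the innermost "Caused by:" line is usually the quickest pointer to the underlying failure. The following is only a minimal sketch, assuming the trace is available as plain text; the class `RootCause` and the method `lastCause` are hypothetical helpers introduced for illustration and are not part of Hadoop.

```java
// Hypothetical triage helper (not part of Hadoop): pulls the innermost
// "Caused by:" message out of a plain-text Java stack trace.
final class RootCause {
    private static final String PREFIX = "Caused by: ";

    // Returns the text after the last "Caused by: " marker,
    // or null if the trace contains no such line.
    static String lastCause(String trace) {
        String cause = null;
        for (String line : trace.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith(PREFIX)) {
                // Later matches overwrite earlier ones, so the innermost cause wins.
                cause = trimmed.substring(PREFIX.length());
            }
        }
        return cause;
    }

    public static void main(String[] args) {
        // Abbreviated copy of the trace from this report.
        String trace = String.join("\n",
            "org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#3",
            "at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)",
            "Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.",
            "at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)");
        System.out.println(lastCause(trace));
    }
}
```

For the trace in this report, the helper returns the IOException message about MAX_FAILED_UNIQUE_FETCHES. That message says the reducer gave up fetching map output, not why the fetches failed.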

          People

            Assignee: Unassigned
            Reporter: Victor Kim (grelaxus)
            Votes: 2
            Watchers: 17
