  1. Hadoop Map/Reduce
  2. MAPREDUCE-1296

Tasks fail after the first disk (/grid/0/) of all TTs reaches 100%, even though other disks still have space.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.2
    • Fix Version/s: None
    • Component/s: capacity-sched
    • Labels: None

    Description

      Tasks fail after the first disk (/grid/0/) of all TTs reaches 100%, even though other disks still have space.

      In a cluster, data is distributed almost uniformly across disks. Disk /grid/0/ reaches 100% first because it also holds extra data such as logs. Once it reaches 100%, tasks start to fail with the error:

      java.lang.Throwable: Child Error
      at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:516)
      Caused by: java.io.IOException: Task process exit with nonzero status of 1.
      at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:503)

      This happens even though the other disks are only about 80% full and still have room for more data.
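The expected behavior, falling through to a disk that still has space instead of failing on the full one, can be sketched as below. This is only an illustration under stated assumptions, not Hadoop's code and not the fix for this issue: the class `LocalDirPicker`, the method `pickDir`, the 64 MB threshold, and the `/grid/1/` path are all hypothetical (the report names only `/grid/0/`).

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Hadoop.
class LocalDirPicker {

    /**
     * Returns the first directory that exists, is writable, and has at
     * least minFreeBytes of usable space. Full or missing disks are skipped.
     */
    static Optional<File> pickDir(List<File> localDirs, long minFreeBytes) {
        for (File dir : localDirs) {
            if (dir.isDirectory() && dir.canWrite()
                    && dir.getUsableSpace() >= minFreeBytes) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed layout: one local directory per disk.
        List<File> dirs = Arrays.asList(new File("/grid/0/"), new File("/grid/1/"));
        long needed = 64L * 1024 * 1024; // assumed minimum free space: 64 MB

        Optional<File> picked = pickDir(dirs, needed);
        if (picked.isPresent()) {
            System.out.println("Using " + picked.get());
        } else {
            System.out.println("No local directory has enough free space");
        }
    }
}
```

With `/grid/0/` at 100% and `/grid/1/` at 80%, a selection like this would return `/grid/1/` rather than report an error.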

      Steps to reproduce:

      1) Bring up a cluster with the Linux task controller.
      2) Start filling the DFS with data using randomwriter or teragen.
      3) Once the first disk reaches 100%, tasks start to fail.
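While reproducing, it may help to watch how full each disk is so that the moment in step 3 is easy to spot. The following is a minimal sketch, assuming the mount points are passed on the command line; the class `DiskUsageReport` and method `percentUsed` are hypothetical names, and the figure is computed from `File.getTotalSpace()` and `File.getFreeSpace()`, so it may differ slightly from what `df` reports.

```java
import java.io.File;

// Hypothetical monitoring helper, not part of Hadoop.
class DiskUsageReport {

    /**
     * Percentage of the partition's capacity currently in use,
     * or -1 if the path does not name a partition.
     */
    static double percentUsed(File mount) {
        long total = mount.getTotalSpace(); // 0 when the path names no partition
        if (total == 0L) {
            return -1.0;
        }
        long used = total - mount.getFreeSpace();
        return 100.0 * used / total;
    }

    public static void main(String[] args) {
        // Example invocation: java DiskUsageReport.java /grid/0/ <other mount points>
        for (String path : args) {
            double pct = percentUsed(new File(path));
            if (pct < 0) {
                System.out.println(path + ": not a partition");
            } else {
                System.out.printf("%s: %.1f%% used%n", path, pct);
            }
        }
    }
}
```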

          People

            Assignee: Unassigned
            Reporter: Iyappan Srinivasan (iyappans)
            Votes: 0
            Watchers: 8
