Hadoop Common: HADOOP-3859

1000 concurrent reads on a single file failing the task/client

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.17.1
    • Fix Version/s: 0.17.2
    • Component/s: None
    • Labels:
      None
    • Environment:

      0.17.2 (0.17.1-H3002-H3633-H3681-H3685-H3370-H3707-H3760-H3758)

    • Hadoop Flags:
      Reviewed
    • Release Note:
      Allows the user to change the maximum number of xceivers in the datanode.

      Description

      After the fix for HADOOP-3633, some users started seeing their tasks fail with:

      08/07/29 05:13:07 INFO mapred.JobClient: Task Id : task_200807290511_0001_m_000846_0, Status : FAILED
      java.io.IOException: Could not obtain block: blk_-7893038518783920880 file=/tmp/files111
              at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1430)
              at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1281)
              at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1385)
              at java.io.DataInputStream.read(DataInputStream.java:83)
              at org.apache.hadoop.mapred.LineRecordReader$LineReader.backfill(LineRecordReader.java:88)
              at org.apache.hadoop.mapred.LineRecordReader$LineReader.readLine(LineRecordReader.java:114)
              at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:179)
              at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:50)
              at org.apache.hadoop.mapred.MapTask.run(MapTask.java:211)
              at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
      

      This happened when hundreds of mappers pulled the same file concurrently.

    Attachments

      1. HADOOP-3859.patch
        2 kB
        Johan Oskarsson


          Activity

          Raghu Angadi added a comment - edited

          As you can see in HADOOP-3633, it was a conscious decision not to make this configurable. I didn't even try to make it public in this JIRA; it is expected to be a private config variable. Do we have any reasons to make it public? (I can think of a couple, I guess.)

          Nigel Daley added a comment -

          Johan, was there a reason that this config option wasn't added to hadoop-defaults.xml?

          Hudson added a comment -

          Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/ )
          Owen O'Malley added a comment -

          I just committed this. Thanks, Johan!

          Raghu Angadi added a comment -

          +1. Looks good to me. I don't see any advantage in not making it configurable.

          Johan Oskarsson added a comment -

          This patch makes the maximum xceiver count a setting so that users with large clusters can adjust this knob themselves. It's not a perfect fix, but I think it's a start so we can release 0.17.2.
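
A minimal sketch of what using the new setting looks like. The key name below, dfs.datanode.max.xcievers (with the historical spelling also seen in the datanode log messages on this issue), is my assumption about what the committed patch reads; per Nigel's comment it was not added to hadoop-defaults.xml, so check the patch for the exact name before relying on it:

```xml
<!-- hadoop-site.xml fragment (0.17.2): raise the DataNode's concurrent
     xceiver limit from the formerly hard-coded 256. Each active xceiver
     is a thread, so very large values add thread/memory pressure. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1024</value>
</property>
```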

          Koji Noguchi added a comment -

          For this user, it happened when running on 200 nodes with 6 map slots per node.

          Using DistributedCache or increasing the replication would fix it, but the problem still remains if we have a larger mapred cluster.
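
A back-of-the-envelope check of why this setup hits the limit, using the numbers reported in this comment plus the default replication factor of 3 (an assumption; the file's actual replication is not stated):

```python
# Only the DataNodes holding a replica can serve a block, and each was
# capped at a hard-coded 256 concurrent xceivers before this patch.
def xceiver_capacity(replication, max_xceivers_per_datanode=256):
    """Upper bound on simultaneous readers of one file's blocks."""
    return replication * max_xceivers_per_datanode

concurrent_readers = 200 * 6            # 200 nodes x 6 map slots, all reading
default_capacity = xceiver_capacity(3)  # assumed replication factor of 3

print(concurrent_readers)               # 1200
print(default_capacity)                 # 768
print(concurrent_readers > default_capacity)  # True: demand exceeds capacity
```

So even with every replica saturated, roughly a third of the mappers cannot get a connection, which matches the "Could not obtain block" failures in the description.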

          Koji Noguchi added a comment -

          On the datanode side, log showed

          2008-07-29 05:12:08,758 ERROR org.apache.hadoop.dfs.DataNode: 11.111.11.11:50010:DataXceiver: java.io.IOException:
          xceiverCount 257 exceeds the limit of concurrent xcievers 256
            at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:977)
            at java.lang.Thread.run(Thread.java:619)
          
          2008-07-29 05:12:08,775 ERROR org.apache.hadoop.dfs.DataNode: 11.111.11.11:50010:DataXceiver: java.io.IOException:
          xceiverCount 258 exceeds the limit of concurrent xcievers 256
            at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:977)
            at java.lang.Thread.run(Thread.java:619)
          
          2008-07-29 05:12:08,776 ERROR org.apache.hadoop.dfs.DataNode: 11.111.11.11:50010:DataXceiver: java.io.IOException:
          xceiverCount 258 exceeds the limit of concurrent xcievers 256
            at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:977)
            at java.lang.Thread.run(Thread.java:619)
          
          2008-07-29 05:12:08,780 ERROR org.apache.hadoop.dfs.DataNode: 11.111.11.11:50010:DataXceiver: java.io.IOException:
          xceiverCount 257 exceeds the limit of concurrent xcievers 256
            at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:977)
            at java.lang.Thread.run(Thread.java:619)
          

            People

            • Assignee:
              Johan Oskarsson
              Reporter:
              Koji Noguchi
            • Votes:
              0
              Watchers:
              2
