Hadoop HDFS / HDFS-162

DataNode fails to deliver blocks, holds thousands of open socket connections

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment:

      Hadoop-0.18.0, 7-node Linux grid (6 DataNodes, 1 master node)
      Hadoop-0.18.0, 20-node EC2 Linux grid (19 DataNodes, 1 master node)

      Description

      9/27 update: uploaded the logs, with hopefully all the bits that should be examined. If other things are needed, just let me know. Note that all the paths refer to 0.18.1. This is still an 18.0 installation using the 18.0 core jar, just installed to a non-standard location.

      9/26 update: we have successfully reproduced this using Hadoop 0.18 as well. The problem happens both on our own network infrastructure and on an Amazon EC2 cluster running CentOS5 images. I'll be attaching the logs Raghu asked for shortly.

      A job that used to run correctly on our grid (in 0.15.0) now fails. The failure occurs after the map phase is complete, and about 2/3rds of the way through the reduce phase. This job is processing a modest amount of input data (approximately 220G).

      When the error occurs the nodes hosting DataNodes have literally thousands of open socket connections on them. The DataNode instances are holding large amounts of memory. Sometimes the DataNodes crash or exit, other times they continue to run.

      The error which gets kicked out from the application perspective is:

      08/05/27 11:30:08 INFO mapred.JobClient: map 100% reduce 89%
      08/05/27 11:30:41 INFO mapred.JobClient: map 100% reduce 90%
      08/05/27 11:32:45 INFO mapred.JobClient: map 100% reduce 86%
      08/05/27 11:32:45 INFO mapred.JobClient: Task Id : task_200805271056_0001_r_000007_0, Status : FAILED
      java.io.IOException: Could not get block locations. Aborting...
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1832)
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1487)
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1579)

      I then discovered that 1 or more DataNode instances on the slave nodes
      are down (we run 1 DataNode instance per machine). The cause for at
      least some of the DataNode failures is a JVM internal error that gets
      raised due to a complete out-of-memory scenario (on a 4G, 4-way machine).

      Watching the DataNodes run, I can see them consuming more and more
      memory. For those failures for which there is a JVM traceback, I see (in
      part...NOTE 0.16.4 TRACEBACK):
      #
      # java.lang.OutOfMemoryError: requested 16 bytes for CHeapObj-new. Out of swap space?
      #
      # Internal Error (414C4C4F434154494F4E0E494E4C494E450E4850500017), pid=4246, tid=2283883408
      #
      # Java VM: Java HotSpot(TM) Server VM (1.6.0_02-b05 mixed mode)
      # If you would like to submit a bug report, please visit:
      #   http://java.sun.com/webapps/bugreport/crash.jsp
      #
      ---------------  T H R E A D  ---------------
      Current thread (0x8a942000): JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@3f4f44" daemon [_thread_in_Java, id=15064]
      Stack: [0x881c4000,0x88215000), sp=0x882139e0, free space=318k
      Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
      V [libjvm.so+0x53b707]
      V [libjvm.so+0x225fe1]
      V [libjvm.so+0x16fdc5]
      V [libjvm.so+0x22aef3]
      Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
      v blob 0xf4f235a7
      J java.io.DataInputStream.readInt()I
      j org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(Ljava/io/DataOutputStream;Ljava/io/DataInputStream;Ljava/io/DataOutputStream;Ljava/lang/String;Lorg/apache/hadoop/dfs/DataNode$Throttler;I)V+126
      j org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(Ljava/io/DataInputStream;)V+746
      j org.apache.hadoop.dfs.DataNode$DataXceiver.run()V+174
      j java.lang.Thread.run()V+11
      v ~StubRoutines::call_stub

      ---------------  P R O C E S S  ---------------
      Java Threads: ( => current thread )
      0x0ae3f400 JavaThread "process reaper" daemon [_thread_blocked, id=26870]
      0x852e6000 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@e5dce1" daemon [_thread_in_vm, id=26869]
      0x08a1cc00 JavaThread "PacketResponder 0 for Block blk_-6186975972786687394" daemon [_thread_blocked, id=26769]
      0x852e5000 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@c40bf8" daemon [_thread_in_native, id=26768]
      0x0956e000 JavaThread "PacketResponder 0 for Block blk_-2322514873363546651" daemon [_thread_blocked, id=26767]
      0x852e4400 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@1ca61f9" daemon [_thread_in_native, id=26766]
      0x09d3a400 JavaThread "PacketResponder 0 for Block blk_8926941945313450801" daemon [_thread_blocked, id=26764]
      0x852e3c00 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@1e186d9" daemon [_thread_in_native, id=26763]
      0x0953d000 JavaThread "PacketResponder 0 for Block blk_4785883052769066976" daemon [_thread_blocked, id=26762]
      0xb13a5c00 JavaThread "org.apache.hadoop.dfs.DataNode$DataXceiver@13d62aa" daemon [_thread_in_native, id=26761]
      The interesting part here is that if I count the number of JavaThreads
      running org.apache.hadoop.dfs.DataNode I see 4,538 in the
      traceback. The number of threads was surprising.

      Other DataNodes just exit without panicking the JVM. In either failure
      mode, the last few lines of the DataNode log file are apparently
      innocuous:

      2008-05-27 11:31:47,663 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 got response for connect ack from downstream datanode with firstbadlink as
      2008-05-27 11:31:47,663 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is
      2008-05-27 11:31:48,268 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_-2241766430103062484 src: /10.2.14.10:33626 dest: /10.2.14.10:50010
      2008-05-27 11:31:48,740 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_313239508245918539 src: /10.2.14.24:37836 dest: /10.2.14.24:50010
      2008-05-27 11:31:48,740 INFO org.apache.hadoop.dfs.DataNode: Datanode 0 forwarding connect ack to upstream firstbadlink is
      2008-05-27 11:31:49,044 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_1684581399908730353 src: /10.2.14.16:51605 dest: /10.2.14.16:50010
      2008-05-27 11:31:49,044 INFO org.apache.hadoop.dfs.DataNode: Datanode 0 forwarding connect ack to upstream firstbadlink is
      2008-05-27 11:31:49,509 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_2493969670086107736 src: /10.2.14.18:47557 dest: /10.2.14.18:50010
      2008-05-27 11:31:49,513 INFO org.apache.hadoop.dfs.DataNode: Datanode 1 got response for connect ack from downstream datanode with firstbadlink as
      2008-05-27 11:31:49,513 INFO org.apache.hadoop.dfs.DataNode: Datanode 1 forwarding connect ack to upstream firstbadlink is

      Finally, the task-level output (in userlogs) doesn't reveal much
      either:

      2008-05-27 11:38:30,724 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Need 34 map output(s)
      2008-05-27 11:38:30,753 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 done copying task_200805271056_0001_m_001976_0 output from worker9.
      2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
      2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Got 33 known map output location(s); scheduling...
      2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Scheduled 1 of 33 known outputs (0 slow hosts and 32 dup hosts)
      2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Copying task_200805271056_0001_m_001248_0 output from worker8.
      2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Need 33 map output(s)
      2008-05-27 11:38:31,752 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 done copying task_200805271056_0001_m_001248_0 output from worker8.

      Attachments

      1. logs.tar.gz (2.19 MB, Meng Mao)
      2. logsforHadoopTeam.tar.gz (269 kB, Meng Mao)

        Activity

        rangadi Raghu Angadi added a comment -

        Main difference between 0.15 and 0.17 is that 0.17 needs two threads for each block being written and 0.15 needs one. Do you expect thousands of writes to any datanode?

        ccgillett Christopher Gillett added a comment -

        That depends on your definition of "writes"...it's essentially a log processing application, so it is reading in thousands of lines of log data and writing back a canonical representation. So we are writing out several hundred thousand lines of text, distributed across all the nodes in the system (i.e. 24 DataNodes in this case).

        Our Nagios monitoring shows thousands of open socket connections at the time that things fall apart, almost as if connections are not being closed properly.

        rangadi Raghu Angadi added a comment -

        There is no file descriptor or socket leak we are aware of in 0.17.0. Any more detailed information you can provide will help us see whether there is a bug, or to check what's happening in your case. A couple of things you could do:

        1. Run jstack on a datanode when you suspect it is in bad state.
        2. Attach datanode log from a node that exhibited this problem.
        3. Attach corresponding NameNode log.
        4. etc.
        mmao Meng Mao added a comment -

        This file is a set of logs that records the 'incident':

        2763033 2008-09-27 02:28 hadoop-vmc-datanode-overdrive1-node-worker284.log
        1833871 2008-09-27 02:28 hadoop-vmc-jobtracker-overdrive1-node-worker283.log
        18003026 2008-09-27 02:28 hadoop-vmc-namenode-overdrive1-node-worker283.log
        1548787 2008-09-27 02:28 hadoop-vmc-tasktracker-overdrive1-node-worker284.log
        4187144 2008-09-27 02:28 jstack-logs-overdrive1-node-worker283
        7706321 2008-09-27 02:28 jstack-logs-overdrive1-node-worker284
        87643 2008-09-27 02:28 nohup.out
        482223 2008-09-27 02:28 sockets-logs-overdrive1-node-worker283
        1233900 2008-09-27 02:28 sockets-logs-overdrive1-node-worker284

        nohup.out shows that the blowup happens around 1:11 am. Every other log shows some problem around this time.

        The jstack-logs sample the DataNode process every 1 minute on the slaves, and the NameNode process on the master. The sockets-logs sample just the Java sockets every 30s.

        I have only included 1 slave as a sample. I can throw up the other 5 if needed.

        mmao Meng Mao added a comment -

        Has anyone had a chance to take a look at the logs? Is there any other info I could collect that would make analysis easier? I'm going to plunge in and instrument the areas that seem to be affected by this bug in hopes of figuring the problem out. Any hints or advice on how to proceed?

        hairong Hairong Kuang added a comment -

        The logs you posted contain too much information. Could you please post just the following:
        1. the log of the task that failed;
        2. the log and the jstack of the problematic datanode around the task failure time.

        mmao Meng Mao added a comment -

        Files in this tar are:

        namenode/:
        -rw-r--r-- 1 mmao visible 4187144 2008-09-27 01:50 jstack-logs-overdrive1-node-worker83
        -rw------- 1 mmao visible 7551 2008-09-30 18:32 nohup.out
        -rw-r--r-- 1 mmao visible 9191 2008-09-30 18:31 overdrive1-node-worker83_1222489161710_job_200809270019_0001_vmc_CommonLogCreate
        -rw-r--r-- 1 mmao visible 34979 2008-09-30 18:30 sockets-logs-overdrive1-node-worker83

        overdrive1-node-worker84/:
        -rw-r--r-- 1 mmao visible 29627 2008-09-30 18:27 hadoop-vmc-datanode-overdrive1-node-worker84.log
        -rw-r--r-- 1 mmao visible 4719 2008-09-30 18:19 hadoop-vmc-tasktracker-overdrive1-node-worker84.log
        -rw-r--r-- 1 mmao visible 425741 2008-09-30 18:16 jstack-logs-overdrive1-node-worker84
        -rw-r--r-- 1 mmao visible 78726 2008-09-30 18:17 sockets-logs-overdrive1-node-worker84

        overdrive1-node-worker87/:
        -rw-r--r-- 1 mmao visible 19227 2008-09-30 18:24 hadoop-vmc-datanode-overdrive1-node-worker87.log
        -rw-r--r-- 1 mmao visible 7287 2008-09-30 18:11 hadoop-vmc-tasktracker-overdrive1-node-worker87.log
        -rw-r--r-- 1 mmao visible 95278 2008-09-30 18:22 sockets-logs-overdrive1-node-worker87

        Every log is edited to be centered right around the start of the phenomenon. 2 slave nodes' logs have been provided, with logs for failed tasks. Note that all six of the slaves experience the same proliferation of sockets and block access failures around the same time; I'm merely leaving those logs out for ease of browsing.
        Jstack and socket count logs are provided for both the namenode and the 2 slaves.

        For 84, the failed tasks logged are: _m_001449_1 and _r_000001_0
        For 87, the failed task logged is: _m_001444_1

        Let me know if you need more.

        rangadi Raghu Angadi added a comment -

        Good to see you persist through the analysis. It encourages us too.
        Let's trace one block, "blk_-4738287144374769594":

        From datanode-84 :

        writeBlock blk_-4738287144374769594_26925 received exception java.io.IOException: Could not read from stream [...]

        From client log :

        2008-09-27 01:07:06,810 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.2.14.89:50010
        2008-09-27 01:07:06,810 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-4738287144374769594_26925 [...]

        What happened here is that dn-84 could not read the response from dn-89. So we need the corresponding log on dn-89 to see why it failed.

        After we figure out why dn-89 failed, we should think about whether the write should have continued, since at least one datanode is ok (dn-84 in this case). Before 0.17 (and maybe 0.16) the client would just continue to write to dn-84.

        mmao Meng Mao added a comment - edited

        Thanks for your speedy reply, Raghu.

        Very curiously, I can find no mention of that block in the logs for datanode 89. Here are the only occurrences of it in the logs:

        hadoop-vmc-namenode-overdrive1-node-worker83.log:
        2008-09-27 01:07:06,806 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /MENG/import/common_logs/temporary/_attempt_200809270019_0001_r_000001_0/20071219/12/part-00001. blk_-4738287144374769594_26925

        hadoop-vmc-namenode-overdrive1-node-worker83.log:
        2008-09-27 01:07:06,806 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /MENG/import/common_logs/temporary/_attempt_200809270019_0001_r_000001_0/20071219/12/part-00001. blk_-4738287144374769594_26925

        overdrive1-node-worker84/userlogs/attempt_200809270019_0001_r_000001_0/syslog:
        2008-09-27 01:07:06,810 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-4738287144374769594_26925

        hadoop-vmc-datanode-overdrive1-node-worker84.log:
        2008-09-27 01:07:06,806 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_-4738287144374769594_26925 src: /10.2.14.84:59159 dest: /10.2.14.84:50010
        2008-09-27 01:07:06,809 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_-4738287144374769594_26925 received exception java.io.IOException: Could not read from stream

        I did double check that the IP ending in .89 does in fact map to datanode 89. Does this mean 89 completely missed/ignored the request to allocate that block?

        Here's what happened on 89 shortly before:
        2008-09-27 01:06:43,389 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.2.14.89:50010, storageID=DS-1223952255-10.2.14.89-50010-1222469402047, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: xceiverCount 258 exceeds the limit of concurrent xcievers 256
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1030)
        at java.lang.Thread.run(Thread.java:619)

        2008-09-27 01:06:43,420 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.2.14.89:50010, storageID=DS-1223952255-10.2.14.89-50010-1222469402047, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: xceiverCount 258 exceeds the limit of concurrent xcievers 256
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1030)
        at java.lang.Thread.run(Thread.java:619)
        (repeated about 80 times)

        These "258 exceeds the limit of concurrent xcievers 256" errors suddenly appear across the datanodes at around 01:07, with the very first appearing on datanode 89 at the time shown above. Counts of the errors across the datanodes:
        84: 80
        85: 0
        86: 554
        87: 84
        88: 50
        89: 83

        The 3 waves of occurrences of these errors seem to coincide with the 3 spikes in socket counts we've observed before the job totally craps out and fails.

        rangadi Raghu Angadi added a comment -

        > These "258 exceeds the limit of concurrent xcievers 256" errors suddenly appear across the datanodes at around 01:07, with the very first appearing on datanode 89 at the time shown above. Counts of the errors across the datanodes:

        This is the main problem. I really don't like this limit; it's far too low for most users. Please set "dfs.datanode.max.xcievers" to something like 2k and run. See HADOOP-3859 and HADOOP-3633 if you need more background.
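        For reference, this setting goes in the cluster configuration file. A minimal sketch, assuming the 0.18-era hadoop-site.xml layout and the property name quoted above (the value is something to tune, not a recommendation from this thread):

        <!-- hadoop-site.xml: illustrative excerpt only -->
        <property>
          <name>dfs.datanode.max.xcievers</name>
          <value>2048</value>
        </property>

        The datanodes have to be restarted to pick up the new limit.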

        rangadi Raghu Angadi added a comment -

        From the description:

        The interesting part here is that if I count the number of JavaThreads
        running org.apache.hadoop.dfs.DataNode I see 4,538 in the
        traceback. The number of threads was surprising.

        So even after you increase the threads to 2k, you might still hit the issue. You need to find out why so many threads are present. One possibility is that you could have many readers or writers, by mistake.

        mmao Meng Mao added a comment - edited

        Whoops, I see the new comments now.

        ccgillett Christopher Gillett added a comment -

        Raghu:

        Interesting about the number of threads... how would that lead to a "too many readers or writers" condition?

        I'm asking only because the code that fails is the same in both 0.15 and 0.18, and while it works in 0.15 it fails in 0.18.

        rangadi Raghu Angadi added a comment -

        On 0.15, the situation might not be bad enough. Most likely you have many (thousands of) simultaneous writers. 0.18 takes two threads for each write and 0.15 takes one at the datanode (yes, it should be improved). Another important difference is that the client on 0.15 collects one block of data (64MB) on local disk and then writes the block as fast as possible to the datanodes. So if your code is writing data slowly, it does not affect the threads on the datanodes (i.e. it is not proportional to the number of files open for writing), but since 0.16, #threads is proportional to #files_being_written. All these issues should be fixed by HADOOP-3856; please voice your support there.

        You could try doing some back-of-the-envelope calculations about how many files are being written at a given time: say you have x files being written and you have 'd' datanodes; then each datanode will have about x*6/d threads involved in writing (for the default replication of 3).
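        As a rough illustration of that formula, using numbers reported elsewhere in this thread (so treat it as a sanity check rather than a measurement): with roughly x = 4000 files open for writing on the 6-node grid and the default replication of 3, you get about 4000 * 6 / 6 = 4000 write-related threads per datanode, which is in the same ballpark as the ~3900-4600 DataNode threads observed per node.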

        mmao Meng Mao added a comment -

        Just for clarity, when you say each write, you mean any time a map or reduce task emits a chunk to be written somewhere, and this write request is handled by a datanode, right?

        So currently, from inspection, we have N = 4500, i.e. a few thousand DataNode-related threads per node. Let's just say that 3000 of these are outbound write request threads, and 1500 are listening threads for the datanode to write with.

        Which number (if any) does dfs.datanode.max.xcievers impose a cap on: 4500, 3000, or 1500?
        From my best parse of the discussion on HADOOP-3633, it's a cap on the 4500?

        What is the bottleneck that is first met when raising the dfs.datanode.max.xcievers limit? JVM memory?

        mmao Meng Mao added a comment -

        I was able to increase the size of my input data (while maintaining a very large dfs.datanode.max.xcievers) until a new kind of error cropped up. While the number of datanode threads does not exceed the limit, a similar DOS-style crash happens to the grid. This time, some of the nodes' datanode processes die and take all the child threads with them. Is it safe to assume that the JVM is running out of memory trying to open up and maintain so many threads? Do you guys know what is "beyond the wall," as far as how Hadoop will fail if dfs.datanode.max.xcievers is jacked up arbitrarily high?

        We had really hoped that, hack though it might be, raising dfs.datanode.max.xcievers would be sufficient to cover the upper bound of the data we'd expect to see in a scaled-out deployment of our grid, but the data set we used is not really pushing the limits of what we expect to get.

        Regardless of whether our implementation is too naive and should be revised, does our particular problem case lend more weight to the need to do a rewrite as discussed in issue 3856? Or is the general stance to suck it up and work around the limit?

        We greatly appreciate any input you guys have, as this is seriously holding up our project.

        cutting Doug Cutting added a comment -

        > Is it safe to assume that the JVM is running out of memory trying to open up and maintain so many threads?

        You could be running out of file handles?

        rangadi Raghu Angadi added a comment -

        Do you have approximate values for 'x' and 'd' mentioned in my comment above?

        When you have 3500+ threads you could be running out of different kinds of resources: memory, kernel memory, the ability to create any new threads, etc.

        mmao Meng Mao added a comment -

        > Do you have approximate values for 'x' and 'd' mentioned in my comment above?
        Curiously, d (number of datanodes) doesn't seem to matter.
        I've tested 3 sets of test input data on 2 grids, sized 6 and 64 datanodes.
        For each input set, the same peak number of DataNode threads are spawned per machine, seemingly independent of the number of datanodes participating.

        x (part files appearing in HDFS) is about 4000+ for the largest dataset that completes successfully, and is a few thousand more for the larger, unsuccessful run. Is it unreasonable to have this magnitude of files in flight in the course of a map reduce operation?

        I should note that the correlation between open files and datanode write threads is expectedly tight. In the two successful test runs, (peak datanode threads, open files) was:
        (178, 144)
        (3900, 4000+)
        And that's for both grid sizes.

        bshi Bo Shi added a comment -

        To clarify the last portion of Meng's note,

        (178, 144) <-- dataset A
        (3900, 4000+) <-- dataset B (3x size of A)

        mmao Meng Mao added a comment -

        As for the open files, sysctl reports fs.file-max = 1573091, which seems pretty high, and ulimit is unlimited. I'm not aware of other file limit settings that might be involved.

        rangadi Raghu Angadi added a comment -

        > For each input set, the same peak number of DataNode threads are spawned per machine, seemingly independent of the number of datanodes participating.

        This is surprising. Assuming everything else is the same (number of processes writing files and number of files being written), this seems impossible.

        How many nodes are actually writing? Do these processes run on the same machines as datanodes?

        mmao Meng Mao added a comment -

        I have confirmed it at least a couple times. My measurement for number of threads active per DataNode is:
        /usr/java/default/bin/jps | grep DataNode | cut -f1 -d' ' | xargs /usr/java/default/bin/jstack | grep ' daemon ' | wc
        basically count the number of threads reported by jstack for the DataNode process.

        Is this too coarse or inaccurate a measurement?

        > How many nodes are actually writing?

        I'm not sure how I can measure how many are writing. Is it possible for a node to open a write operation and write nothing to the file?

        > Do these processes run on the same machines as datanodes?

        I'm not sure which processes you're referring to, but I am running that command on the datanode machines of each grid. For example, on the 6-node grid, you see about 4600+ threads at the most intensive processing moment in the reduce phase on each of the 6 datanodes. And on the 64-node grid, you also see about 4600 threads at the same moment on each of the datanodes. I think I have the logs that show this; in any case I'm pretty sure I'm not mistaken.

        ccgillett Christopher Gillett added a comment -

        Further to this, our configuration is that all slaves run both TaskTracker and DataNode processes, so they are doing double duty.

        rangadi Raghu Angadi added a comment -

        A better idea of what your maps and reducers do will help. Others here can give more ideas if you give relevant info like the number of maps, the number of reducers, approximately what the maps do, etc.

        If you have the same number of threads with 6 or 64 nodes for the same job, there is certainly something unexpected happening. Maybe each map is writing a fixed number of files and you have a different number of maps on the 6- and 64-node clusters (so 'x' is different in each case).

        bshi Bo Shi added a comment -

        Sorry for the long silence on this. We've been heads-down and haven't had a chance to revisit.

        We believe what is happening is this: on a moderately sized dataset (>1 billion records) processed by a Hadoop job that produces multiple output files (using something like MultipleOutputFormat, although we rolled our own functionality in 0.15.0), we hit this error because we were writing to too many output files [1], causing a large spike in the # of threads and eventually making the job non-responsive.

        Recent JIRAs (like HADOOP-3856) related to modifying the data node threads to use asynchronous IO should help in allowing the # of output directories to scale up... but mostly I think this is a case of user-error.

        Please advise on how we should close this out or what new status flag to use for this JIRA.

        [1] we were creating daily/hourly subdirectories for each timestamp in a dataset that spanned 5 months, resulting in an absurd number of output directories.

        ankur Ankur added a comment - edited

        We are doing something similar (creating hourly/daily directories) spanning 6-8 months. We just keep the last couple of weeks of data inflated and consolidate + compress the older data, so that the hourly directories are removed and a small number of compressed files remain for each day.

        Coming to what you are doing, writing too many output directories and files is not advisable anyway, as it's a strain on the name server. From what you have told us, it sounds like your class extending MultipleOutputFormat can be tuned, along with the map-red job, to reduce the number of output files.

        See if you can add a prefix/suffix to your key/value to help you accumulate the output in a smaller number of files.
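        To make that concrete, here is a minimal sketch against the stock MultipleTextOutputFormat in the old org.apache.hadoop.mapred.lib API. The reporters rolled their own output format, so the class name DailyBucketOutputFormat, the Text key/value types, and the assumption that keys start with a yyyyMMddHH timestamp are illustrative only, not taken from their code:

        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

        // Illustrative only: bucket output by day instead of by hour, so a
        // 5-month dataset produces on the order of 150 output directories
        // rather than several thousand hourly ones.
        public class DailyBucketOutputFormat extends MultipleTextOutputFormat<Text, Text> {
          @Override
          protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            String ts = key.toString();                     // assumed to begin with "yyyyMMddHH"
            String day = ts.length() >= 8 ? ts.substring(0, 8) : "unknown";
            // "name" is the usual part-NNNNN leaf name supplied by the framework.
            return day + "/" + name;
          }
        }

        The job would then register this class via JobConf.setOutputFormat(DailyBucketOutputFormat.class), the same way the hourly version was presumably wired in.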

        qwertymaniac Harsh J added a comment -

        This has gone stale. We aren't seeing such a problem with the stable versions today.


          People

          • Assignee: Unassigned
          • Reporter: ccgillett Christopher Gillett
          • Votes: 0
          • Watchers: 12
