Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-2341

Datanode active connections never returns to 0

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Incomplete
    • 0.16.0
    • None
    • None
    • None

    Description

      On trunk i continue to see the following in my data node logs:

      2007-12-03 15:46:47,696 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 42
      2007-12-03 15:46:48,135 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 41
      2007-12-03 15:46:48,439 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 40
      2007-12-03 15:46:48,479 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 39
      2007-12-03 15:46:48,611 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 38
      2007-12-03 15:46:48,898 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 37
      2007-12-03 15:46:48,989 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 36
      2007-12-03 15:46:51,010 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 35
      2007-12-03 15:46:51,758 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 34
      2007-12-03 15:46:52,148 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 33

      This number never returns to 0, even after many hours of no new data being manipulated or added into the DFS.

      Looking at netstat -tn i see significant amount of data in the send-q that never goes away:

      tcp 0 34240 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:55792 ESTABLISHED
      tcp 0 38968 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:38169 ESTABLISHED
      tcp 0 38456 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:35456 ESTABLISHED
      tcp 0 29640 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:59845 ESTABLISHED
      tcp 0 50168 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:44584 ESTABLISHED

      When sniffing the network I see that the remote side (YY.YY.YY.YY) is returning a window size of 0
      16:11:41.760474 IP XX.XX.XX.XXX.50010 > YY.YY.YY.YY.44584: . ack 3339984123 win 46 <nop,nop,timestamp 1786247180 885681789>
      16:11:41.761597 IP YY.YY.YY.YY.44584 > XX.XX.XX.XXX.50010: . ack 1 win 0 <nop,nop,timestamp 885801786 1775711351>

      Then we look at the stack traces on each datanode, I will have tons of threads that never go away in the following trace:

      Thread 6516 (org.apache.hadoop.dfs.DataNode$DataXceiver@166068b6):
        State: RUNNABLE
        Blocked count: 0
        Waited count: 0
        Stack:
          java.net.SocketOutputStream.socketWrite0(Native Method)
          java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
          java.net.SocketOutputStream.write(SocketOutputStream.java:136)
          java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
          java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
          java.io.DataOutputStream.write(DataOutputStream.java:90)
          org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1400)
          org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1433)
          org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:904)
          org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:849)
          java.lang.Thread.run(Thread.java:619)
      

      Unfortunately there's very little in the logs with exceptions that could point to this. I have some exceptions the following, but nothing that points to problems between XX and YY:

      2007-12-02 11:19:47,889 WARN  dfs.DataNode - Unexpected error trying to delete block blk_4515246476002110310. Block not found in blockMap. 
      2007-12-02 11:19:47,922 WARN  dfs.DataNode - java.io.IOException: Error in deleting blocks.
              at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:750)
              at org.apache.hadoop.dfs.DataNode.processCommand(DataNode.java:675)
              at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:569)
              at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1720)
              at java.lang.Thread.run(Thread.java:619)
      

      Attachments

        1. dfsclient.patch
          2 kB
          Michael Stack
        2. hregionserver-stack.txt
          10 kB
          Paul Saab
        3. stacks-YY.YY.YY.YY.txt
          41 kB
          Paul Saab
        4. stacks-XX.XX.XX.XXX.txt
          30 kB
          Paul Saab

        Issue Links

          Activity

            People

              Unassigned Unassigned
              ps0ps Paul Saab
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: