Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-720

NPE in BlockReceiver$PacketResponder.run(BlockReceiver.java:923)

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: datanode
    • Labels:
      None
    • Environment:

      Description

      Running some loadings on hdfs I had one of these on the DN XX.XX.XX.139:51010:

      2009-10-21 04:57:02,755 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_6345892463926159834_1029 src: /XX,XX,XX.140:37890 dest: /XX.XX.XX.139:51010
      2009-10-21 04:57:02,829 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_6345892463926159834_1029 1 Exception java.lang.NullPointerException
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:923)
              at java.lang.Thread.run(Thread.java:619)
      

      On XX,XX,XX.140 side, it looks like this:

      10-21 04:57:01,866 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_6345892463926159834_1029 src: /XX.XX.XX.140:37385 dest: /XX.XX.XX140:51010
      2009-10-21 04:57:02,836 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_6345892463926159834_1029 terminating
      2009-10-21 04:57:02,885 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(XX.XX.XX.140:51010, storageID=DS-1292310101-208.76.44.140-51010-1256100924816, infoPort=51075, ipcPort=51020):Exception writing block blk_6345892463926159834_1029 to mirror XX.XX.XX.139:51010
      java.io.IOException: Connection reset by peer
          at sun.nio.ch.FileDispatcher.write0(Native Method)
          at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
          at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
          at sun.nio.ch.IOUtil.write(IOUtil.java:75)
          at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
          at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
          at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
          at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
          at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
          at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
          at java.io.DataOutputStream.write(DataOutputStream.java:90)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:466)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:434)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:573)
          at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
          at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:382)
          at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:323)
          at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
          at java.lang.Thread.run(Thread.java:619)
      

      Here is the bit of code inside the run method:

       922                   pkt = ackQueue.getFirst();
       923                   expected = pkt.seqno;
      

      So 'pkt' is null? But LinkedList API says that it throws NoSuchElementException if list is empty so you'd think we wouldn't get a NPE here. What am I missing?

      1. dn.log
        17 kB
        stack

        Activity

        Tom White made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Tsz Wo Nicholas Sze made changes -
        Component/s data-node [ 12312927 ]
        Hide
        Hairong Kuang added a comment -

        Stack, thanks a lot for your tests and effort on resolving the issue!

        Show
        Hairong Kuang added a comment - Stack, thanks a lot for your tests and effort on resolving the issue!
        stack made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 0.21.0 [ 12314046 ]
        Resolution Fixed [ 1 ]
        Hide
        stack added a comment -

        Resolving as fixed by HDFS-690. I just ran my tests with hdfs-690 in place and I no longer see NPEs. Thanks.

        Show
        stack added a comment - Resolving as fixed by HDFS-690 . I just ran my tests with hdfs-690 in place and I no longer see NPEs. Thanks.
        Hide
        Hairong Kuang added a comment -

        Yes, I agree that this is caused by unsynced accesses to the packet queue. Coincidentally I think the patch to HDFS-690 fixes the problem. Stack, could you please give it a try? Thanks.

        Show
        Hairong Kuang added a comment - Yes, I agree that this is caused by unsynced accesses to the packet queue. Coincidentally I think the patch to HDFS-690 fixes the problem. Stack, could you please give it a try? Thanks.
        Hide
        stack added a comment -

        Doing synch around each access to ackQueue gets rid of NPEs.

        Show
        stack added a comment - Doing synch around each access to ackQueue gets rid of NPEs.
        Hide
        stack added a comment -

        I'm trying it (The NPEs are easy to reproduce for me).

        Show
        stack added a comment - I'm trying it (The NPEs are easy to reproduce for me).
        Hide
        Tsz Wo Nicholas Sze added a comment -

        > So 'pkt' is null? But LinkedList API says that it throws NoSuchElementException if list is empty so you'd think we wouldn't get a NPE here. What am I missing?

        One possibility is that LinkedList does not support concurrent accesses. So LinkedList.getFirst() may return null if there is another thread removing list elements. This is my hypothesis.

        Show
        Tsz Wo Nicholas Sze added a comment - > So 'pkt' is null? But LinkedList API says that it throws NoSuchElementException if list is empty so you'd think we wouldn't get a NPE here. What am I missing? One possibility is that LinkedList does not support concurrent accesses. So LinkedList.getFirst() may return null if there is another thread removing list elements. This is my hypothesis.
        Hide
        stack added a comment -

        This might be cleaner example. Its first exception in this DN's log after loading started. Its like we skip/lose a packet because of the NPE?

        ...
        2009-10-21 18:23:11,323 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-7356834145770439479_1586 responded my status  for seqno 895
        2009-10-21 18:23:11,323 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving one packet for block blk_-7356834145770439479_1586 of length 65024 seqno 896 offsetInBlock 58025472 lastPacketInBlock false
        2009-10-21 18:23:11,323 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-7356834145770439479_1586 1 responded other status  for seqno 895
        2009-10-21 18:23:11,323 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 adding seqno 896 to ack queue.
        2009-10-21 18:23:11,324 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 got seqno = 896
        2009-10-21 18:23:11,325 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving one packet for block blk_-7356834145770439479_1586 of length 65024 seqno 897 offsetInBlock 58090496 lastPacketInBlock false
        2009-10-21 18:23:11,325 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 adding seqno 897 to ack queue.
        2009-10-21 18:23:11,325 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-7356834145770439479_1586 1 Exception java.lang.NullPointerException
            at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:923)
            at java.lang.Thread.run(Thread.java:619)
        
        2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 got seqno = 4546
        2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 seqno = 4546
        2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-830254393316092139_1588 responded my status  for seqno 4546
        2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-830254393316092139_1588 1 responded other status  for seqno 4546
        2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 got seqno = 4547
        2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 seqno = 4547
        2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving one packet for block blk_-7356834145770439479_1586 of length 65024 seqno 898 offsetInBlock 58155520 lastPacketInBlock false
        2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-830254393316092139_1588 responded my status  for seqno 4547
        2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-830254393316092139_1588 1 responded other status  for seqno 4547
        2009-10-21 18:23:11,327 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-7356834145770439479_1586 responded my status  for seqno -2
        2009-10-21 18:23:11,327 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-7356834145770439479_1586 1 responded other status  for seqno -2
        2009-10-21 18:23:11,327 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-7356834145770439479_1586 terminating
        ...
        

        We keep going with the download... then this on the finish up:

        2009-10-21 18:23:11,338 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-830254393316092139_1588 responded my status  for seqno 4555
        2009-10-21 18:23:11,338 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 184
        2009-10-21 18:23:11,338 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-830254393316092139_1588 1 responded other status  for seqno 4555
        2009-10-21 18:23:11,339 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock receive buf size 131071 tcp no delay true
        2009-10-21 18:23:11,339 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-7356834145770439479_1586 src: /XX,XX,XX.142:49468 dest: /XX,XX,XX.139:51010
        2009-10-21 18:23:11,339 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Recover the RBW replica blk_-7356834145770439479_1586
        2009-10-21 18:23:11,339 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Recovering replica ReplicaBeingWritten, blk_-7356834145770439479_1586, RBW
          getNumBytes()     = 58415616
          getBytesOnDisk()  = 58415616
          getVisibleLength()= 58025472
          getVolume()       = /d3/stack/dfs.data.dir/current/finalized
          getBlockFile()    = /d3/stack/dfs.data.dir/current/rbw/blk_-7356834145770439479
          bytesAcked=58025472
          bytesOnDisk=58415616
        
        Show
        stack added a comment - This might be cleaner example. Its first exception in this DN's log after loading started. Its like we skip/lose a packet because of the NPE? ... 2009-10-21 18:23:11,323 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-7356834145770439479_1586 responded my status for seqno 895 2009-10-21 18:23:11,323 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving one packet for block blk_-7356834145770439479_1586 of length 65024 seqno 896 offsetInBlock 58025472 lastPacketInBlock false 2009-10-21 18:23:11,323 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-7356834145770439479_1586 1 responded other status for seqno 895 2009-10-21 18:23:11,323 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 adding seqno 896 to ack queue. 2009-10-21 18:23:11,324 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 got seqno = 896 2009-10-21 18:23:11,325 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving one packet for block blk_-7356834145770439479_1586 of length 65024 seqno 897 offsetInBlock 58090496 lastPacketInBlock false 2009-10-21 18:23:11,325 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 adding seqno 897 to ack queue. 2009-10-21 18:23:11,325 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-7356834145770439479_1586 1 Exception java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:923) at java.lang. Thread .run( Thread .java:619) 2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 got seqno = 4546 2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 seqno = 4546 2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-830254393316092139_1588 responded my status for seqno 4546 2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-830254393316092139_1588 1 responded other status for seqno 4546 2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 got seqno = 4547 2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 seqno = 4547 2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving one packet for block blk_-7356834145770439479_1586 of length 65024 seqno 898 offsetInBlock 58155520 lastPacketInBlock false 2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-830254393316092139_1588 responded my status for seqno 4547 2009-10-21 18:23:11,326 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-830254393316092139_1588 1 responded other status for seqno 4547 2009-10-21 18:23:11,327 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-7356834145770439479_1586 responded my status for seqno -2 2009-10-21 18:23:11,327 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-7356834145770439479_1586 1 responded other status for seqno -2 2009-10-21 18:23:11,327 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-7356834145770439479_1586 terminating ... We keep going with the download... then this on the finish up: 2009-10-21 18:23:11,338 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-830254393316092139_1588 responded my status for seqno 4555 2009-10-21 18:23:11,338 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 184 2009-10-21 18:23:11,338 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-830254393316092139_1588 1 responded other status for seqno 4555 2009-10-21 18:23:11,339 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock receive buf size 131071 tcp no delay true 2009-10-21 18:23:11,339 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-7356834145770439479_1586 src: /XX,XX,XX.142:49468 dest: /XX,XX,XX.139:51010 2009-10-21 18:23:11,339 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Recover the RBW replica blk_-7356834145770439479_1586 2009-10-21 18:23:11,339 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Recovering replica ReplicaBeingWritten, blk_-7356834145770439479_1586, RBW getNumBytes() = 58415616 getBytesOnDisk() = 58415616 getVisibleLength()= 58025472 getVolume() = /d3/stack/dfs.data.dir/current/finalized getBlockFile() = /d3/stack/dfs.data.dir/current/rbw/blk_-7356834145770439479 bytesAcked=58025472 bytesOnDisk=58415616
        stack made changes -
        Field Original Value New Value
        Attachment dn.log [ 12422849 ]
        Hide
        stack added a comment -

        Here you go... shows transfer of blk_-7793125370314584661_1799. We get NPE. Then there is an EOFE that shows in application layer.

        There are other exceptions in these logs but they seem unrelated (Will file these when I figure how to present).

        Show
        stack added a comment - Here you go... shows transfer of blk_-7793125370314584661_1799. We get NPE. Then there is an EOFE that shows in application layer. There are other exceptions in these logs but they seem unrelated (Will file these when I figure how to present).
        Hide
        stack added a comment -

        @Hairong I agree but it seems easy enough to duplicate. I just did a rerun but this time have DEBUG on in DN logs so if you need anything extra just say. Let me add snippet from latest run.

        Show
        stack added a comment - @Hairong I agree but it seems easy enough to duplicate. I just did a rerun but this time have DEBUG on in DN logs so if you need anything extra just say. Let me add snippet from latest run.
        Hide
        Hairong Kuang added a comment -

        Stack, thanks a lot for your tests.

        This bug is very weird. Could you please check your logs to see if there are any other suspicious traces?

        Show
        Hairong Kuang added a comment - Stack, thanks a lot for your tests. This bug is very weird. Could you please check your logs to see if there are any other suspicious traces?
        Hide
        stack added a comment -

        dfs.support.append=true in this setup.

        Show
        stack added a comment - dfs.support.append=true in this setup.
        stack created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            stack
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development