HDFS-10504: DFSClient filesBeingWritten memory leak when client gets RemoteException - could only be replicated to 0 nodes instead of minReplication (=1)


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.7.2
    • Fix Version/s: None
    • Component/s: hdfs-client
    • Labels: None
    • Environment: linux

    Description

      I'm trying to migrate data from NFS to HDFS. I have about 2 million small files. The migration takes about 4 hours in my environment, but I randomly get an exception during it; I got 12 of them during the test (stack below).

      When I get the exception, I sleep for one second and then check whether the file is there (the API says yes, but its reported size is zero bytes). So I remove the file and start writing it again, and at that point it succeeds.
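      A minimal sketch of that recovery loop, assuming the standard org.apache.hadoop.fs.FileSystem API (the helper names and the byte[] payload are hypothetical; the real client streams the data from NFS):

      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class MigrationSketch {

        // One write attempt; try-with-resources guarantees close() is called.
        static void writeOnce(FileSystem fs, Path dst, byte[] data) throws IOException {
          try (FSDataOutputStream out = fs.create(dst, true /* overwrite */)) {
            out.write(data);
          }
        }

        // The workaround described above: back off, delete the zero-byte
        // leftover the failed write leaves behind, then retry once.
        static void writeWithRetry(FileSystem fs, Path dst, byte[] data)
            throws IOException, InterruptedException {
          try {
            writeOnce(fs, dst, data);
          } catch (IOException e) {
            Thread.sleep(1000L); // sleep for one second
            if (fs.exists(dst) && fs.getFileStatus(dst).getLen() == 0) {
              fs.delete(dst, false); // remove the zero-byte file
            }
            writeOnce(fs, dst, data); // the second attempt succeeds
          }
        }
      }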

      Here is the stack:
      org.apache.hadoop.ipc.RemoteException(java.io.IOException): File xxx/xxx/xxx could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
      at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1592)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3158)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3082)
      at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:822)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
      at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)

      at org.apache.hadoop.ipc.Client.call(Client.java:1475)
      at org.apache.hadoop.ipc.Client.call(Client.java:1412)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
      at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
      at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:497)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
      at com.sun.proxy.$Proxy11.addBlock(Unknown Source)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1459)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1255)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)

      When I write, I use try-with-resources, which should call the close() method on the FSDataOutputStream. That triggers dfsClient.endFileLease(fileId), which should remove the reference from DFSClient.filesBeingWritten:
      synchronized (filesBeingWritten) {
        filesBeingWritten.remove(inodeId);
        if (filesBeingWritten.isEmpty()) {
          lastLeaseRenewal = 0;
        }
      }
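      My reading of this (an assumption; I have not verified it against the HDFS source) is that close() flushes the last block before releasing the lease, so when the flush throws, endFileLease() is never reached and the map entry survives. A toy model of that ordering, not the actual DFSClient code:

      import java.io.IOException;
      import java.util.HashMap;
      import java.util.Map;

      // Toy model of the suspected leak: an exception during the flush step
      // skips the cleanup that endFileLease() would have performed.
      class LeaseTrackerModel {
        private final Map<Long, Object> filesBeingWritten = new HashMap<>();

        void open(long inodeId, Object stream) {
          synchronized (filesBeingWritten) {
            filesBeingWritten.put(inodeId, stream);
          }
        }

        void close(long inodeId) throws IOException {
          flushLastBlock(inodeId); // throws when no datanode accepts the block
          endFileLease(inodeId);   // never runs on failure, so the entry leaks
        }

        private void flushLastBlock(long inodeId) throws IOException {
          throw new IOException("could only be replicated to 0 nodes");
        }

        private void endFileLease(long inodeId) {
          synchronized (filesBeingWritten) {
            filesBeingWritten.remove(inodeId);
          }
        }
      }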

      But when the process finishes, I get:

      2016-06-07 22:26:54,734 - ERROR [Thread-3] (DFSClient.closeAllFilesBeingWritten:940) - Failed to close inode 1675022
      org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /xxx/xxx/xxx could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
      at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1592)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3158)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3082)
      at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:822)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
      at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)

      Now, when there is no space on the datanode, I get this error a lot, which causes my migration Java client to die with an OutOfMemoryError: the DFSClient.filesBeingWritten map alone takes almost 1 GB of heap.
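      A mitigation I would try (my assumption, not a confirmed fix): create the FileSystem with FileSystem.newInstance(), which bypasses the shared cache, and recycle it every N files so the DFSClient carrying the bloated map can be closed and garbage collected. copyOneFile() and sources are hypothetical stand-ins for the real NFS-to-HDFS copy logic:

      import java.io.IOException;
      import java.util.List;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class RecyclingMigration {

        static void migrate(Configuration conf, List<Path> sources) throws IOException {
          FileSystem fs = FileSystem.newInstance(conf); // uncached, safe to close
          int written = 0;
          try {
            for (Path src : sources) {
              copyOneFile(fs, src);
              if (++written % 100_000 == 0) {
                fs.close(); // may log "Failed to close inode ..." for leaked streams
                fs = FileSystem.newInstance(conf); // fresh DFSClient, empty map
              }
            }
          } finally {
            fs.close();
          }
        }

        static void copyOneFile(FileSystem fs, Path src) throws IOException {
          // per-file copy elided
        }
      }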


    People

      Assignee: Unassigned
      Reporter: Seb Mo (sebyonthenet)
      Votes: 0
      Watchers: 2
