Hadoop HDFS / HDFS-3031

HA: Fix complete() and getAdditionalBlock() RPCs to be idempotent.


    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: 2.0.2-alpha
    • Component/s: ha
    • Labels:


      I executed section 3.4 of Todd's HA test plan (https://issues.apache.org/jira/browse/HDFS-1623):
      1. A large file upload is started.
      2. While the file is being uploaded, the administrator kills the first NN and performs a failover.
      3. After the file finishes being uploaded, it is verified for correct length and contents.

      For the test, I used a VM template image, styx01:/home/schu/centos64-2-5.5.qcow2, as the large file. styx01 hosted the active NN and styx02 hosted the standby NN.

      As the attached log files show, I began the file upload on styx01:
      hadoop fs -put centos64-2.5.5.qcow2

      After waiting several seconds, I killed the active NN on styx01 with kill -9 and manually failed over to the NN on styx02. I then ran into the exception below (the rest of the stack trace is in the attached file styx01_uploadLargeFile).

      12/02/29 14:12:52 WARN retry.RetryInvocationHandler: A failover has occurred since the start of this method invocation attempt.
      put: Failed on local exception: java.io.EOFException; Host Details : local host is: "styx01.sf.cloudera.com/"; destination host is: ""styx01.sf.cloudera.com"\
      12/02/29 14:12:52 ERROR hdfs.DFSClient: Failed to close file /user/schu/centos64-2-5.5.qcow2.COPYING
      java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "styx01.sf.cloudera.com/"; destination host is: ""styx01.\
      at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
      at org.apache.hadoop.ipc.Client.call(Client.java:1145)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:188)
      at $Proxy9.addBlock(Unknown Source)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:302)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
      at $Proxy10.addBlock(Unknown Source)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1097)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:973)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:455)
      Caused by: java.io.EOFException
      at java.io.DataInputStream.readInt(DataInputStream.java:375)
      at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:830)
      at org.apache.hadoop.ipc.Client$Connection.run(Client.java:762)
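
      The trace above shows addBlock() being retried after the failover. As a rough illustration of what "idempotent" would mean for addBlock() and complete(), here is a toy in-memory model. It is not the NameNode code or the attached patch; the class IdempotentBlockAllocator, its methods, and the block-ID scheme are all hypothetical. The assumed rule is that the client passes the last block it knows about, so the server can tell a retry (client is one block behind) from a new request.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy namespace that tolerates retried addBlock() and complete() calls. */
class IdempotentBlockAllocator {
    /** Sentinel meaning "the client has no block yet". */
    static final long NO_PREVIOUS = -1L;

    private final Map<String, List<Long>> blocksByFile = new HashMap<>();
    private final Map<String, Boolean> closedByFile = new HashMap<>();
    private long nextBlockId = 1000L;

    void create(String path) {
        blocksByFile.put(path, new ArrayList<>());
        closedByFile.put(path, false);
    }

    /** Allocates a block after 'previous', or returns the already-allocated one on a retry. */
    long addBlock(String path, long previous) {
        List<Long> blocks = lookup(path);
        if (closedByFile.get(path)) {
            throw new IllegalStateException("file is closed: " + path);
        }
        int n = blocks.size();
        long last = n >= 1 ? blocks.get(n - 1) : NO_PREVIOUS;
        long penultimate = n >= 2 ? blocks.get(n - 2) : NO_PREVIOUS;
        if (previous == last) {
            // Normal case: the client is up to date, so allocate a new block.
            long id = nextBlockId++;
            blocks.add(id);
            return id;
        }
        if (n >= 1 && previous == penultimate) {
            // Retry: an earlier call already allocated 'last' but the reply was lost.
            // (A real system would also need to check that 'last' holds no data yet.)
            return last;
        }
        throw new IllegalStateException("unexpected previous block: " + previous);
    }

    /** Closes the file; repeating the call with the same last block succeeds again. */
    boolean complete(String path, long lastBlock) {
        List<Long> blocks = lookup(path);
        long last = blocks.isEmpty() ? NO_PREVIOUS : blocks.get(blocks.size() - 1);
        if (lastBlock != last) {
            throw new IllegalStateException("last block mismatch for " + path);
        }
        closedByFile.put(path, true); // no-op if the file was already closed
        return true;
    }

    private List<Long> lookup(String path) {
        List<Long> blocks = blocksByFile.get(path);
        if (blocks == null) {
            throw new IllegalArgumentException("no such file: " + path);
        }
        return blocks;
    }
}
```

      In this sketch, a client that repeats addBlock() or complete() against the new active server after a lost reply gets the same answer as the first call, rather than a second block or an error.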


        1. styx01_killNNfailover
          4 kB
          Stephen Chu
        2. styx01_uploadLargeFile
          7 kB
          Stephen Chu
        3. hdfs-3031.txt
          13 kB
          Todd Lipcon
        4. hdfs-3031.txt
          22 kB
          Todd Lipcon
        5. hdfs-3031.txt
          21 kB
          Todd Lipcon
        6. hdfs-3031.txt
          27 kB
          Todd Lipcon
        7. hdfs-3031.txt
          27 kB
          Aaron T. Myers
        8. hdfs-3031.txt
          27 kB
          Aaron T. Myers




              • Assignee:
                Todd Lipcon (tlipcon)
              • Reporter:
                Stephen Chu (schu)


                • Created: