Hadoop HDFS
HDFS-278

Should DFS outputstream's close wait forever?

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Currently DFSOutputStream.close() waits forever if the Namenode keeps throwing NotYetReplicated exceptions, for whatever reason. It's pretty annoying for a user. Should the loop inside close have a timeout? If so, how much? It could probably be something like 10 minutes.
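
      A minimal sketch of the kind of bounded wait the description asks about, assuming a hypothetical blockFullyReplicated() check and the illustrative 10-minute figure; this is not the actual DFSOutputStream code.

      import java.io.IOException;

      // Sketch only: bound close()'s retry loop with a wall-clock timeout
      // instead of looping forever on NotYetReplicated-style failures.
      abstract class BoundedCloseSketch {
        private static final long CLOSE_TIMEOUT_MS = 10 * 60 * 1000L; // ~10 minutes

        // Hypothetical stand-in for "has the namenode accepted the last block?"
        abstract boolean blockFullyReplicated() throws IOException;

        void waitForReplicationOnClose() throws IOException {
          long start = System.currentTimeMillis();
          while (!blockFullyReplicated()) {
            if (System.currentTimeMillis() - start > CLOSE_TIMEOUT_MS) {
              throw new IOException("close() timed out waiting for replication");
            }
            try {
              Thread.sleep(400);                      // back off before retrying
            } catch (InterruptedException ie) {
              Thread.currentThread().interrupt();
              throw new IOException("close() interrupted");
            }
          }
        }
      }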

      1. softMount9.txt
        7 kB
        dhruba borthakur
      2. softMount8.txt
        7 kB
        dhruba borthakur
      3. softMount7.txt
        4 kB
        dhruba borthakur
      4. softMount6.txt
        4 kB
        dhruba borthakur
      5. softMount5.txt
        7 kB
        dhruba borthakur
      6. softMount4.txt
        7 kB
        dhruba borthakur
      7. softMount3.patch
        8 kB
        dhruba borthakur
      8. softMount2.patch
        9 kB
        dhruba borthakur
      9. softMount1.patch
        8 kB
        dhruba borthakur
      10. softMount1.patch
        9 kB
        dhruba borthakur

        Issue Links

          Activity

          Raghu Angadi created issue -
          Raghu Angadi made changes -
          Field Original Value New Value
          Component/s dfs [ 12310710 ]
          Lohit Vijayarenu added a comment -

          When DFS client commands are used along with other hadoop commands, say a job submission within a script, the infinite waiting in DFSClient causes the whole script to hang. It would be really useful to have a timeout and report the failure to the user.

          dhruba borthakur made changes -
          Assignee dhruba borthakur [ dhruba ]
          dhruba borthakur added a comment -

          I would like to provide a test case that demonstrates this behaviour, and also a fix that makes any dfsclient method return an error if the dfsclient was unable to communicate with the dfs servers for a configured period of time.

          In the current implementation, the dfs client is like an nfs hard-mount: it hangs when the servers die. This is a problem for most long-running applications that use hdfs to store data. I would implement an option that makes the dfs client behave like a soft nfs mount.

          dhruba borthakur added a comment -

          This patch introduces a configuration parameter called dfs.leaserenewal.timeout. If the dfsclient is unable to contact the dfs servers for this much time, then the dfsclient marks all open streams as closed and releases all resources associated with those streams. Any operation on one of these streams results in an exception to the application.
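
          As a rough illustration, a client might read a timeout like the one described above as shown below; the key name comes from this comment, while the default value is only an assumption, not taken from the patch.

          import org.apache.hadoop.conf.Configuration;

          public class LeaseTimeoutConfigSketch {
            public static void main(String[] args) {
              Configuration conf = new Configuration();
              // "dfs.leaserenewal.timeout" is the key named in the comment;
              // the 60-second default here is purely illustrative.
              long leaseRenewalTimeoutMs =
                  conf.getLong("dfs.leaserenewal.timeout", 60 * 1000L);
              System.out.println("lease renewal timeout (ms): " + leaseRenewalTimeoutMs);
            }
          }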

          dhruba borthakur made changes -
          Attachment softMount1.patch [ 12405605 ]
          dhruba borthakur added a comment -

          Enhances the unit test to print more debug statements.

          dhruba borthakur made changes -
          Attachment softMount1.patch [ 12405626 ]
          dhruba borthakur added a comment -

          Can some kind soul please review this patch? This patch is essential to make non-map-reduce type of jobs run on HDFS.

          Hairong Kuang added a comment -

          I do not think this patch solves the infinite loop problem with close. It seems to me the patch targets a different problem. Do you know why your client hangs? Is it possible that your client hit HADOOP-4703?

          dhruba borthakur added a comment -

          I am testing the case where clients hang if the server(s) go down. If the namenode goes down, the dfsclient fails to renew leases and marks clientRunning as false. This should cause close() to bail out.

          In the case when all the datanode(s) in a pipeline go down but the namenode is alive, the close still hangs. Is this the case you are referring to?

          dhruba borthakur added a comment -

          I took Hairong's suggestion and made file close adhere to the softMount period too.

          dhruba borthakur made changes -
          Attachment softMount2.patch [ 12406462 ]
          Hairong Kuang added a comment -

          > In the case when all the datanode(s) in a pipeline go down but the namenode is alive, the close still hangs. is this the case you are referring to?
          Yes, if the dfs client hang is caused by down datanodes, then HADOOP-4703 should fix the problem. Could you please double-check whether this is the case?

          dhruba borthakur added a comment -

          Good point about HADOOP-4703. I think the patch for HADOOP-4703 helps when the datanode(s) in the pipeline are dead. I am actually using 0.19, so it already has the fix for HADOOP-4703.

          This patch implements a soft-mount type of client. If the cluster is behaving badly (datanodes down and/or namenode also down), the client will never hang. The client calls will return an error and the application can handle it according to its needs. This patch is necessary when the app wants to handle real-time-ish loads.

          Hairong Kuang added a comment -

          This patch only works when the client has a writer and the NN hangs. If the client does not have a writer, or if the client has a writer but the datanodes in the pipeline hang, this patch does not solve the problem. Is this true? Maybe it makes sense to do it in the RPC clients. Currently an RPC client sends a ping to a server if it does not get a reply in one minute. The idea you use to handle lease renewal can be used to handle the ping message.

          dhruba borthakur added a comment -

          > if the client has a write but datanodes in the pipeline hang, this patch does not solve the problem

          You are right.

          > Maybe it makes sense to do it in RPC clients.

          The clients currently use the streaming API to send data to the datanode(s). Are you saying that the client should periodically ping each of the datanode(s) using an RPC?

          Hairong Kuang added a comment -

          For socket communications, dfs has read/write timeouts. If you set the read/write timeout to a small number, will this solve your problem?

          dhruba borthakur added a comment -

          You are referring to dfs.datanode.socket.write.timeout. These are configurable parameters and I already set them to an appropriate number, e.g. 20 seconds, because I want real-time-ish behaviour.

          If all the datanode(s) in the pipeline die, then the client detects an error and aborts. That is intended behaviour. If one datanode is not really dead (but hangs), then the client will hang too. This patch does not fix that problem.

          The main motivation for this patch is to detect namenode failures early. If a client is writing to a block, it might take a while for the block to get filled up; this time depends on the rate at which the client is writing data. If the client is trickling data into the block, it will not experience the dfs.datanode.socket.write.timeout timeout for a while. In the existing code in trunk, the lease recovery thread will detect the NN problem after a while, but it does nothing to terminate the threads that were writing to the block. The patch does this.
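
          For illustration, setting the write timeout mentioned above to the 20-second figure from this comment could look like the sketch below; the assumption that the key is in milliseconds is mine, not something stated in the thread.

          import org.apache.hadoop.conf.Configuration;

          public class WriteTimeoutSketch {
            public static void main(String[] args) {
              Configuration conf = new Configuration();
              // 20 seconds, as mentioned in the comment; milliseconds assumed.
              conf.setInt("dfs.datanode.socket.write.timeout", 20 * 1000);
            }
          }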

          Hairong Kuang added a comment -

          I do not understand the problem that this patch is trying to solve. You claim that this patch provides a soft-mount feature for dfs, but the major change is made to LeaseChecker.

          1. LeaseChecker does not get to run if there is no writer, so this patch does not solve the problem if a non-writing call to dfs hangs.
          2. In the writing case, if the NN hangs, the renew call will hang too. I do not see how this patch can detect the hanging NN.

          That's why I suggest that the change should be made to the IPC Client, which sends a ping to detect whether the NN is alive when it does not get a response back in time. The disadvantage of my proposal is that it will also affect Map/Reduce.

          dhruba borthakur added a comment -

          A configuration parameter ipc.client.rpctimeout that makes an RPC not hang for more than this period while waiting for a response from the remote end.

          A configuration parameter dfs.softmount.timeout that allows a dfs writer to exit gracefully if the dfsclient is unable to talk to the namenode for this period of time.
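
          A hedged sketch of how a client might set these two keys; the key names are the ones proposed above, and the values shown are illustrative only, not defaults from the patch.

          import org.apache.hadoop.conf.Configuration;

          public class SoftMountConfigSketch {
            public static void main(String[] args) {
              Configuration conf = new Configuration();
              // Give up on a single RPC after 30 seconds without a response.
              conf.setInt("ipc.client.rpctimeout", 30 * 1000);
              // Give up on a whole DFS operation (e.g. close) after 60 seconds.
              conf.setLong("dfs.softmount.timeout", 60 * 1000L);
            }
          }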

          dhruba borthakur made changes -
          Attachment softMount3.patch [ 12408429 ]
          Hairong Kuang added a comment -

          1. rpc timeout: the patch seems to implement a read timeout, not an rpc timeout. If we want to implement an rpc timeout, the start wait time should be set at the beginning of receiveResponse.
          2. If we have an rpc timeout, why do we still need the soft mount timeout in leaseChecker?
          3. I think the check "if (now > last + softMountTimeout)" could easily be true in normal cases if renewFrequency is set to the soft mount timeout.
          3a. I feel that the meaning of the soft mount timeout is not clear, maybe because I am not familiar with NFS terms. Does it limit the maximum time of each file system method, or the maximum time of each call to the NN? First of all, this patch applies the timeout to only two cases: file close and leaseChecker. The file close case seems to limit the maximum time of a filesystem method. The leaseChecker case seems to limit the maximum time of an rpc, although not exactly.
          4. In the file close case, would it be better just to limit the number of retries?

          dhruba borthakur added a comment -

          > 1. rpc timeout: the patch seems to implement a read timeout not rpc timeout.

          As an administrator of a cluster, I find it easier to set a time limit for an rpc connection to bail out if it is not receiving response data continuously. I could change it to a true rpcTimeout, but RPCs like "dfsadmin -report" could truly take a long time because the amount of data to be transferred might be huge, depending on the size of the cluster. I am comfortable configuring a cluster such that if an rpc client has been waiting for more data from the rpc server for more than 30 seconds, the client can safely assume that the server is non-responsive. This works even for RPCs that have to transfer large amounts of data. Do you agree?

          > 2. if we have rpc timeout, why we still need soft mount timeout in leaseChecker?
          I think we need these two things to be separate. Please see answer to 3a below.

          > 3. I think the check "if (now > last + softMountTimeout)" could easily be true in normal cases if renewFrequency is set to the soft mount timeout.
          The code sets renewFrequency to be softMountTimeout/3, so "renewFrequency set to the soft mount timeout" cannot happen. But I will modify this portion of the code to handle this case better.

          > 3a. I feel that the meaning of soft mount timeout is not clear maybe
          The NFS manual says something like this: "The softmount timeout sets the time the NFS client will wait for a request to complete."
          To make things clearer, this patch keeps two configuration values:
          ipc.client.inactivity.timeout: the period of inactivity allowed while a client is waiting for a response.
          dfs.softmount.timeout: the maximum time a DFSClient will wait for a request to successfully complete.
          ipc.client.inactivity.timeout applies to a single rpc call. dfs.softmount.timeout applies to FileSystem operations like DFSClient.close().

          > 4. In the file close case, would it be better just to limit the number of retries?
          In fact, I first deployed a version of the code in our cluster that specified the max number of retries as 5. But when I was explaining this behaviour to an app-writer who is building an app on top of hdfs, it was difficult to explain what it really means. I found it easier to explain that "this call will not take more than 30 seconds". Also, specifying a "time" is future-proof in the sense that an hdfs developer can change the frequency of close-retries without affecting the semantics exposed to the user. If you feel strongly against this one, I can change it; please do let me know.

          thanks for reviewing this one.
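
          To make the renewal discussion above concrete, here is a rough sketch of the check being debated, with renewFrequency = softMountTimeout/3 as described in the reply; renewLease() is a hypothetical stand-in for the real RPC, and this is not the patch's actual LeaseChecker code.

          // Sketch of the soft-mount check discussed above; not the real LeaseChecker.
          abstract class LeaseRenewalSketch {
            private final long softMountTimeout = 30 * 1000L;         // illustrative value
            private final long renewFrequency = softMountTimeout / 3; // as in the reply
            private volatile boolean clientRunning = true;

            // Hypothetical stand-in for the renewLease() RPC to the namenode.
            abstract void renewLease() throws java.io.IOException;

            void run() throws InterruptedException {
              long last = System.currentTimeMillis();                 // last successful renewal
              while (clientRunning) {
                try {
                  renewLease();
                  last = System.currentTimeMillis();
                } catch (java.io.IOException e) {
                  long now = System.currentTimeMillis();
                  if (now > last + softMountTimeout) {
                    clientRunning = false;                            // soft-mount period expired
                    break;
                  }
                }
                Thread.sleep(renewFrequency);
              }
            }
          }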

          Hairong Kuang added a comment -

          > As an administrator of a cluster, I find it easier to set a time limit for a rpc conection to bail out if it is not receiving response data continuously.

          I am not sure all administrators want this. It would revert what we did in HADOOP-2188. The IPC client already has a configured read timeout. If you do want a timeout on read, maybe it is better to have a configuration setting for whether the client needs a Ping or not. There is no need to have multiple read-timeout configurations.

          Suppose an RPC can fail on SocketTimeoutException; why do you need a timeout on a single RPC? Why can't the client close the lease on SocketTimeoutException? I think one timeout and a hard retry limit on close will serve your purpose well. Why do we need so many different layers of timeouts? Maybe I missed something.

          BTW, the configurations inactivity.timeout and softmount.timeout are not general at all. One is only for the leasechecker and another is only for close.

          dhruba borthakur added a comment -

          1. Introduced a new boolean configuration variable called ipc.client.ping that specifies whether the rpc client is supposed to send a periodic ping or not.
          2. A dfs writer bails out if renewLease() fails once.
          3. DFSOutputStream.close() retries the close call as long as it is within the time specified by ipc.ping.interval.
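
          A small sketch of how a client could read the two keys named above; the key names come from this comment (and HADOOP-6099), while the defaults shown are assumptions for illustration.

          import org.apache.hadoop.conf.Configuration;

          public class PingConfigSketch {
            public static void main(String[] args) {
              Configuration conf = new Configuration();
              // Whether the rpc client sends periodic pings (default assumed true).
              boolean doPing = conf.getBoolean("ipc.client.ping", true);
              // Interval that also bounds the close() retries (default assumed 1 min).
              int pingIntervalMs = conf.getInt("ipc.ping.interval", 60 * 1000);
              System.out.println("ping=" + doPing + ", interval(ms)=" + pingIntervalMs);
            }
          }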

          dhruba borthakur made changes -
          Attachment softMount4.txt [ 12409919 ]
          Kan Zhang added a comment -

          @dhruba, in your new patch, in DFSClient.java, the following added code won't get executed. Are you trying to replace the if clause before it?

          +      if (closed) {
          +        return;
          +      }
          +
          
          dhruba borthakur added a comment -

          Thanks for the catch, Kan. That piece of code remained there because of recent changes in trunk. I fixed it and uploaded a new patch. Please let me know your review comments.

          dhruba borthakur made changes -
          Attachment softMount5.txt [ 12410056 ]
          Owen O'Malley made changes -
          Project Hadoop Common [ 12310240 ] HDFS [ 12310942 ]
          Key HADOOP-2757 HDFS-278
          Component/s dfs [ 12310710 ]
          dhruba borthakur made changes -
          Link This issue is blocked by HADOOP-6099 [ HADOOP-6099 ]
          dhruba borthakur added a comment -

          Merged the patch with the refactored trunk. This depends on HADOOP-6099.

          dhruba borthakur made changes -
          Attachment softMount6.txt [ 12411512 ]
          Konstantin Shvachko added a comment -
           1. dfsTimeout -> hdfsTimeout
           2. "shutting down dfs client" -> "shutting down HDFS client"
             Could also be "the client", just not "dfs".
           3. Same for some other (new only) log messages, like "dfs servers".
           4. I see that in LeaseChecker.run() you are trying to close the whole client on any IOException. Two things here:
             1. You close (abort) only the LeaseChecker and the RPC server, but not the DFSClient. Is that a problem?
             2. Did you mean to catch only exceptions related to timeouts? The current behavior is that the checker logs the exception and continues renew attempts.
               Maybe you wanted to abort the LeaseChecker when dfsTimeout expires? I am a bit confused here.
           5. I thought you had a unit test for that, no?
           6. Why do you need the additional configuration parameter ipc.client.ping?
             Can you just set ipc.ping.interval = -1 and treat it as no ping?
             Then, by default, since the ipc.ping.interval parameter is unofficial it will still equal 1 min,
             and if you want to turn pings off you will need to specify it. Everything looks consistent with your plan.
             I mean, fewer parameters is better, right?
          dhruba borthakur added a comment -

          Thanks for the review Konstantin. I will work on it and post a new patch soon.

          dhruba borthakur added a comment -

          1. Changed dfsTimeout to hdfsTimeout.
          2. Changed "dfs client" to "HDFS client".
          3. Changed "dfs server" to "HDFS servers".
          4a. I set clientRunning to false. This essentially means that the DFSClient is practically shut down. I do not call DFSClient.close() because of locking-hierarchy issues.
          4b. I changed the code to abort only on SocketTimeoutException. For other exceptions, the behaviour should be the same as before.
          5. I will provide a unit test very shortly.
          6. The ping issue has moved to HADOOP-6099. I have some comments in that JIRA; please see if you can review it.

          Thanks for reviewing this once again.

          dhruba borthakur made changes -
          Attachment softMount7.txt [ 12412052 ]
          Konstantin Shvachko added a comment -
           1. In (4) I meant that in LeaseChecker.run(), if you move clientRunning = false; and RPC.stopProxy() - the two lines surrounding abort() - inside abort(), then you will not need the synchronized section around these three lines, because abort() is synchronized already.
           2. It is better to use LOG.warn(message, exception) instead of LOG.warn(message + exception + anotherMessage), since you will see the stack trace in the former case. I see you replaced the original usage with the latter in two places (a short sketch of the preferred pattern follows below).

          +1 other than that.
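
          A minimal sketch of the logging pattern recommended in point 2, assuming the commons-logging Log used by Hadoop classes of that era; the message text is made up.

          import org.apache.commons.logging.Log;
          import org.apache.commons.logging.LogFactory;

          public class LogWarnSketch {
            private static final Log LOG = LogFactory.getLog(LogWarnSketch.class);

            static void report(Exception e) {
              // Preferred: pass the exception as the second argument so the
              // full stack trace ends up in the log.
              LOG.warn("Problem renewing lease, will retry", e);
              // Discouraged: string concatenation drops the stack trace.
              LOG.warn("Problem renewing lease " + e + ", will retry");
            }
          }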

          dhruba borthakur added a comment -

          I incorporated both of Konstantin's review comments and added a unit test.

          dhruba borthakur made changes -
          Attachment softMount8.txt [ 12412843 ]
          dhruba borthakur added a comment -

          Trigger Hudson QA tests.

          dhruba borthakur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12412843/softMount8.txt
          against trunk revision 790733.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/7/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/7/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/7/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          Re-triggering Hudson QA.

          dhruba borthakur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          dhruba borthakur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Fix Version/s 0.21.0 [ 12314046 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12412843/softMount8.txt
          against trunk revision 792881.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/16/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/16/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/16/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          I do not think the Findbugs failure is caused by this patch. If anybody has any clues, please let me know.

          Tsz Wo Nicholas Sze added a comment -
          +    this.hdfsTimeout = Client.getTimeout(conf);
          

          The line above cannot be compiled because the jar file has not yet been updated for HADOOP-6099.

          dhruba borthakur added a comment -

          Thanks for the tip, Nicholas. I will compile the latest trunk from common and check in the files lib/hadoop-core-0.21.0-dev.jar and lib/hadoop-core-test-0.21.0-dev.jar.

          dhruba borthakur added a comment -

          Merged patch with latest trunk

          dhruba borthakur made changes -
          Attachment softMount9.txt [ 12413239 ]
          dhruba borthakur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          dhruba borthakur added a comment -

          Triggering Hudson QA tests.

          dhruba borthakur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          dhruba borthakur added a comment -

          [exec] +1 overall.
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          dhruba borthakur added a comment -

          All unit tests passed except TestHDFSCLI and TestNNThroughputBenchmark.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12413239/softMount9.txt
          against trunk revision 793365.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/18/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/18/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/18/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/18/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          The unit test that failed is TestHDFSCLI, and it also fails in hdfs trunk (without this patch). So, this patch is ready to commit.

          dhruba borthakur added a comment -

          I just committed this.

          dhruba borthakur made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #23 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/23/)

          Tom White made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          dhruba borthakur made changes -
          Link This issue relates to HDFS-951 [ HDFS-951 ]
          zguoqin made changes -
          Link This issue relates to HADOOP-9564 [ HADOOP-9564 ]
          Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
          Patch Available -> Open | 4d 6h 5m | 2 | dhruba borthakur | 12/Jul/09 16:40
          Open -> Patch Available | 523d 13h 43m | 3 | dhruba borthakur | 12/Jul/09 16:40
          Patch Available -> Resolved | 14h 57m | 1 | dhruba borthakur | 13/Jul/09 07:38
          Resolved -> Closed | 407d 14h 9m | 1 | Tom White | 24/Aug/10 21:47

            People

            • Assignee:
              dhruba borthakur
              Reporter:
              Raghu Angadi
            • Votes:
              0
              Watchers:
              4
