Hadoop HDFS

HDFS-3703: Decrease the datanode failure detection time

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.3, 2.0.0-alpha, 3.0.0
    • Fix Version/s: 1.1.0, 2.0.3-alpha
    • Component/s: datanode, namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      This jira adds a new DataNode state called "stale" at the NameNode. A DataNode is marked as stale if it does not send a heartbeat message to the NameNode within the timeout configured using the parameter "dfs.namenode.stale.datanode.interval" (default value is 30 seconds). The NameNode picks a stale DataNode as the last target to read from when returning block locations for reads.

      This feature is turned off by default. To turn it on, set the HDFS configuration "dfs.namenode.check.stale.datanode" to true.
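      As a reference, a minimal sketch of turning the feature on through the Hadoop Configuration API (an assumed usage example, not code from this issue's patch):

      import org.apache.hadoop.conf.Configuration;

      // Sketch: enable stale-datanode detection using the keys named in the release note.
      // The interval key "dfs.namenode.stale.datanode.interval" keeps its default unless overridden.
      public class EnableStaleDetection {
        public static void main(String[] args) {
          Configuration conf = new Configuration();
          conf.setBoolean("dfs.namenode.check.stale.datanode", true);
          System.out.println(conf.getBoolean("dfs.namenode.check.stale.datanode", false));
        }
      }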

      Description

      By default, if a box dies, the datanode is marked as dead by the namenode only after 10 minutes and 30 seconds. In the meantime, this datanode is still proposed by the namenode as a target to write blocks or to read replicas. The same happens if the datanode process crashes: there is no shutdown hook to tell the namenode we're not there anymore.
      It is especially an issue with HBase. The HBase regionserver timeout for production is often 30s. So with these configs, when a box dies HBase starts to recover after 30s while, for 10 more minutes, the namenode will still consider the blocks on that box as available. Beyond the write errors, this will trigger a lot of missed reads:

      • during the recovery, HBase needs to read the blocks used on the dead box (the ones in the 'HBase Write-Ahead-Log')
      • after the recovery, reading these data blocks (the 'HBase region') will fail 33% of the time with the default number of replicas, slowing down data access, especially when the errors are socket timeouts (i.e. around 60s most of the time).

      Globally, it would be ideal if HDFS settings could be under HBase settings.
      As a side note, HBase relies on ZooKeeper to detect regionservers issues.

      1. HDFS-3703.patch
        16 kB
        Jing Zhao
      2. HDFS-3703-branch2.patch
        18 kB
        Nicolas Liochon
      3. HDFS-3703-trunk-with-write.patch
        18 kB
        Jing Zhao
      4. HDFS-3703-trunk-read-only.patch
        16 kB
        Jing Zhao
      5. HDFS-3703-trunk-read-only.patch
        16 kB
        Jing Zhao
      6. HDFS-3703-trunk-read-only.patch
        18 kB
        Jing Zhao
      7. HDFS-3703-trunk-read-only.patch
        18 kB
        Jing Zhao
      8. HDFS-3703-trunk-read-only.patch
        18 kB
        Jing Zhao
      9. HDFS-3703-trunk-read-only.patch
        22 kB
        Jing Zhao
      10. 3703-hadoop-1.0.txt
        15 kB
        Ted Yu
      11. HDFS-3703-trunk-read-only.patch
        22 kB
        Jing Zhao
      12. HDFS-3703-branch-1.1-read-only.patch
        13 kB
        Jing Zhao
      13. HDFS-3703-branch-1.1-read-only.patch
        13 kB
        Jing Zhao


          Activity

          stack added a comment -

          ...and writing the dead nodes to a file and kicking the namenode to read said file so the dead node is excluded when allocating nodes for replicas (the decommissioning facility) won't do in this case, I'd say.

          Suresh Srinivas added a comment -

          Globally, it would be ideal if HDFS settings could be under HBase settings.

          Can you describe this better?

          HBase relies on ZooKeeper to detect regionservers issues

          Whether it is ZooKeeper or the Datanode heartbeat to the Namenode, at a high level the mechanisms are similar. The problem is one of choosing the right timeout. Currently this is configurable in HDFS, and 10 minutes is chosen as the timeout. I suggest running some experiments with setting this to a more aggressive value. I agree that this is a very conservative time, but false positives here could result in a replication storm.

          Nicolas Liochon added a comment -

          Can you describe this better?

          If we see this in layers, we've got three layers:
          1) Hardware
          2) HDFS
          3) HBase

          Here, layer 3 knows/guesses that layer 1 is dead, while the layer in the middle does not know it. That's not a perfect example of encapsulation. HBase is saying to HDFS 'you know, I want some blocks, but maybe this datanode is not good, I'm not sure, but don't use it please'. Kind of strange (but useful short term).

          Today, when there is a global issue, HBase starts its recovery while HDFS is still ignoring the issue. It leads to a nightmare of socket exceptions all over the place, as HBase is directed to dead nodes again and again. HDFS should know what's going on before HBase does. So if HBase is set with a timeout of 30s, HDFS should have 20s or something like this.

          Whether ZooKeeper or Datanode heartbeat to Namenode, at a high level mechanisms are similar.

          Fully agreed. Just that if the issue comes from ZK or the ZK links, HBase and HDFS would have a similar view of the situation (maybe a wrong view, but the same view). On the other hand, there are possible improvements, not available in ZK, but hopefully available one day when there will be more code to share (I'm thinking about ZOOKEEPER-702). Also, still long term, ZK creates one TCP connection per monitored process. If multiple Hadoop processes share the same tech, it will make sense to have a shared component on each computer to lower the number of connections. I'm not aware of anything on this subject in ZK, so that's science fiction today. I've got other stuff like this in mind, but you get the idea.

          So, I fully agree with your main point: today the real issue is the right timeout.

          The problem is one of choosing the right timeout. Currently this is configurable in HDFS, and 10 minutes is chosen as the timeout. I suggest running some experiments with setting this to a more aggressive value. I agree that this is a very conservative time, but false positives here could result in a replication storm.

          Agreed; even with the current setting, people have had issues in the past. 10 minutes seems to be a reasonable, real-world-validated timeout for re-replicating. I don't think it's a good idea to make it lower. However, I think it would be good to have a middle state between fully available and definitively dead: the non-responding nodes could be removed from the target list for new blocks and de-prioritized for reads.

          Suresh Srinivas added a comment -

          Thanks for answering my questions. Here is something we could do:

          1. If more than a certain threshold T has passed since the namenode last processed a heartbeat from a datanode, consider it stale (see the sketch after this list). For writes, do not use such datanodes (if possible). For reads, use such datanodes as the last datanode in the pipeline.
          2. Have a configuration option for T. If it is not configured, the feature is turned off.
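          A minimal illustrative sketch of such a staleness check (not the committed patch; the names below are simplified stand-ins):

          /** Sketch only: a datanode is "stale" if the namenode has not processed
           *  a heartbeat from it within the threshold T (milliseconds). */
          class StaleCheckSketch {
            static boolean isStale(long lastHeartbeatMillis, long staleIntervalMillis) {
              return System.currentTimeMillis() - lastHeartbeatMillis > staleIntervalMillis;
            }
          }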

          If we do this, do we still need HDFS-3705 and HDFS-3706?

          Eli Collins added a comment -

          Btw the formula for determining whether a DN is dead is 2 * dfs.namenode.heartbeat.recheck-interval + 10 * 1000 * dfs.heartbeat.interval; the defaults are 5 * 60 * 1000 (ms) and 3 (seconds) respectively.
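          As a worked example of that formula (simple arithmetic on the quoted defaults, not code from the patch), the default dead-node interval comes out to 10 minutes 30 seconds:

          // Worked example: default "dead datanode" interval from the formula above.
          public class DeadIntervalExample {
            public static void main(String[] args) {
              long recheckIntervalMs = 5 * 60 * 1000; // dfs.namenode.heartbeat.recheck-interval default
              long heartbeatIntervalSec = 3;          // dfs.heartbeat.interval default
              long deadIntervalMs = 2 * recheckIntervalMs + 10 * 1000 * heartbeatIntervalSec;
              System.out.println(deadIntervalMs + " ms"); // 630000 ms = 10 minutes 30 seconds
            }
          }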

          Note that it's a large value by default for a reason, e.g. so that a network hiccup doesn't introduce a replication storm that results in cascading failures.

          Would a "decommission DN immediately" NN API suffice for your use case?

          Colin Patrick McCabe added a comment -

          It seems like the assumption behind having a "third state" (between dead and alive) is that re-replication consumes a lot of bandwidth and could lead to cascading failures, whereas inbound client traffic won't cause cascading failures. Do we know for a fact that this is always true? Maybe it's true for certain use cases.

          There's also a control-theory aspect to this: once you get feedback in the system, you can have oscillations, like two datanodes alternating between being marked as "stale" every other minute. Maybe each time client traffic goes to one, it slows it down enough that the heartbeats get marked as stale. Then the same process happens with the other one. So you might actually end up with less effective bandwidth than you would have had without the stale state.

          I guess whatever solution we come up with will have to be tested with some real world workloads to see what happens.

          Suresh Srinivas added a comment -

          whereas inbound client traffic won't cause cascading failures

          Can you describe why this would cause cascading failures?

          So you might actually end up with less effective bandwidth than you would have had without the stale state. I guess whatever solution we come up with will have to be tested with some real world workloads to see what happens.

          Heartbeats are processed fairly efficiently in the Namenode. I do not think we need to worry a lot about it. I have seen very few such stale heartbeats in really large clusters.

          I guess whatever solution we come up with will have to be tested with some real world workloads to see what happens.

          That is why there is a flag to turn off the feature, if one is not comfortable with it.

          Colin Patrick McCabe added a comment -

          Can you describe why [inbound client traffic] would cause cascading failures?

          If you have 1000 clients reading from a single block that's replicated 3x, marking one of the replicas as stale will have a big effect on the traffic coming into the other 2 replicas. At least theoretically, this could lead to either of them becoming stale due to heartbeat timeouts.

          Anyway, I think that this kind of scenario may be rare in practice. There are some major advantages to what you propose for applications like HBase, so I'm looking forward to seeing what you guys come up with.

          Nicolas Liochon added a comment -

          If more than a certain threshold T has passed since the namenode last processed a heartbeat from a datanode, consider it stale. For writes, do not use such datanodes (if possible). For reads, use such datanodes as the last datanode in the pipeline.

          That would do it. Used with HBase, we would likely set T to 60s by default (considering it ok for under-configured clusters), and 20s for production, performance-oriented clusters.

          If we do this, do we still need HDFS-3705 and HDFS-3706?

          Hum. When I wrote the jiras, I thought of HDFS-3705 as the short-term solution (branch 1) and HDFS-3703 as the long-term one (branch 2 or 3). In this case it still makes sense to keep HDFS-3705 as an optimisation. If I have to choose between HDFS-3705 and HDFS-3703 on branch-1, my vote goes to HDFS-3703.

          It seems like the assumption behind having a "third state" (between dead and alive) is that re-replication consumes a lot of bandwidth and could lead to cascading failures, whereas inbound client traffic won't cause cascading failures. Do we know for a fact that this is always true? Maybe it's true for certain use cases.

          Yes, it is also a specific situation: for write-ahead logs, on failure you don't really want to replicate the blocks; you want to find a valid replica, read it, use it, and forget the file. And the whole thing should last 2 minutes max (the ideal target being much less: 30s). So triggering replication does not help at all in this case. Moreover, if the datanode does not answer a ping, it's very likely to be just dead, so what HBase is looking for here is a better recovery time by going directly to the right datanode. Replicating simultaneously just adds workload, while HBase itself adds workload as well (HBase client retries + reading the WAL files + writing new ones + assigning regions to new servers, so cross-datanode reads). Any saved workload here is good.

          The only bad situation I can think about is:

          • rack1: namenode; datanode1
          • rack2: datanode2; client

          If the network between the two racks becomes saturated, the namenode will de-prioritize datanode2. So the client will be told to go to datanode1 while going to datanode2 would have been much better. But I think it's more the exception than the rule. In most cases, if a datanode cannot answer a ping, I prefer not to send new blocks to it.

          Would a "decommission DN immediately" NN API suffice for your use case?

          I don't really like it because from HBase I'm not sure of the datanode state. I guess it's dead, as the regionserver died, but I'm not sure. So decommissioning it from the client is a little bit aggressive; sometimes it's still there. And also it may come back. That's why we thought about 'lower priority': it seems safer.

          Kihwal Lee added a comment -

          It seems like the assumption behind having a "third state" (between dead and alive) is that re-replication consumes a lot of bandwidth and could lead to cascading failures, whereas inbound client traffic won't cause cascading failures. Do we know for a fact that this is always true? Maybe it's true for certain use cases.

          When the heartbeat interval is reduced, the effect on the possibility of having false positives in failure detection is global, whereas when clients are creating hotspots, the scope is limited. For the former, depending on where the network hiccup occurs, a large number of datanodes can be declared dead at once. To make things worse, they can come back and go away a few times while the issue is getting fixed. This can easily create a system-wide, severe performance degradation. We've also seen a case of a job with a huge number of tasks doing many random accesses to a single file, creating a hotspot and making a subset of datanodes slow. But its effect was far less devastating compared to the decommission storm.

          Kihwal Lee added a comment -

          I think there are differences in the definition of failures and the recovery/service semantics between HDFS and HBase. Most traditional HDFS use cases are better served by best-effort, eventual-success service semantics than by early fail-out. They would rather wait for a long in-disk error recovery, for example, than fail the datanode early.

          We all know HBase is different in this regard since it's a serving system. The three-layered failure propagation model nkeywal described above does not work well because the definition of failures is different. Also, the cluster health state is a result of distributed decision, which can take a long time. For this reason, even the most obvious failure modes such as fail-stop won't cause the global state to be updated immediately. If a time-sensitive client solely depends on this state from NN, it won't get satisfactory results.

          In addition to providing more hints from NN as Suresh suggested, I believe the client has to be smarter since there is a limit in the freshness of the health state that NN can guarantee at scale. Reading WAL during recovery can be made more predictable if the client is allowed to be more aggressive and proactive at the price of additional resource consumption and extra load. If its usage is limited and controlled, it should be acceptable to provide such a client implementation. I think HDFS-3705 and HDFS-3706 are along this line. I am very much interested in learning the HBase requirements and its wish list more.

          stack added a comment -

          I think there are differences in the definition of failures and the recovery/service semantics between HDFS and HBase.

          You mean when MR is the client as opposed to HBase?

          What Suresh suggests does not change failure processing. HBase would still "...rather wait for the long in-disk error recovery..." than fail a DN early.

          I want to push back on the suggestion that failures are different between the two systems. When a DN is dead, we want the replication and all that fancy stuff to kick in. The suggestion is that there not be this ten-minute binary chasm between 'good' DN and 'bad' DN. The NN has a bit of info on DN state, and it could do some soft pedaling in the case of a DN being on a cigarette break, or it's dead but hasn't stiffened yet.

          I'd think MR tasks would benefit from such a facility too.

          If a time-sensitive client solely depends on this state from NN, it won't get satisfactory results.

          That's what we have now? Or are you thinking that ten minutes is so long it's beyond the vagaries of GC or block-reporting holdups? (If so, to our way of thinking one minute is the equivalent of your ten minutes, and if the reaction is laggy sometimes, we can deal.)

          I think HDFS-3705 and HDFS-3706 are along this line.

          I think we need these too, but I think these will always be inferior given they have only the HBase client's perspective on the cluster. We can know of a RegionServer crash but not whether the DN is there or not. The NN has an inkling.

          I am very much interested in learning the HBase requirements and its wish list more.

          Regards DN failure detection time or in general? If the former, I think nkeywal has done a good job explaining our story.

          Sanjay Radia added a comment -

          HDFS-3705 and HDFS-3706 are client based. Suresh's suggestion (the T parameter) is server based.
          If the server-based approach is not aggressive enough for some applications then we can consider augmenting it with a client-based approach that passes in a parameter T for getBlocks and addBlock operations. The advantage of this alternate client-based approach over the original client-based approach proposed in HDFS-3705 and HDFS-3706 is that it uses the NN's view of which DNs are running, stale, or dead rather than the client's view of which nodes are running or dead.

          Nicolas Liochon added a comment -

          The 3 approaches all have advantages. I don't think they are exclusive:

          • pure server allows decisions to be taken globally, with a global view, and to be proactive vs. waiting for the client to complain.
          • hinted-server (the T parameter) allows some clients to be more aggressive, which could be interesting if the cluster is shared, for example.
          • pure client allows managing cases where the client knows that a node is down or very likely down (either because it already tried to use it or because, as with the HBase WAL, if the client is reading this file it's likely because the node is dead).
          Jing Zhao added a comment -

          With respect to Suresh's proposal:
          If more than a certain threshold T has passed since the namenode last processed a heartbeat from a datanode, consider it stale. For writes, do not use such stale datanodes (if possible). For reads, put such stale datanodes at the end of the list.

          In this strategy, since a small T for judging the stale state may generate new hotspots on the cluster, I propose that T be calculated as:
          T = t_c + (number of nodes already marked as stale) / (total number of nodes) * (T_d - t_c),
          where t_c is a constant value initially set in the configuration, and T_d is the time for marking a node as dead (i.e., 10.5 min).

          E.g., t_c can be set to 30s; then when no or few nodes are marked as stale, we have a small T that satisfies the HBase requirement. In case a large number of nodes are marked as stale, e.g., close to the total number of nodes, T will be almost T_d (i.e., ~10 min), and the workload can still be distributed across all the live nodes.
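          A small worked example of this adaptive formula (illustrative arithmetic only, not code from a patch):

          // Adaptive stale threshold: T = t_c + (staleNodes / totalNodes) * (T_d - t_c)
          public class AdaptiveStaleThresholdExample {
            static long adaptiveThresholdMs(long tcMs, long tdMs, int staleNodes, int totalNodes) {
              return tcMs + (long) ((double) staleNodes / totalNodes * (tdMs - tcMs));
            }
            public static void main(String[] args) {
              long tc = 30_000L;  // t_c: constant from the configuration, 30 seconds
              long td = 630_000L; // T_d: dead-node interval, 10.5 minutes
              System.out.println(adaptiveThresholdMs(tc, td, 5, 100));  // few stale nodes  -> 60000 ms, close to t_c
              System.out.println(adaptiveThresholdMs(tc, td, 90, 100)); // many stale nodes -> 570000 ms, close to T_d
            }
          }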

          Nicolas Liochon added a comment -

          For HBase, it would be good to have an option to cap this value, as HBase itself relies on ZooKeeper, and ZooKeeper has a fixed timeout. Imho, if a server does not respond to a ping for 30s, in 99% of the cases it won't be able to respond to a read or write request either. We could imagine slowing down the client if we detect that 20% of the cluster is missing... But today we're still optimizing for the simple situations, such as a few nodes missing...

          Jing Zhao added a comment -

          This patch handles the stale nodes for reading in a straightforward way. It adds two configuration parameters to indicate whether to detect stale nodes and the time interval for treating nodes as stale. The DatanodeManager#sortLocatedBlocks method then checks whether the datanodes are stale and moves possibly stale nodes to the end of the list.
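          A minimal sketch of the kind of reordering described here (illustrative only; the types below are simplified stand-ins, not the patch's DatanodeManager or DecomStaleComparator code, and they ignore decommissioned nodes):

          import java.util.Arrays;

          // Sketch: move stale replicas to the end of a block's location list,
          // so clients try fresh datanodes first.
          class StaleSortSketch {
            static class Replica {
              final String node;
              final long lastHeartbeatMs;
              Replica(String node, long lastHeartbeatMs) {
                this.node = node;
                this.lastHeartbeatMs = lastHeartbeatMs;
              }
            }

            static void sortStaleLast(Replica[] locations, long staleIntervalMs) {
              long now = System.currentTimeMillis();
              // Stable sort: fresh nodes keep their relative order and come first, stale nodes go last.
              Arrays.sort(locations, (a, b) -> Boolean.compare(
                  now - a.lastHeartbeatMs > staleIntervalMs,
                  now - b.lastHeartbeatMs > staleIntervalMs));
            }
          }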

          Nicolas Liochon added a comment -

          Hi Jing, Suresh,

          I'm currently testing the patch with HBase trunk. I rebased your patch on HDFS branch-2, as HBase does not yet build with HDFS trunk. I will keep you updated.

          Nicolas Liochon added a comment -

          Hi,

          I've done a test on HBase, with the minicluster. Basically it works. I've added a check on the write path as well.
          The test consists of stopping a datanode and a regionserver while keeping the socket open to get timeouts, on a minimal cluster (3 nodes, replication level 2), after having written 10 million lines. Without the fix, HBase needs 15 minutes to recover: it gets timeouts all over the place because it goes to the dead datanode. It starts to work only after the node is marked as dead in the NN. With the fix, it's done in 1 minute.

          I had to add the write path as well, because during the recovery HBase writes many small files (hundreds actually: that's splitting the logs). Without the check on the write path, we get directed to the dead DN for all the files.

          Attached: the original patch rebased on branch-2 + the write path.

          Ted Yu added a comment -
          +  public static final String DFS_DATANODE_STALE_STATE_ENABLE_KEY = "dfs.datanode.stale.enable";
          

          The above controls stale state detection. How about naming it "dfs.datanode.stale.detection.enable" ?

          +  public static final String DFS_DATANODE_STALE_STATE_INTERVAL_KEY = "dfs.datanode.stale.interval";
          

          Similar here: dfs.datanode.stale.detection.interval ?

          Nicolas Liochon added a comment -

          FYI, I've got an issue when trying with HBase on a real cluster, by unplugging one of the machines while a file is open for writing: the method used to get the file length (DFSInputStream#readBlockLength) uses the dead datanode. I'm having a look at this.

          Suresh Srinivas added a comment -

          Nicolas, let's open a separate jira for the issue you mentioned related to DFSInputStream#readBlockLength instead of addressing it in this jira.

          As regards the patch in this jira, here are my thoughts:

          1. For the read side the patch is straightforward. We have the list of datanodes where the block is; we re-order it based on liveness.
          2. However, for the write side, not picking the stale node could result in an issue, especially for small clusters. That is the reason why I think we should do the write-side changes in a related jira. We should consider making the stale timeout adaptive to the number of nodes marked stale in the cluster, as discussed in the previous comments. Additionally, we should consider having a separate configuration for writes skipping the stale nodes.

          Some early comments on the patch:

          1. Typo: compartor
          2. Add annotation @InterfaceAudience.Private to the DecomStaleComparator class
          3. Default stale period could be a bit longer: 30s. Again I know this is arbitrary, but I still prefer a longer timeout.
          4. Instead of BlockPlacementPolicyDefault#skipStaleNodes, rename it to checkForStaleNodes. Currently the variable name means the opposite of what it is used for.
          5. Can you add a description of what stale means in the javadoc for DatanodeInfo#isStale()? Add a pointer to the configuration that decides the stale period.
          6. DFS_DATANODE_STALE_STATE_ENABLE_KEY should be named DFS_NAMENODE_CHECK_STALE_DATANODE_KEY. (The DFS_NAMENODE prefix means it is used by the namenode.) Change the value to dfs.namenode....
          7. DFS_DATANODE_STALE_STATE_INTERVAL_KEY should be named DFS_NAMENODE_STALE_DATANODE_INTERVAL_KEY. Change the value to dfs.namenode...
          8. "node is staled" to "node is stale". In the same debug message, it is a good idea to print the time since last update. This should help in debugging.
          9. Why reset to the default value if the value is smaller? We should just print a warning and continue.
          10. Why add the public method DatanodeManager#setCheckStaleDatanodes()?
          11. Instead of making setHeartbeatsDisabledForTests public, you could provide access to that method using DatanodeTestUtils
          12. Please add a description for the newly added properties in hdfs-default.xml and how they are used.

          I have not reviewed the tests yet.

          Jing Zhao added a comment -

          Nicholas, Ted, Suresh, thanks a lot for the feedback and comments! I first separated the writing part from the current patch, and I'm working on a possibly more adaptive writing solution. I've also addressed Suresh's and Ted's comments.

          Ted Yu added a comment -
          +            + ", which is smaller than the minimal value "
          +            + DFSConfigKeys.DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT
          +            + ". Reset to the default value.");
          

          DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT is no longer the minimum, it is the default.
          We're not resetting to default anymore.

          +        Arrays.sort(b.getLocations(), new DFSUtil.DecomStaleComparator(
          +            staleInterval));
          

          Do we have to instantiate a new comparator every time?

          Nicolas Liochon added a comment -

          Hi,

          For the DFSInputStream#readBlockLength issue, I think I got it: the LocatedBlocks class contains two sub-structures:

          • a list of LocatedBlocks
          • a lastLocatedBlocks for the block under construction

          Today, the patch does the reordering only for the first structure; it should be done for the second one as well. I tested it (by hacking, not directly in HDFS) in HBase, and I didn't get any network errors in my test.

          To me, it's a part of the read path, so it should be included in this patch.

          I'm OK with creating another JIRA for the write path; I will do it.

          Thank you all for your work!

          Nicolas

          Jing Zhao added a comment -

          Thanks for the comments, Ted. I've changed the debugging info. For the comparator: because we may want to change the stale interval dynamically for the writing-version implementation, the stale interval is currently passed to the comparator as a constructor parameter, and we simply create a new comparator instance every time we do the comparison.

          Ted Yu added a comment -

          The retrieval of the value for staleInterval is in the DatanodeManager ctor.
          I am not familiar with the trunk code base. Does this mean that DatanodeManager can be configured at runtime?

          Jing Zhao added a comment -

          So after the DatanodeManager loads the initial stale interval value from the configuration, we plan to change the value of the stale interval dynamically, based on the number of nodes marked stale in the cluster (the details are discussed in the previous comments). That will be at runtime.

          Ted Yu added a comment -

          based on the number of nodes marked stale in the cluster

          So this is a cluster-wide setting. I think a singleton can be used here: we just need to update the comparator's staleInterval.

          Suresh Srinivas added a comment -

          So this is a cluster-wide setting. I think a singleton can be used here: we just need to update the comparator's staleInterval.

          Ted, I am not sure this is going to be such a big issue. It is a small object that is created and perhaps garbage collected.

          Jing Zhao added a comment -

          Nicolas: I'm also looking at the lastLocatedBlocks part. By the way, could you please share more details about the DFSInputStream#readBlockLength problem (such as the exception information)?

          Nicolas Liochon added a comment -

          @Jing

          Sure. The scenario is:

          • open a file for writing, write some data, keep the file open
          • kill one of the datanodes storing one of the replicas of the last block (e.g. unplug the computer from the network).
          • open the file for reading on another machine

          During the "open", he DFSClient will try to get the file length, this requires to connect (via ipc.Client) to one of the datanodes owning the last block. If you're going to the dead datanodes, the ipc call will fail. It will retry ten times (by default), and will log the error (in my case, it was a noRouteToHost, but it could be socketTimeout or alike).

          In HBase, we have this when we try to read the Write-Ahead-Log after a regionserver crash.
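          A rough sketch of those reproduction steps as client code (assumed usage of the standard FileSystem API; the path is hypothetical, and killing the datanode is a manual step outside the program):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class ReadBlockLengthRepro {
            public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              Path p = new Path("/tmp/wal-like-file"); // hypothetical path
              // 1. Open a file for writing, write some data, keep it open (no close).
              FSDataOutputStream out = fs.create(p);
              out.writeBytes("some data");
              out.hflush();
              // 2. Externally: kill/unplug a datanode holding a replica of the last block.
              // 3. From another machine, open the file for reading; the DFSClient must ask a
              //    datanode for the last block's length (DFSInputStream#readBlockLength) and
              //    retries if it is directed to the dead datanode first.
              fs.open(p).close();
            }
          }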

          Suresh Srinivas added a comment -

          Some comments on the tests in the patch: testSortLocatedBlocks() - why do you need MiniDFSCluster for this test? This test can be deleted, given it is covered by testReadSelectNonStaleDatanode.

          If you modify testReadSelectNonStaleDatanode so that an incomplete file is written (no close called), the test that Nicolas was talking about can be implemented.

          Jing Zhao added a comment -

          I wrote test code (not included in the patch) to reproduce the scenario that Nicolas reported. I modified FSNamesystem#getBlockLocations, which now adjusts the target datanode sequence for the lastLocatedBlocks based on their stale states. I also added corresponding test code for the modification.

          Ted Yu added a comment -
            Arrays.sort(b.getLocations(), new DFSUtil.DecomStaleComparator(staleInterval));

          I think we can lift the comparator construction out of the for loop.
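
          A rough sketch of what that would look like (the surrounding names - blocks, staleInterval - are assumed from the patch under review, not quoted from it; java.util.Comparator and the HDFS protocol classes are assumed to be imported):

            // Construct the comparator once and reuse it for every block,
            // instead of allocating a new one per loop iteration.
            Comparator<DatanodeInfo> comparator =
                new DFSUtil.DecomStaleComparator(staleInterval);
            for (LocatedBlock b : blocks.getLocatedBlocks()) {
              Arrays.sort(b.getLocations(), comparator);
            }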

          Jing Zhao added a comment -

          Thanks Ted! I've addressed your comment and moved the comparator construction out of the for loop.

          Suresh Srinivas added a comment -

          I think we can lift the comparator construction out of the for loop.

          Good catch

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12544677/HDFS-3703-trunk-read-only.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestPread
          org.apache.hadoop.hdfs.TestDFSUpgradeFromImage
          org.apache.hadoop.hdfs.TestDFSPermission
          org.apache.hadoop.hdfs.TestLeaseRecovery
          org.apache.hadoop.hdfs.TestFileCreation
          org.apache.hadoop.hdfs.security.token.block.TestBlockToken
          org.apache.hadoop.fs.TestHDFSFileContextMainOperations
          org.apache.hadoop.hdfs.server.namenode.TestBlockUnderConstruction
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestDistributedFileSystem
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA
          org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestHDFSFileSystemContract

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3170//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3170//console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          Some of the failures I checked appear to be related to this patch.

          Additionally, there is a lot of code duplication in the added tests. Could you try to reduce it? Perhaps extract a method with a closeFile flag and add two tests, one calling it with closeFile set to true and the other with closeFile set to false?
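
          A rough sketch of the suggested structure (method and field names here are illustrative, not taken from the actual patch; the shared helper is assumed to live in a JUnit test class where fs and the cluster are already set up):

            private void writeFileAndCheckStaleOrdering(boolean closeFile) throws Exception {
              FSDataOutputStream out = fs.create(new Path("/testStaleRead"));
              out.write(new byte[1024]);
              out.hflush();
              if (closeFile) {
                out.close();
              }
              // ... mark one datanode stale, call getBlockLocations(), and
              // assert that the stale datanode is ordered last ...
            }

            @Test
            public void testReadSelectNonStaleDatanodeClosedFile() throws Exception {
              writeFileAndCheckStaleOrdering(true);
            }

            @Test
            public void testReadSelectNonStaleDatanodeOpenFile() throws Exception {
              writeFileAndCheckStaleOrdering(false);
            }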

          Jing Zhao added a comment -

          Corrected an error in the previous patch, which did not check whether the last located block is null.

          Jing Zhao added a comment -

          Also combine the two test cases in TestGetBlocks.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12544728/HDFS-3703-trunk-read-only.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestDatanodeBlockScanner
          org.apache.hadoop.hdfs.TestPersistBlocks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3174//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3174//console

          This message is automatically generated.

          Jing Zhao added a comment -

          The two test failures are mentioned in HDFS-3811 and HDFS-3902, and seem to be unrelated to the current patch.

          Ted Yu added a comment -

          @Jing:
          If backport to 1.0 is needed and you're busy, I can give it a try.

          Nicolas Liochon added a comment -

          Well, I think it would be quite useful on v1 as well, like HDFS-3912: there is nothing to do in HBase to benefit from it, so the more HDFS versions it goes into, the better.

          Jing Zhao added a comment -

          Ted: Thanks very much for volunteering! And please go ahead.

          Ted Yu added a comment -

          In hadoop 1.0, I don't see the following code in FSNamesystem.getBlockLocations():

          -      // Move decommissioned datanodes to the bottom
          -      Arrays.sort(b.getLocations(), DFSUtil.DECOM_COMPARATOR);
          

          Should the backport bring in the above as well?

          Suresh Srinivas added a comment -

          Should the backport bring in the above as well?

          Decom semantics are different in branch-1. So please skip that part and only include the part from this patch.

          Suresh Srinivas added a comment -

          +1 for the trunk patch.

          Ted Yu added a comment -

          @Suresh:
          If I understand your comment correctly, I should rename DFSUtil.DecomStaleComparator to DFSUtil.StaleComparator and omit the decommission status comparison.
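
          A hypothetical shape for such a comparator (assuming DatanodeInfo exposes an isStale(staleInterval) check as in the trunk patch, and that java.util.Comparator is imported; the real backport may look different):

            public static final class StaleComparator implements Comparator<DatanodeInfo> {
              private final long staleInterval;

              public StaleComparator(long staleInterval) {
                this.staleInterval = staleInterval;
              }

              @Override
              public int compare(DatanodeInfo a, DatanodeInfo b) {
                // Non-stale nodes sort before stale ones; ties keep their relative order.
                boolean aStale = a.isStale(staleInterval);
                boolean bStale = b.isStale(staleInterval);
                return aStale == bStale ? 0 : (aStale ? 1 : -1);
              }
            }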

          Suresh Srinivas added a comment -

          If I understand your comment correctly, I should rename DFSUtil.DecomStaleComparator to DFSUtil.StaleComparator and omit the decommission status comparison.

          Yes.

          Jing Zhao added a comment -

          Made some minor changes to clean up the (test) code.

          Ted Yu added a comment -

          Attaching a first patch for hadoop 1.0.
          I don't see how the heartbeat can be temporarily disabled for a DataNode, so I commented out the related calls.
          I currently get:

          Testcase: testReadSelectNonStaleDatanode took 0.879 sec
            Caused an ERROR
          Directory /home/hduser/1-hadoop/build/test/data/dfs/name1 is in an inconsistent state: storage directory does not exist or is not accessible.
          org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /home/hduser/1-hadoop/build/test/data/dfs/name1 is in an inconsistent state: storage directory does not exist or is not accessible.
            at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:304)
            at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:104)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:410)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:378)
            at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:277)
            at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:529)
            at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1403)
            at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:278)
          
          Jing Zhao added a comment -

          Re-upload the patch for trunk to enable QA.

          Ted Yu added a comment -

          Should 1.0.4 be put back as a Fix Version?

          Ted Yu added a comment -

          I noticed the following in FSNamesystem of hadoop 1.0:

              if (conf.getBoolean("dfs.support.append", false)) {
                LOG.warn("The dfs.support.append option is in your configuration, " +
                         "however append is not supported. This configuration option " +
                         "is no longer required to enable sync.");
              }
          

          Since fileSys.append() is called in Jing's test, I am not sure how the test should be ported to hadoop 1.0.

          Advice is welcome.

          Jing Zhao added a comment -

          Ted: in the current patch for trunk I removed the append part from the test. I guess we only need to do a write-without-close operation.
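
          A minimal sketch of such a write-without-close (trunk/2.x API; on branch-1 the equivalent of hflush() would be the older sync() call, and the fs/path names here are illustrative):

            FSDataOutputStream out = fs.create(new Path("/testStaleRead"));
            out.write(new byte[1024]);
            // Flush the data so it is visible to readers, but keep the file open
            // (no close()), leaving the last block under construction.
            out.hflush();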

          Ted Yu added a comment -

          What about disabling the DataNode heartbeat?
          I guess that should be pulled into hadoop 1.0.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12544886/HDFS-3703-trunk-read-only.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestDatanodeBlockScanner
          org.apache.hadoop.hdfs.TestHDFSFileSystemContract

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3182//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3182//console

          This message is automatically generated.

          Jing Zhao added a comment -

          Ted: maybe we can also set the datanode heartbeat interval to a relatively large value? Then, after we mark one of the datanodes as stale (by changing its lastUpdate time), a new heartbeat from that node will not come in and reset its state.
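
          A small sketch of that idea (illustrative only; it assumes the heartbeat interval has been raised, e.g. conf.setLong("dfs.heartbeat.interval", 600), and that the test can reach the namenode's DatanodeDescriptor for the target node):

            // Make a registered datanode look stale to the namenode without
            // stopping it; a later real heartbeat would undo this, hence the
            // long heartbeat interval above.
            static void markStale(DatanodeDescriptor node, long staleIntervalMs) {
              node.setLastUpdate(System.currentTimeMillis() - (staleIntervalMs + 1));
            }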

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.TestFsck

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3183//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3183//console

          This message is automatically generated.

          Ted Yu added a comment -

          Looking at how the DataNode heartbeat is disabled in trunk, I think we can modify the following if block in DataNode.offerService():

                  if (startTime - lastHeartbeat > heartBeatInterval) {
          

          so that the heartbeat can be disabled for tests.

          In the future, when more tests that involve disabling the heartbeat are backported, they can utilize this facility.

          Advice is welcome.
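
          A sketch of the kind of test hook this could take in the branch-1 DataNode (the field and setter names here are hypothetical; trunk has its own mechanism for suppressing heartbeats in tests):

            // Volatile so the offerService() loop sees updates from the test thread.
            volatile boolean heartbeatsDisabledForTests = false;

            void setHeartbeatsDisabledForTests(boolean disabled) {
              this.heartbeatsDisabledForTests = disabled;
            }

            // In offerService(), the existing check would then become:
            //   if (!heartbeatsDisabledForTests
            //       && startTime - lastHeartbeat > heartBeatInterval) { ... send heartbeat ... }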

          Suresh Srinivas added a comment -

          +1 for the trunk patch. I committed it to trunk. Thank you Jing for the code, and Nicolas for reporting the issue, helping to solve it, and testing the solution.

          Suresh Srinivas added a comment -

          @Ted, alternatively we could just shut down the datanode instead of disabling the heartbeat in 1.x, leaving the rest of the test as is.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #2791 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2791/)
          HDFS-3703. Datanodes are marked stale if heartbeat is not received in configured timeout and are selected as the last location to read from. Contributed by Jing Zhao. (Revision 1384209)

          Result = SUCCESS
          suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1384209
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeInfo.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestGetBlocks.java
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #2728 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2728/)
          HDFS-3703. Datanodes are marked stale if heartbeat is not received in configured timeout and are selected as the last location to read from. Contributed by Jing Zhao. (Revision 1384209)

          Result = SUCCESS
          suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1384209
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeInfo.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestGetBlocks.java
          Suresh Srinivas added a comment -

          Merged the change to branch-2 as well.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #2752 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2752/)
          HDFS-3703. Datanodes are marked stale if heartbeat is not received in configured timeout and are selected as the last location to read from. Contributed by Jing Zhao. (Revision 1384209)

          Result = FAILURE
          suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1384209
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeInfo.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestGetBlocks.java
          Ted Yu added a comment -

          I think disabling the datanode heartbeat simulates a GC pause, etc.
          Shutting down the datanode and restarting it would change its internal state.
          It looks like disabling the heartbeat is useful in test cases.

          What do you think?

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1164 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1164/)
          HDFS-3703. Datanodes are marked stale if heartbeat is not received in configured timeout and are selected as the last location to read from. Contributed by Jing Zhao. (Revision 1384209)

          Result = FAILURE
          suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1384209
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeInfo.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestGetBlocks.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1195 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1195/)
          HDFS-3703. Datanodes are marked stale if heartbeat is not received in configured timeout and are selected as the last location to read from. Contributed by Jing Zhao. (Revision 1384209)

          Result = SUCCESS
          suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1384209
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeInfo.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestGetBlocks.java
          Suresh Srinivas added a comment -

          I think disabling the data node heart beat simulates GC pause, etc. Shutting down data node and restarting would change its internal state.

          The changes here mainly test the state of the datanode as seen by the namenode; the real state of the datanode is immaterial.

          In the trunk version of the patch, Jing chose to disable the heartbeat after marking the datanode stale. This ensures that the stale datanode does not send a heartbeat, become live again, and cause a test failure. Shutting down the datanode essentially does the same thing: a datanode that is shut down will not send a heartbeat, and the test finishes within the 10 minutes after which the datanode would be marked dead. This accomplishes what the trunk test does with the least amount of change.

          That said, if you think it is straightforward to add the capability to disable heartbeats at the datanode, do go ahead.

          Nicolas Liochon added a comment -

          Thanks a lot, all. I can't wait to have hardware issues in production to see the difference it makes.
          Ted, I can redo my test on a real cluster with your patch when it's ready, if you want.

          Let's not forget our second leg, the write path. I will comment in HDFS-3912.

          Ted Yu added a comment -

          @N:
          My patch needs some more work in the unit test.
          I think Jing is taking care of that part.

          Thanks

          Suresh Srinivas added a comment -

          We are planning to finish this patch and make it available in release 1.1, with an RC due out tomorrow.

          Jing Zhao added a comment -

          Patch for branch 1.1.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12545088/HDFS-3703-branch-1.1-read-only.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3188//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          Hi Jing, I tried to commit the 1.1 patch but the new test (testReadSelectNonStaleDatanode) failed. Could you take a look?

          Jing Zhao added a comment -

          It seems that in the new test case, when creating the MiniDFSCluster, I need to set the format parameter to true to avoid the exception that Ted posted.
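
          For reference, a branch-1 style constructor call with the format flag set (the third argument; the exact signature may differ between branches):

            // format=true wipes and re-creates the name directories before starting.
            MiniDFSCluster cluster = new MiniDFSCluster(conf, 3, true, null);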

          Tsz Wo Nicholas Sze added a comment -

          That's correct. The MiniDFSCluster has to be formatted first.

          +1, the branch-1 patch looks good.

          Tsz Wo Nicholas Sze added a comment -

          I have committed the branch-1 patch. Thanks, Jing!

          Matt Foley added a comment -

          Closed upon release of Hadoop-1.1.0.

          Varun Sharma added a comment -

          I have a question.

          What happens if a client tries to read a block which is in UNDER_RECOVERY state - do we continue to avoid stale nodes?

          Thanks
          Varun

          Jing Zhao added a comment -

          For reads, the NN still returns the stale nodes to the client. The only change is the ordering of the DN list (the stale nodes are moved to the end of the list).

          Varun Sharma added a comment -

          I am actually seeing an interesting race condition for under-recovery blocks...

          A block is being written to and the DN server is lost (no route to it). Stale timeout = 20 seconds.

          a) The DN holds a lease which expires in 60 seconds
          b) The block is moved to UNDER_RECOVERY state since it was still being written to using the append api
          c) The recovery chooses a primary datanode, which happens to be the stale datanode, and issues a recover-block command for this partially written block
          d) This does not succeed, so other DN(s) are then tried. During recovery, the primary DN's job is to reconcile the replicas on all 3 datanodes. Since the stale check does not seem to kick in here, this primary DN also tries to reconcile the lost DN and times out after 15 minutes (20 * 45 retries).

          The same thing happens over and over, and the recovery fails. The block seems to be lost, IMHO.

          In the meantime a client tries to recover the lease on this file+block; it gets it after the 60-second expiration and tries to read the block, and is eventually redirected to the bad datanode even though this is more than 20 seconds after the failure.

          Wouldn't it be nice to fix this issue so that the primary DN is never the stale DN, and so that the reconciliation does not involve the bad DN, since the recovery never truly completes?

          Thanks
          Varun

          Varun Sharma added a comment -

          Do you know whether, for a block that is not FINALIZED and is in UNDER_CONSTRUCTION state, the namenode returns all 3 datanodes or just the 1st datanode in the pipeline (the block on the datanode side being in the RBW, replica-being-written, state)?

          Jing Zhao added a comment -

          If the block is still under construction, the namenode will still return all the expected datanodes to the client. Only if the block has some corrupt replicas will those corrupt DNs be excluded (though all of the corrupt DNs will be returned if all replicas are corrupt).

          Varun Sharma added a comment -

          Thanks, Jing.

          Does this hold for UNDER_RECOVERY blocks as well? Also, if we have HDFS-3703, we would be ordering such that the stale node is last...


            People

            • Assignee:
              Jing Zhao
            • Reporter:
              Nicolas Liochon