Hadoop HDFS
  HDFS-2936

File close()-ing hangs indefinitely if the number of live blocks does not match the minimum replication

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.23.0
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None

      Description

      If an admin wishes to enforce replication today for all users of their cluster, they may set dfs.namenode.replication.min. This property prevents users from creating files with a replication factor lower than the configured minimum.

      However, the minimum replication set by this property is also checked at several other points, most notably during completeFile (close) operations. If a write's pipeline ends up with fewer than the minimum number of nodes, the completeFile operation does not close the file and the client hangs, waiting for the NN to replicate the last under-replicated block in the background. This form of hard guarantee can, for example, bring down HBase clusters during high xceiver load on the DataNodes, disk fill-ups on many of them, and so on.
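      To make the failure mode concrete, the sketch below is a hypothetical reproduction using MiniDFSCluster (it is not part of this issue's attachments, and the path name is made up): with the minimum replication set to 3 but only two DataNodes alive, the final close() blocks indefinitely.

        // Hypothetical repro sketch, not part of the attached patch: with
        // dfs.namenode.replication.min = 3 but only two DataNodes alive, the
        // close() below never returns because completeFile keeps failing.
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.hdfs.MiniDFSCluster;

        public class MinReplicationHangRepro {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("dfs.namenode.replication.min", 3);
            MiniDFSCluster cluster =
                new MiniDFSCluster.Builder(conf).numDataNodes(2).build();
            try {
              FileSystem fs = cluster.getFileSystem();
              FSDataOutputStream out = fs.create(new Path("/repro"), true, 4096,
                  (short) 3, fs.getDefaultBlockSize());
              out.write(new byte[1024]);
              out.close();   // hangs: only 2 live replicas, 3 required to complete
            } finally {
              cluster.shutdown();
            }
          }
        }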

      I propose we split the property into two parts:

      • dfs.namenode.replication.min
        • Keeps the same name, but is checked only against the file's replication factor at creation time and during adjustments made via setrep, etc.
      • dfs.namenode.replication.min.for.write
        • A new property that takes over the remaining checks, such as those done during block commit, file complete/close, safemode checks for block availability, etc.

      Alternatively, we could remove the client-side hang by giving completeFile/close calls a fixed number of retries. This would require further discussion about how a failed file closure ought to be handled.
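      As an illustration of the split proposed above (a sketch only; dfs.namenode.replication.min.for.write is the name proposed in this issue and does not exist in any release), an admin who wants a hard user-facing floor of 2 while still letting writes close with a single live replica might configure:

        // Sketch of the proposed split; the ".for.write" key is hypothetical.
        import org.apache.hadoop.conf.Configuration;

        public class MinReplicationSplitExample {
          public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Checked only against the file's replication factor at create/setrep time.
            conf.setInt("dfs.namenode.replication.min", 2);
            // Checked against live replicas during block commit and completeFile/close.
            conf.setInt("dfs.namenode.replication.min.for.write", 1);
          }
        }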

        Activity

        Harsh J added a comment -

        I had this done around submission time, but lost my changes when my Mac's disk crashed. I redid the separation logic over the weekend.

        Briefly, for those interested in reviewing it, the patch makes these changes:

        • The role of the current dfs.namenode.replication.min has been changed to apply restrictions only at the user level: file-creation replication factors and replication-factor adjustments.
        • A new property, dfs.namenode.replication.min.for.write, applies to all write-path conditions, such as adding a block, closing a block, etc. The former property used to control these layers as well; I've split them out into a separate property in case such hard guarantees aren't required beyond the user-layer restrictions.
        • There were no tests for min-replication, so I added those to TestFileCreation.
        • I added a few regression tests for this prop-split to TestReplication.

        This patch is a WIP: while it tests the good conditions, it still needs a test for the bad/violation conditions. The last test in TestReplication needs more work before it can run reliably (and not hang at shutdown). The issue it faces is that the DFSClient.close() call never exits if it can't close the file for min-replication reasons. It also happily swallows InterruptedException, making it difficult to write a test with wait-for-X-seconds-then-interrupt conditions. However, I'll find another way and fix that shortly (unless we decide to limit the completeFile retries, which are infinite at the moment and attempted every 0.4 seconds).
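        For reference, a bounded retry could look roughly like the sketch below. It is illustrative only: the BooleanSupplier stands in for the NN completeFile RPC, and the retry count and helper names are assumptions, not current DFSClient code.

          // Illustrative shape of a bounded completeFile retry; names and the
          // BooleanSupplier stand-in for the NN RPC are assumptions, not HDFS code.
          import java.io.IOException;
          import java.io.InterruptedIOException;
          import java.util.function.BooleanSupplier;

          public class BoundedCompleteRetry {
            static void completeWithRetries(BooleanSupplier completeCall, String src)
                throws IOException {
              final int maxRetries = 10;     // instead of retrying forever
              final long intervalMs = 400L;  // today's loop sleeps ~0.4s per attempt
              for (int i = 0; i < maxRetries; i++) {
                if (completeCall.getAsBoolean()) {
                  return;                    // NN accepted the close
                }
                try {
                  Thread.sleep(intervalMs);
                } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();  // don't swallow the interrupt
                  throw new InterruptedIOException("Interrupted while closing " + src);
                }
              }
              throw new IOException("Could not complete " + src + " after "
                  + maxRetries + " attempts");
            }
          }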

        Colin Patrick McCabe added a comment -

        Hi Harsh,

        This seems like a good idea to me.

        Do you think that dfs.namenode.replication.soft.min would be a better term for it? Both of the minimum values are "for write," after all.

        Eli Collins added a comment -

        Why not just check the replication level at file creation/setrep time and fail if it's lower than the newly configured value? No BlockManager changes, etc. This would require users to setrep all existing files to the new minimum replication level when they set it, but that seems acceptable.

        Harsh J added a comment -

        Eli,

        That's already available via the current dfs.namenode.replication.min. The problem I'm trying to address is that close() will hang if that many copies of a block aren't live.

        For instance, if I create a file with 3-rep but get only 2 DNs in the pipeline due to load or failures during my block writes, then when I call close() it will hang indefinitely because there are only two replicas. Hence I have broken the property into two parts: one that controls creation only, as you describe, and another that controls actual writes, which may be relaxed down to 1 to allow files to close with one or two replicas if the situation demands it (the NN takes care of under-replication later anyway).

        Let me know if this clears up the confusion. Also check DFSClient.close() to see the hang loop I'm talking about, entered when NN#completeFile returns false.
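        For anyone who doesn't want to dig through the source, the loop being described has roughly this shape (an illustrative sketch with a stand-in for the completeFile RPC, not the actual DFSClient code):

          // Illustrative only: if completeFile keeps returning false, the client
          // retries forever with a short sleep, so close() never returns.
          import java.util.function.BooleanSupplier;

          public class CompleteFileHangShape {
            static void closeLoop(BooleanSupplier completeFileRpc) {
              boolean fileComplete = false;
              while (!fileComplete) {              // no retry cap: loops until success
                fileComplete = completeFileRpc.getAsBoolean();
                if (!fileComplete) {
                  try {
                    Thread.sleep(400L);            // ~0.4s between attempts
                  } catch (InterruptedException ignored) {
                    // swallowed, which is part of what makes testing this hard
                  }
                }
              }
            }
          }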

        Colin Patrick McCabe added a comment -

        I'm curious how common it is to only get 2 DNs in the pipeline when you request 3. What kind of conditions lead to this problem and could we address those?

        Tsz Wo Nicholas Sze added a comment -

        Could you define dfs.namenode.replication.min and dfs.namenode.replication.min.for.write for the proposed change? You may have described them in your previous comments, but it is not clear. It would be great if you could list which one is used in each operation.

        Eli Collins added a comment -

        For instance, I create file with 3-rep, but get only 2 DNs in pipeline due to load or failure for my block writes, then when I call close(), it will hang infinitely cause there's only two replicas.

        So the bug here isn't that we need a better way to specify an HDFS-wide min replication - we already have dfs.namenode.replication.min. Seems like this is a bug: close shouldn't hang if we have fewer than dfs.namenode.replication.min replicas.

        Harsh J added a comment -

        Colin:

        The reasons a DN gets excluded from a client are DN-side errors (network goes down, DN goes down), disk fill-up, or xceiver load fill-up causing a DN to remain unchosen and thereby lowering the total number of choosable DNs in the cluster, etc. The simplest condition to think of is: all DNs are very busy except two, while my minimum replication requirement for writes is 3. I could technically be allowed to write two replicas and leave the rest to the under-replication handler for later, but there's no way to do that today.

        Eli:

        I honestly think min-replication is too hard on users. People already write 3-replica files with a 1-replica minimum today (i.e. write/close passes if only one replica got written successfully), and an admin should have a way to enforce a minimum replication factor simply, without side effects, in a way that just works.

        But yes, the problems I've observed so far were all with FSDataOutputStream#close() (sorry, not DFSClient.close(); that was a quick reference).

        Nicholas,

        Done. Please let me know if the current title and description are satisfactory.

        Tsz Wo Nicholas Sze added a comment -

        Hi Harsh,

        Thanks for updating the description. Are you suggesting to use dfs.namenode.replication.min for client-side check and use dfs.namenode.replication.min.for.write for server-side check?

        BTW, "File close()-ing hangs indefinitely if the number of live blocks does not match the minimum replication" is the original design of dfs.namenode.replication.min. I think we should not change it.

        Harsh J added a comment -

        Nicholas,

        Thanks for updating the description. Are you suggesting to use dfs.namenode.replication.min for client-side check and use dfs.namenode.replication.min.for.write for server-side check?

        Sort of. The former (or whatever replaces it) should only check the file's replication factor short value, which is applied/changed during create/setReplicationFactor alone - not the live block count. This is still a server-side check; client-side checks would be of no use to an admin.

        BTW, "File close()-ing hangs indefinitely if the number of live blocks does not match the minimum replication" is the original design of dfs.namenode.replication.min. I think we should not change it.

        True, that was the intention. A non-behavior-changing patch could also be made (wherein the default of the for.write property would be whatever the original min property is). But let's at least provide a way for admins to enforce minimum replication factors on files, without having to worry about pipelines and whatnot - if an admin so wishes.

        Setting dfs.replication to final does not work, because there are create() API calls and setrep() calls that bypass/disregard that config. Essentially that's what led us down this path - to use the minimum, but only at the metadata level, not the live-block level (as it is today).
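        To illustrate that last point (a sketch with made-up paths, not code from the patch): even if dfs.replication is marked final, the create() and setReplication() APIs accept an explicit replication factor that is honoured regardless of the client-side default.

          // Sketch: explicit per-call replication factors bypass the configured
          // default, so a "final" dfs.replication is not an enforcement mechanism.
          // Run against an HDFS URI; the path below is purely illustrative.
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class FinalReplicationBypass {
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();  // dfs.replication=3, even if final
              FileSystem fs = FileSystem.get(conf);
              // The explicit (short) 1 argument overrides the configured default:
              fs.create(new Path("/tmp/one-replica"), true, 4096, (short) 1,
                  fs.getDefaultBlockSize()).close();
              // setReplication() can likewise lower it after the fact:
              fs.setReplication(new Path("/tmp/one-replica"), (short) 1);
            }
          }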

        Harsh J added a comment -

        Nicholas,

        Would it be acceptable to you if I instead let dfs.replication.min behave the same way it does today (per its description, preventing closure of files until replication has been achieved) and introduce the new property under a different name, applying only to the file creation/setrep limits, thereby separating the two?

        Harsh J added a comment -

        (i.e. essentially reversing what is in the patch, config-wise, to address your comments)

        Harsh J added a comment -

        I've locally finished making the switch, but I still need to figure out how to write the file-hang test. Once it hangs, my test does not currently recover. As soon as I have that figured out, I'll post another version up for review.

        Zesheng Wu added a comment -

        Hi Harsh J, any conclusions on this issue? I've encountered the same situation recently.

        Ravi Prakash added a comment -

        Thanks Harsh for this JIRA! I would go a different route on this. To me as a user, the min-replication count means "it will take that many failures to lose data". That is a simple concept to reason about. If we create a separate config that applies only to the write pipelines, there is a window during which my assumption is not valid (the time it takes for the NN to order that replication), and it makes the concept slightly more complex to understand.

        I would suggest that we should fix the write pipeline to contain the minimum replication count and that the client should wait until that happens. I realize that might be a much bigger change.


          People

          • Assignee: Harsh J
          • Reporter: Harsh J
          • Votes: 0
          • Watchers: 11
