Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.6, 1.7.2, 1.8.0
    • Component/s: tserver
    • Labels:
      None

      Description

      It should be possible to manually roll WALs so that files on decommissioning datanodes are closed and the decommissioning process can complete. At the very least, the logs could be closed after an elapsed period of time, such as an hour.

          Activity

          elserj Josh Elser added a comment -

          1.6.5/1.7.1 triage: good to bump, Eric Newton?

          elserj Josh Elser added a comment -

          Eric Newton I will push this out of 1.7.1 unless I hear otherwise today.

          githubbot ASF GitHub Bot added a comment -

          GitHub user dlmarion opened a pull request:

          https://github.com/apache/accumulo/pull/84

          ACCUMULO-4004: Add new property for WALog max age, close log when age is reached.

          Didn't see any tests, anyone know where they are?

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/apache/accumulo ACCUMULO-4004

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/accumulo/pull/84.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #84


          commit f7d22c395a27ced9fa382b44162371d8b464988d
          Author: Dave Marion <dlmarion@apache.org>
          Date: 2016-03-29T20:02:38Z

          ACCUMULO-4004: Add new property for WALog max age, close log when age is reached.


          githubbot ASF GitHub Bot added a comment -

          Github user wjsl commented on the pull request:

          https://github.com/apache/accumulo/pull/84#issuecomment-203107908

          I would assume they're under src/test in your repo

          bills William Slacum added a comment -

          Thanks for the PR, Dave. Some folks I'm working with have been wanting this to make replication more predictive.

          elserj Josh Elser added a comment -

          Some folks I'm working with have been wanting this to make replication more predictive.

          Definitely a nice bonus which gets around a quirk of the replication impl.

          It should be possible to manually roll WALs so that files on decommissioning datanodes are closed and the decommissioning process can complete

          Eric Newton, do you have an explanation for why we even need to do this? I'd like to better understand why we need to change our code.

          elserj Josh Elser added a comment -

          Or if you know, Dave

          dlmarion Dave Marion added a comment -

          Basically decommissioning is broken right now in Hadoop 2.

          WALogs stay open until they hit the size threshold, which can take many hours or days in some cases. These open files will prevent a DN from finishing its decommissioning process[1]. If you stop the DN, then the WALog file will not be closed and you could lose data. You have to find the tservers that are writing to the WALog and stop them so that the WALog is closed.

          There is also another nasty bug[2] where the NN gives clients old locations of blocks that have been moved due to decommissioning. As you can imagine this can create all kinds of problems. Then, there is [3] with all of its related issues.

          With this patch, you can set the max age to the amount of time you are willing to wait for a DN to decommission (if you choose to take the risk of hitting [2]).

          [1] https://issues.apache.org/jira/browse/HDFS-3599
          [2] https://issues.apache.org/jira/browse/HDFS-8208
          [3] https://issues.apache.org/jira/browse/HDFS-8406
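
          To make the behavior described above concrete, here is a minimal sketch of an "age or size" roll decision. It is an illustration only, not the code from the pull request: the class name WalRollPolicy, its method shouldRoll, and the example thresholds are all hypothetical, and the actual Accumulo property name is not stated in this thread.

```java
import java.time.Duration;

/**
 * Hypothetical sketch (not Accumulo code): decides whether a write-ahead log
 * should be closed and replaced, based on its size or on how long it has been open.
 */
public final class WalRollPolicy {
    private final long maxSizeBytes;
    private final Duration maxAge;

    public WalRollPolicy(long maxSizeBytes, Duration maxAge) {
        if (maxSizeBytes <= 0 || maxAge.isZero() || maxAge.isNegative()) {
            throw new IllegalArgumentException("thresholds must be positive");
        }
        this.maxSizeBytes = maxSizeBytes;
        this.maxAge = maxAge;
    }

    /**
     * Returns true when the log has reached the size threshold OR has been
     * open for at least the maximum age, whichever happens first.
     */
    public boolean shouldRoll(long bytesWritten, long openedAtMillis, long nowMillis) {
        boolean tooBig = bytesWritten >= maxSizeBytes;
        boolean tooOld = nowMillis - openedAtMillis >= maxAge.toMillis();
        return tooBig || tooOld;
    }

    public static void main(String[] args) {
        // Assumed example values: 1 GiB size threshold, 1 hour max age.
        WalRollPolicy policy = new WalRollPolicy(1L << 30, Duration.ofHours(1));
        long openedAt = 0L;

        // A small log open for 30 minutes stays open.
        System.out.println(policy.shouldRoll(1024, openedAt, Duration.ofMinutes(30).toMillis())); // false

        // The same small log is rolled once it is 2 hours old.
        System.out.println(policy.shouldRoll(1024, openedAt, Duration.ofHours(2).toMillis())); // true
    }
}
```

          With a size-only check, the second call would also return false and the file would stay open on the decommissioning DN. Under this sketch, the age check bounds how long any one log file can stay open to roughly the configured max age.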

          elserj Josh Elser added a comment -

          Ok, thanks Dave. Good to know that it's specifically issues in HDFS that we're working around.

          githubbot ASF GitHub Bot added a comment -

          Github user dlmarion commented on the pull request:

          https://github.com/apache/accumulo/pull/84#issuecomment-204115464

          committed locally

          githubbot ASF GitHub Bot added a comment -

          Github user dlmarion closed the pull request at:

          https://github.com/apache/accumulo/pull/84

          elserj Josh Elser added a comment -

          Dave Marion, are there any relevant sections in the book that should be updated after this change? Have you given this any thought – it sounds like it would be important/useful for ops people to know about.

          dlmarion Dave Marion added a comment -

          it sounds like it would be important/useful for ops people to know about.

          Agreed. I had not considered updates to the book - I don't have a copy and I don't know if there are planned updates.

          elserj Josh Elser added a comment -

          Sorry, my HBase is coming out. I meant the user manual, not the O'Reilly book.

          dlmarion Dave Marion added a comment -

          Yes, I will create a ticket to add the documentation for the new property. Good catch.

          elserj Josh Elser added a comment -

          Thank you sir!


            People

            • Assignee:
              dlmarion Dave Marion
              Reporter:
              ecn Eric Newton
            • Votes:
              0
              Watchers:
              7

                Time Tracking

                Estimated:
                Not Specified
                Remaining:
                0h
                Logged:
                1h