Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.4, 1.5.0
    • Fix Version/s: 1.4.5, 1.5.1, 1.6.0
    • Component/s: None
    • Labels: None

      Description

      Specifically, make sure DFS clients behave properly in the presence of HDFS HA failover.

      1. ACCUMULO-1794.1.patch.txt
        10 kB
        Sean Busbey
      2. ACCUMULO-1794.2.patch.txt
        11 kB
        Sean Busbey

          Activity

          ASF subversion and git services added a comment -

          Commit cc4e794b784c3a436c8fe5f81cb2183e09a853e3 in branch refs/heads/master from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=cc4e794 ]

          ACCUMULO-1960 ACCUMULO-1794 Add in a default for HADOOP_PREFIX in continuous-env.sh.example

          ASF subversion and git services added a comment -

          Commit cc4e794b784c3a436c8fe5f81cb2183e09a853e3 in branch refs/heads/1.6.0-SNAPSHOT from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=cc4e794 ]

          ACCUMULO-1960 ACCUMULO-1794 Add in a default for HADOOP_PREFIX in continuous-env.sh.example

          ASF subversion and git services added a comment -

          Commit cc4e794b784c3a436c8fe5f81cb2183e09a853e3 in branch refs/heads/1.5.1-SNAPSHOT from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=cc4e794 ]

          ACCUMULO-1960 ACCUMULO-1794 Add in a default for HADOOP_PREFIX in continuous-env.sh.example

          ASF subversion and git services added a comment -

          Commit cc4e794b784c3a436c8fe5f81cb2183e09a853e3 in branch refs/heads/1.4.5-SNAPSHOT from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=cc4e794 ]

          ACCUMULO-1960 ACCUMULO-1794 Add in a default for HADOOP_PREFIX in continuous-env.sh.example

          ASF subversion and git services added a comment -

          Commit c2cd0518a5c2489f568814206c0681ef330df7c7 in branch refs/heads/master from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=c2cd051 ]

          ACCUMULO-1960 ACCUMULO-1794 Add in a default for HADOOP_PREFIX in continuous-env.sh.example

          ASF subversion and git services added a comment -

          Commit c2cd0518a5c2489f568814206c0681ef330df7c7 in branch refs/heads/1.6.0-SNAPSHOT from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=c2cd051 ]

          ACCUMULO-1960 ACCUMULO-1794 Add in a default for HADOOP_PREFIX in continuous-env.sh.example

          ASF subversion and git services added a comment -

          Commit c2cd0518a5c2489f568814206c0681ef330df7c7 in branch refs/heads/1.5.1-SNAPSHOT from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=c2cd051 ]

          ACCUMULO-1960 ACCUMULO-1794 Add in a default for HADOOP_PREFIX in continuous-env.sh.example

          Josh Elser added a comment -

          >> Can we find namenodes for the namespaces configured?
          > This is covered in the current error handling.

          Currently the script only checks the namenodes when it goes to agitate the given namespace. My initial point was that it would be better to validate the namespaces up front once, instead of every time hdfs-agitator does work. Although, I suppose it would be possible that someone changes the configuration out from underneath us (standby NN gets bounced)? Either way, just a passing thought – not a huge issue.

          > If Hadoop adds more than 2 namenodes per nameservice in the future

          Haha, I didn't realize that :X
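
          The "validate the namespaces up front once" idea above could look roughly like the following. This is only a sketch in Python, not the agitator script itself: the function name, the exception class, and the injected lookup_namenodes callable are all hypothetical. In a real setup the lookup might wrap something like `hdfs getconf -confKey dfs.ha.namenodes.<nameservice>`, which is an assumption about how it would be wired in.

```python
from typing import Callable, Dict, Iterable, List


class NameserviceConfigError(Exception):
    """Raised when a nameservice cannot be agitated as configured (hypothetical)."""


def validate_nameservices(
    nameservices: Iterable[str],
    lookup_namenodes: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Resolve every nameservice to its namenode ids once, before agitating.

    lookup_namenodes is injected so the check can be exercised without a
    cluster; it is an assumption of this sketch, not part of any patch.
    """
    resolved: Dict[str, List[str]] = {}
    problems: List[str] = []
    for ns in nameservices:
        namenodes = lookup_namenodes(ns)
        if len(namenodes) < 2:
            # Failover needs at least an active and a standby namenode.
            problems.append(
                f"{ns}: found {len(namenodes)} namenode(s), need at least 2"
            )
        else:
            resolved[ns] = list(namenodes)
    if not resolved and not problems:
        problems.append("no nameservices configured")
    if problems:
        # Report every problem at once instead of failing on the first.
        raise NameserviceConfigError("; ".join(problems))
    return resolved
```

          Since the configuration can change underneath a long-running agitator, a check like this would complement the per-iteration error handling rather than replace it.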

          Sean Busbey added a comment -

          > 1. Alternate between using haadmin and kill -9'ing the Namenode. We shouldn't see a difference here, but it would be nice to test coordinated failover and automatic failover

          As I mentioned to Keith Turner in the review, doing this will require a heuristic. We can ask the HDFS admin tools for the hostname corresponding to the namenode id, but picking out the namenode process will be version dependent. I think he and I agreed that that sort of thing was better left to something like BigTop, since it attempts to work across projects.

          > 2. Some more validation before anything else: Can the user sudo to the hdfs admin user as they claim?

          Opened as ACCUMULO-1982, about using sudo to users generally.

          > Do the executables (hdfs, sudo) exist?

          The existing tests for executability should cover this, no? Or are you looking for more specific error messages?

          > Does the namespace provided exist (or can we find any namespaces if we're using all of them)?

          Both of these cases are handled by the current error checking, though the error message for the former is confusing (it complains of a missing configuration value).

          > Can we find namenodes for the namespaces configured?

          This is covered in the current error handling.

          > The only other thing I'm curious about is when the script tries to choose a random namenode to make active, could we ever get in that block while ZKFC is in the middle of transition? In other words, is it possible to have no active namenodes while automatic failover is happening and we get an error because we try to force the transition?

          Yes, this is certainly possible. As things currently are, we'll simply log a message that this happened and try again the next time around. I couldn't think of anything else worth doing in that case.

          Note that it's also possible for an automatic failover to have changed which namenode is active while we are in the block that says to use the failover command. In that case, if there are only 2 namenodes we'll just do a no-op failover that says everything went fine. If Hadoop adds more than 2 namenodes per nameservice in the future, then I don't know what it will do, but I know we'll log it and try again later.
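
          The "log it and try again next time" behaviour described above can be sketched as a single agitation step. This is an illustrative Python sketch under stated assumptions, not the patch: agitate_once and the injected get_state, failover, and make_active callables are hypothetical names. In practice they might wrap `hdfs haadmin -getServiceState`, `-failover`, and `-transitionToActive`, but that wiring is assumed here.

```python
import logging
import random
from typing import Callable, Sequence

log = logging.getLogger("hdfs-agitator-sketch")


def agitate_once(
    namenodes: Sequence[str],
    get_state: Callable[[str], str],
    failover: Callable[[str, str], None],
    make_active: Callable[[str], None],
    rng=random,
) -> str:
    """Run one agitation round; never raise, so the caller can simply loop."""
    try:
        states = {nn: get_state(nn) for nn in namenodes}
        active = [nn for nn, state in states.items() if state == "active"]
        if len(active) == 1:
            # Coordinated failover from the active namenode to a standby.
            # If automatic failover already moved the active role, this may
            # turn out to be a no-op or fail; both cases end up logged.
            standbys = [nn for nn in namenodes if nn != active[0]]
            failover(active[0], rng.choice(standbys))
            return "failover"
        if not active:
            # No active namenode seen (possibly mid-transition): pick one.
            make_active(rng.choice(list(namenodes)))
            return "forced-active"
        log.warning("more than one active namenode reported: %s", active)
        return "skipped"
    except Exception as exc:  # deliberately broad: log and retry later
        log.warning("agitation attempt failed, will retry next round: %s", exc)
        return "retry-later"
```

          Catching every exception keeps the agitator running through races with automatic failover, at the cost of hiding persistent misconfiguration unless someone reads the log.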

          Josh Elser added a comment -

          Thanks, Sean Busbey!

          ASF subversion and git services added a comment -

          Commit 19a48da092c6412be93d2c0cae1006cb896303db in branch refs/heads/master from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=19a48da ]

          ACCUMULO-1794 Unnecessary pkill invocation. Existing call will catch all necessary.

          ASF subversion and git services added a comment -

          Commit 19a48da092c6412be93d2c0cae1006cb896303db in branch refs/heads/1.6.0-SNAPSHOT from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=19a48da ]

          ACCUMULO-1794 Unnecessary pkill invocation. Existing call will catch all necessary.

          ASF subversion and git services added a comment -

          Commit 19a48da092c6412be93d2c0cae1006cb896303db in branch refs/heads/1.5.1-SNAPSHOT from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=19a48da ]

          ACCUMULO-1794 Unnecessary pkill invocation. Existing call will catch all necessary.

          ASF subversion and git services added a comment -

          Commit 19a48da092c6412be93d2c0cae1006cb896303db in branch refs/heads/1.4.5-SNAPSHOT from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=19a48da ]

          ACCUMULO-1794 Unnecessary pkill invocation. Existing call will catch all necessary.

          ASF subversion and git services added a comment -

          Commit 872fd1dfb252e45560b5547aad43399fe433f1a1 in branch refs/heads/1.5.1-SNAPSHOT from Sean Busbey
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=872fd1d ]

          ACCUMULO-1794 adds hdfs failover to continuous integration test.

          Signed-off-by: Josh Elser <elserj@apache.org>

          ASF subversion and git services added a comment -

          Commit 872fd1dfb252e45560b5547aad43399fe433f1a1 in branch refs/heads/1.4.5-SNAPSHOT from Sean Busbey
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=872fd1d ]

          ACCUMULO-1794 adds hdfs failover to continuous integration test.

          Signed-off-by: Josh Elser <elserj@apache.org>

          Josh Elser added a comment -

          Just applied locally and did a little bit of testing on 1.5.1 – Overall looks good to me, but I did think of a couple of things that I'd like to see eventually (or at least get some feedback on the ideas).

          1. Alternate between using haadmin and kill -9'ing the Namenode. We shouldn't see a difference here, but it would be nice to test coordinated failover and automatic failover
          2. Some more validation before anything else: Can the user sudo to the hdfs admin user as they claim? Do the executables (hdfs, sudo) exist? Does the namespace provided exist (or can we find any namespaces if we're using all of them)? Can we find namenodes for the namespaces configured?

          The only other thing I'm curious about is when the script tries to choose a random namenode to make active, could we ever get in that block while ZKFC is in the middle of transition? In other words, is it possible to have no active namenodes while automatic failover is happening and we get an error because we try to force the transition?

          Josh Elser added a comment -

          Just because it was fun – I just saw a case where the master was killed, then a Namenode failover happened, followed by a datanode being killed. Afterward, I saw some table problems on a bad datanode in the DFS pipeline, which then proceeded to clear itself.

          In other words, looking good so far.

          Josh Elser added a comment -

          I didn't get to test this today – I'll put it on my list for tomorrow and try to get this merged in to add to the 1.6.0 testing "suite"

          Sean Busbey added a comment -

          Updated patch post-review.

          Sean Busbey added a comment -

          Yeah, I've got a version that handles the all-as-single user case. I'll update today.

          A new ticket to move the DataNode process agitation into the hdfs agitator would be good. I realized that the 1.4.x version of agitator.pl doesn't actually do any messing with DataNodes, so doing it in this ticket would essentially require a 1.4 and a 1.5+ version of patches.

          Josh Elser added a comment -

          Sean Busbey, looking at the RB for this, the only thing still open is our discussion about reworking agitator.pl to sync up with your changes here. Are you happy with the state up on RB? If so, we can open a new issue to track reworking the (current) agitator.

          One thing I will note: after ACCUMULO-1960, the script should probably be smart enough to handle the case where a single user is running all components in the stack, rather than forcing them to start the agitator as root.

          Sean Busbey added a comment -

          Attaching initial patch, in case review board is having issues.


            People

            • Assignee: Sean Busbey
            • Reporter: Sean Busbey
            • Votes: 0
            • Watchers: 3
