HBase / HBASE-5222

Stopping replication via the "stop_replication" command in hbase shell on a slave cluster isn't acknowledged in the replication sink

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 0.90.4
    • Fix Version/s: None
    • Component/s: Replication, shell
    • Labels:
      None

      Description

      After running "stop_replication" in the hbase shell on our slave cluster, we saw replication continue for weeks. It turns out that the replication sink is missing a check of the replication state and therefore continues to write.
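
To illustrate the kind of check described above, here is a minimal sketch of a sink that consults a shared replication-state flag before applying a batch. It is not the actual HBase ReplicationSink: the class name `GuardedSinkSketch`, the flag, the method signature, and the use of plain strings as edits are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch only; names and types are not taken from HBase. */
public class GuardedSinkSketch {
    // Assumed shared flag: true while replication is enabled on this cluster.
    private final AtomicBoolean replicating;
    // Stand-in for the table writes a real sink would perform.
    private final List<String> applied = new ArrayList<>();

    public GuardedSinkSketch(AtomicBoolean replicating) {
        this.replicating = replicating;
    }

    /**
     * Applies the batch only while the flag is set.
     * Returns the number of edits applied (0 when replication is stopped).
     */
    public int replicateEntries(List<String> edits) {
        if (!replicating.get()) {
            // Replication is stopped: drop the incoming batch instead of writing it.
            return 0;
        }
        applied.addAll(edits);
        return edits.size();
    }

    public List<String> appliedEdits() {
        return applied;
    }
}
```

With such a guard, a batch that arrives after the flag is cleared is dropped rather than written, which is the behavior the missing check would have provided.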

        Activity

        Himanshu Vashishtha added a comment -

        Shouldn't this command be run on the master cluster instead?

        Jean-Daniel Cryans added a comment -

        stop_replication is a kill switch that should normally kill everything that's related to replication. In this case, it's not stopping the region servers from accepting incoming replication traffic.

        Himanshu Vashishtha added a comment -

        @JD: When you want to use replication, you ought to run these commands (plus setting hbase.replication to true in hbase-site.xml) on the master cluster. The slave cluster's configuration is not changed (in the case of simple master-slave replication).
        So, when hbase.replication is false (the default) on the slave, its replication-specific objects will be null, which makes these commands ineffective there; no?

        Josh Wymer added a comment -

        @HV, @JD: Please correct me if I'm wrong here. If you stop replication on the master, the logs are no longer stored to be pushed downstream as they would be with replication enabled. Instead, they are cleaned up based on the default timeout. If we need to stop replicating to a slave cluster for maintenance, etc., we don't want the master throwing away non-replicated logs (thinking it has no need to keep them). The bug, however, causes the slave to keep accepting logs even while disabled, although the other processes on the slave cluster respect the disabled flag.

        Himanshu Vashishtha added a comment -

        @Josh: If you want to do some maintenance on the slave cluster and resume replication once it is back, you don't need to pull the stop trigger. The master cluster's region servers will see that they can't connect to the slave cluster's region servers anymore and will keep waiting until they are up (a sleep/wake loop).
        But if you are also stopping the slave cluster's ZooKeeper, you might have to remove it and add it again, as the master cluster just stops caring about it then.
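
The sleep/wake loop mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not HBase's replication source code: the class `RetryUntilUpSketch`, the method `shipWithRetry`, and the linear sleep multiplier are hypothetical, and the sleeper is injected so the loop can be exercised without real waiting.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

/** Hypothetical sketch only; names and backoff policy are assumptions. */
public class RetryUntilUpSketch {
    /**
     * Keeps trying to ship a batch until tryShip reports success.
     * Between failed attempts it sleeps baseSleepMs * multiplier,
     * where the multiplier grows by one per failure up to maxMultiplier.
     * Returns the number of attempts made.
     */
    public static int shipWithRetry(BooleanSupplier tryShip, LongConsumer sleeper,
                                    long baseSleepMs, int maxMultiplier) {
        int attempts = 0;
        int multiplier = 1;
        while (true) {
            attempts++;
            if (tryShip.getAsBoolean()) {
                return attempts; // peer reachable again, batch shipped
            }
            // Peer unreachable: keep the batch, wait, then try again.
            sleeper.accept(baseSleepMs * multiplier);
            if (multiplier < maxMultiplier) {
                multiplier++;
            }
        }
    }
}
```

In real use the sleeper would wrap something like `Thread.sleep`; the point of the sketch is only that the sender holds on to its batch and retries instead of discarding it.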

        There is also an upstream JIRA for enabling/disabling a particular peer (HBASE-3134).
        So, afaik, running commands on the slave cluster is futile, as it's the master cluster that does all the work.

        PS: This is based on a "few days of use plus one day of code digging (yesterday)" experience. So, let's see what JD says.

        Jean-Daniel Cryans added a comment -

        "So, let's see what JD says."

        Here he goes:

        "When you want to use replication, you ought to run these commands"

        Not sure which commands you're talking about. In the specific case of stop_replication, it's a kill switch in the proper sense (quote from Wikipedia):

        "a kill switch is designed and configured to a) completely abort the operation at all costs and b) be operable in a manner that is quick, simple (so that even a panicking user with impaired executive function can operate it), and, usually, c) be obvious even to an untrained operator or a bystander"

        We hit a) and b); the c) part might not be there yet. The issue here is that the command is respected on the master cluster (when run there) but not on the slave cluster (when run there).

        "If you stop replication on the master, the logs are no longer stored to be pushed down stream like they would with replication enabled."

        Yep.

        "The bug, however, causes the slave to keep accepting logs even while disabled although the other processes on slave cluster respect the disabled flag"

        Since it's a kill switch, what's going to happen is that the slave cluster is going to drop the log edits. This is not what you want; what you want is HBASE-3134.

        "So, afaik, running commands on the slave cluster are futile as its the master cluster which does all the work."

        I think you understand the issue here reasonably well, and indeed most of the commands won't do anything on the slave cluster, except that here the kill switch should stop all replication-related activity, including applying incoming logs.

        Jean-Daniel Cryans added a comment -

        The kill switch was completely removed, closing.


          People

          • Assignee:
            Unassigned
            Reporter:
            Josh Wymer
          • Votes:
            0
            Watchers:
            3
