HBase
  1. HBase
  2. HBASE-3937

Region PENDING-OPEN timeout with un-expected ZK node state leads to an endless loop

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Invalid
    • Affects Version/s: 0.90.3
    • Fix Version/s: None
    • Component/s: master
    • Labels:
      None

      Description

      I describe the scenario of how this problem happened:

      1.HMaster assigned the region A to RS1. So the RegionState was set to PENDING_OPEN.
      2.For there's too many opening requests, the open process on RS1 was blocked.
      3.Some time later, TimeoutMonitor found the assigning of A was timeout. For the RegionState was in PENDING_OPEN, went into the following handler process(Just put the region into an waiting-assigning set):

      case PENDING_OPEN:
      LOG.info("Region has been PENDING_OPEN for too " +
      "long, reassigning region=" +
      regionInfo.getRegionNameAsString());
      assigns.put(regionState.getRegion(), Boolean.TRUE);
      break;
      So we can see that, under this case, we consider the ZK node state was OFFLINE. Indeed, in an normal disposal, it's OK.

      4.But before the real-assigning, the requests of RS1 was disposed. So that affected the new-assigning. For it update the ZK node state from OFFLINE to OPENING.

      5.The new assigning started, so it send region to open in RS2. But while the opening, it should update the ZK node state from OFFLINE to OPENING. For the current state is OPENING, so this operation failed.
      So this region couldn't be open success anymore.

      So I think, to void this problem , under the case of PENDING_OPEN of TiemoutMonitor, we should transform the ZK node state to OFFLINE first.

        Activity

        Hide
        stack added a comment -

        Resolving as invalid. A bunch of work has been done in AM since 0.90. This issue if it still exists will have a new form.

        Show
        stack added a comment - Resolving as invalid. A bunch of work has been done in AM since 0.90. This issue if it still exists will have a new form.
        Hide
        Lars Hofhansl added a comment -

        @Jieshan, J-D: are you guys still working on this? Is it still a problem?

        Show
        Lars Hofhansl added a comment - @Jieshan, J-D: are you guys still working on this? Is it still a problem?
        Hide
        Lars Hofhansl added a comment -

        No movement in over 8 months. Moving out of 0.94.

        Show
        Lars Hofhansl added a comment - No movement in over 8 months. Moving out of 0.94.
        Hide
        stack added a comment -

        Moving out of 0.90.4 at J-D's suggestion. Move it back in if you think different Jieshan.

        Show
        stack added a comment - Moving out of 0.90.4 at J-D's suggestion. Move it back in if you think different Jieshan.
        Hide
        Jean-Daniel Cryans added a comment -

        I'm not sure that patch would be better, starting with the fact that it copies a bunch of code from the next switch case.

        Thinking more about this problem, I believe that in your original case you almost had a double assignment (and the patch you propose would really make it a double assignment).

        Let's say the region times out on PENDING_OPEN but by the time it gets processed it's already opened by the RS. What you had originally is that it will keep bouncing because RS2 can't open the region, but now it should be able to assign it since the ZK state is cleared.

        It's still unclear to me why your RS1 didn't go through and finally opened it (it should be in your logs tho), but we have to consider both possibilities.

        I'm starting to think that there won't be any easy solution, we need to rewrite how TimeoutMonitor does its thing. Anything else would just be bandaids that will never fix all the problems.

        The way it should work is the following:

        • It should not create a list of unassigns and assigns, since by the time the list is processed the situation probably changed (I witnessed that a lot).
        • This means the action should be taken as we go through first loop.
        • One of the major issues is the lack of atomicity, so any action taken should first check the current state, keep the version number, decide of the corrective measure and update the znode by expecting the version it first got.
        • If the updating of the znode is successful, we know for sure that the operation will be seen by the region servers.
        • If it's not successful, the situation needs to be reassessed.

        This is clearly not something for 0.90, that's one of the reasons in 0.90.3 we set the timeout much higher than 30 seconds. That's my conclusion at the end of HBASE-3669.

        Show
        Jean-Daniel Cryans added a comment - I'm not sure that patch would be better, starting with the fact that it copies a bunch of code from the next switch case. Thinking more about this problem, I believe that in your original case you almost had a double assignment (and the patch you propose would really make it a double assignment). Let's say the region times out on PENDING_OPEN but by the time it gets processed it's already opened by the RS. What you had originally is that it will keep bouncing because RS2 can't open the region, but now it should be able to assign it since the ZK state is cleared. It's still unclear to me why your RS1 didn't go through and finally opened it (it should be in your logs tho), but we have to consider both possibilities. I'm starting to think that there won't be any easy solution, we need to rewrite how TimeoutMonitor does its thing. Anything else would just be bandaids that will never fix all the problems. The way it should work is the following: It should not create a list of unassigns and assigns, since by the time the list is processed the situation probably changed (I witnessed that a lot). This means the action should be taken as we go through first loop. One of the major issues is the lack of atomicity, so any action taken should first check the current state, keep the version number, decide of the corrective measure and update the znode by expecting the version it first got. If the updating of the znode is successful, we know for sure that the operation will be seen by the region servers. If it's not successful, the situation needs to be reassessed. This is clearly not something for 0.90, that's one of the reasons in 0.90.3 we set the timeout much higher than 30 seconds. That's my conclusion at the end of HBASE-3669 .
        Hide
        Jieshan Bean added a comment -

        How about just modify the case of PENDING_OPEN as following? Or just modified the assign method suggested by J-D?

         case PENDING_OPEN:
                        LOG.info("Region has been PENDING_OPEN for too " +
                            "long, reassigning region=" +
                            regionInfo.getRegionNameAsString());
                        
                        // when is the ZK of state OPENING or others,Change into OFFLINE
                        String pendingNode = ZKAssign.getNodeName(watcher,
                            regionInfo.getEncodedName());
                        Stat pendingStat = new Stat();
                        try {
                          RegionTransitionData pendingData = ZKAssign.getDataNoWatch(
                              watcher, pendingNode, pendingStat);
                          if ((null != pendingData)
                              && (pendingData.getEventType() != EventType.M_ZK_REGION_OFFLINE)) {
                            pendingData = new RegionTransitionData(
                                EventType.M_ZK_REGION_OFFLINE,
                                regionInfo.getRegionName(), master.getServerName());
                            if (ZKUtil.setData(watcher, pendingNode,
                                pendingData.getBytes(), pendingStat.getVersion())) {
                              // Node is now OFFLINE, let's trigger another assignment
                              ZKUtil.getDataAndWatch(watcher, pendingNode);
                              LOG.info("Successfully transitioned region="
                                  + regionInfo.getRegionNameAsString() + " from "
                                  + pendingData.getEventType()
                                  + " to OFFLINE and forcing a new assignment.");
                            }
                          }
                        } catch (KeeperException ke) {
                          LOG.error("ZK KeeperException timing out CLOSING region", ke);
                        }
                        
                        assigns.put(regionState.getRegion(), Boolean.TRUE);
                        break;
        
        Show
        Jieshan Bean added a comment - How about just modify the case of PENDING_OPEN as following? Or just modified the assign method suggested by J-D? case PENDING_OPEN: LOG.info("Region has been PENDING_OPEN for too " + "long, reassigning region=" + regionInfo.getRegionNameAsString()); // when is the ZK of state OPENING or others,Change into OFFLINE String pendingNode = ZKAssign.getNodeName(watcher, regionInfo.getEncodedName()); Stat pendingStat = new Stat(); try { RegionTransitionData pendingData = ZKAssign.getDataNoWatch( watcher, pendingNode, pendingStat); if ((null != pendingData) && (pendingData.getEventType() != EventType.M_ZK_REGION_OFFLINE)) { pendingData = new RegionTransitionData( EventType.M_ZK_REGION_OFFLINE, regionInfo.getRegionName(), master.getServerName()); if (ZKUtil.setData(watcher, pendingNode, pendingData.getBytes(), pendingStat.getVersion())) { // Node is now OFFLINE, let's trigger another assignment ZKUtil.getDataAndWatch(watcher, pendingNode); LOG.info("Successfully transitioned region=" + regionInfo.getRegionNameAsString() + " from " + pendingData.getEventType() + " to OFFLINE and forcing a new assignment."); } } } catch (KeeperException ke) { LOG.error("ZK KeeperException timing out CLOSING region", ke); } assigns.put(regionState.getRegion(), Boolean.TRUE); break;
        Hide
        Jieshan Bean added a comment -

        I have thought about that. But I'm afraid another problem. For it force all the ZK nodes related to the regions in RIT to Offline each time. If the original state is Offline, it will reset again. I don't know whether it is a problem.

        Show
        Jieshan Bean added a comment - I have thought about that. But I'm afraid another problem. For it force all the ZK nodes related to the regions in RIT to Offline each time. If the original state is Offline, it will reset again. I don't know whether it is a problem.
        Hide
        Jean-Daniel Cryans added a comment -

        Yeah TimeoutMonitor is really prone to race conditions...

        So would changing this line in TimeoutMonitor.chore fix it?

        for (Map.Entry<HRegionInfo, Boolean> e: assigns.entrySet()){
          assign(e.getKey(), false, e.getValue());
        }
        

        to

        for (Map.Entry<HRegionInfo, Boolean> e: assigns.entrySet()){
          assign(e.getKey(), true, e.getValue());
        }
        
        Show
        Jean-Daniel Cryans added a comment - Yeah TimeoutMonitor is really prone to race conditions... So would changing this line in TimeoutMonitor.chore fix it? for (Map.Entry<HRegionInfo, Boolean > e: assigns.entrySet()){ assign(e.getKey(), false , e.getValue()); } to for (Map.Entry<HRegionInfo, Boolean > e: assigns.entrySet()){ assign(e.getKey(), true , e.getValue()); }

          People

          • Assignee:
            Jieshan Bean
            Reporter:
            Jieshan Bean
          • Votes:
            1 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development