HBASE-9303

Snapshot restore of table which splits after snapshot was taken encounters 'Region is not online'

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.95.2
    • Fix Version/s: 0.98.0, 0.94.12, 0.96.0
    • Component/s: None
    • Labels: None

      Description

      Take a snapshot of a table ('tablethree' in the log).
      Put some data in the table and split the table.
      Restore the snapshot.
      The table cannot be enabled due to:

      Thu Aug 22 19:37:20 UTC 2013, org.apache.hadoop.hbase.client.RpcRetryingCaller@47a6ac39, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region is not online: c32e63d8c8a1a94b68966645b956d86d
      	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2557)
      	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3921)
      	at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:2996)
      	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26847)
      
      Attachments

      1. hbase-hbase-master-hor15n02.log (1.03 MB, Ted Yu)
      2. hbase-hbase-regionserver-hor15n02.log (2.39 MB, Ted Yu)
      3. 9303.shell-output (3 kB, Ted Yu)
      4. HBASE-9303-trunk-v0.patch (3 kB, Matteo Bertozzi)
      5. HBASE-9303-0.94-v0.patch (3 kB, Matteo Bertozzi)
      6. HBASE-9303-0.94-v1.patch (3 kB, Matteo Bertozzi)
      7. HBASE-9303-trunk-v1.patch (3 kB, Matteo Bertozzi)
      8. HBASE-9303-trunk-v2.patch (5 kB, Matteo Bertozzi)
      9. HBASE-9303-0.94-v2.patch (4 kB, Matteo Bertozzi)

        Activity

        Ted Yu added a comment -

        Log files.

        Snapshot for tablethree was taken.
        When restoring, region c32e63d8c8a1a94b68966645b956d86d couldn't get online.

        Matteo Bertozzi added a comment -

        Added a safe patch to avoid all the problems: it resets all the region states to OFFLINE and completely clears .META. for the restored table.

        Even though 0.94 doesn't seem to be affected, probably because its AssignmentManager is less strict, I'm going to apply the patch to 0.94 too. The patch should be the same.

        Jonathan Hsieh added a comment -

        What is special about this c32e63 region? Is it a split parent or a daughter?

        Matteo Bertozzi added a comment -

        Jonathan Hsieh: it is the split parent. The assignment manager will not try to assign a region marked SPLIT, and the restore does not rewrite the region as a "normal region"; it leaves the SPLIT attribute in place.

        Ted Yu added a comment -

        It was the parent:

        2013-08-22 09:39:22,234 INFO  [AM.ZK.Worker-pool-2-thread-1293] master.RegionStates: Transitioned from {c32e63d8c8a1a94b68966645b956d86d state=OPEN, ts=1377164322342, server=hor15n02,60020,1377152759584} to {c32e63d8c8a1a94b68966645b956d86d state=SPLITTING, ts=1377164362234, server=hor15n02,60020,1377152759584}
        
        Jonathan Hsieh added a comment -

        Ted Yu, Matteo Bertozzi: can you explain the internal chain of events and what the root cause of the problem is? All I know currently is that it involves restore, split, and the contents of meta.

        Jonathan Hsieh added a comment -

        Ok, so we have a split parent with the SPLIT marker in the hri in meta. Is this the hri saved off as part of the snapshot manifest, or is it only in the meta hri? Is the problem only that the meta entry has SPLIT as an attribute?

        Could we have a unit test where we force meta to have the split marker and then restore?

        Matteo Bertozzi added a comment -

        Let me try to explain the problem.

        The restore tries to do a diff between the current state and the snapshot state.
        Snapshot regions already present in the current state are not removed from META.
        This means that META will contain extra information from the current state (not present at the time of the snapshot).

        In this specific case, the parent split region is now marked as SPLIT, so the assignment manager knows that this region should never be assigned again.
        At this point we restore: the current split parent goes back to being a normal region, but in META and in the assignment manager's in-memory state it is still marked as split, so it is not assigned.
        As a result, the table has a missing region that will never go online.
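
The chain of events described above can be illustrated with a small toy model. This is only a sketch and not HBase code: the class `StaleSplitMarkerDemo`, the methods `diffBasedRestore` and `isAssignable`, and the region names are all hypothetical, and a META row is reduced to a single boolean "marked SPLIT" flag. It assumes, as the explanation suggests, that a diff-based restore reuses the existing row for any region present in both the snapshot and the current state.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of the catalog (META) rows of one table; not HBase code. */
public class StaleSplitMarkerDemo {

    /**
     * Hypothetical diff-based restore. Each map goes from region name to
     * "is this region marked SPLIT?".
     */
    static Map<String, Boolean> diffBasedRestore(Map<String, Boolean> currentMeta,
                                                 Map<String, Boolean> snapshotRegions) {
        Map<String, Boolean> restored = new LinkedHashMap<>();
        for (String region : snapshotRegions.keySet()) {
            if (currentMeta.containsKey(region)) {
                // Region exists in both states: the current row is reused as-is,
                // so a SPLIT marker added after the snapshot survives the restore.
                restored.put(region, currentMeta.get(region));
            } else {
                restored.put(region, snapshotRegions.get(region));
            }
        }
        // Regions that exist only in the current state (the daughters) are dropped.
        return restored;
    }

    /** In this model the assignment manager never assigns a SPLIT region. */
    static boolean isAssignable(Map<String, Boolean> meta, String region) {
        return meta.containsKey(region) && !meta.get(region);
    }

    public static void main(String[] args) {
        // Snapshot taken before the split: one ordinary region.
        Map<String, Boolean> snapshot = new LinkedHashMap<>();
        snapshot.put("parent", false);

        // Current state after the split: parent marked SPLIT, two daughters.
        Map<String, Boolean> current = new LinkedHashMap<>();
        current.put("parent", true);
        current.put("daughterA", false);
        current.put("daughterB", false);

        Map<String, Boolean> restored = diffBasedRestore(current, snapshot);
        System.out.println("restored META: " + restored);
        System.out.println("parent assignable: " + isAssignable(restored, "parent"));
    }
}
```

In this model the restored table consists of the parent region only, yet the parent still carries the stale SPLIT flag, so nothing is left that the assignment manager would bring online.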

        Ted Yu added a comment -

        +1 on patch v1.

        Jonathan Hsieh added a comment -

        OK, this explanation is really helpful, and I buy it. There are some subtle things going on here, so can we add comments in the code about why there are two mutate calls and why we must first delete and then rewrite (instead of reusing like we did before)?
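
One possible reading of the "delete, then rewrite" ordering can be sketched with a toy catalog in which a put only overwrites the columns it writes. This is an illustration under assumptions, not the content of the attached patches: the class `DeleteThenRewriteDemo`, its methods, and the column names (`regioninfo`, `splitA`, `splitB`) are hypothetical and chosen only for the example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy catalog table: row key -> (column -> value). Not HBase code. */
public class DeleteThenRewriteDemo {

    /** Like a put: writes the given columns and leaves other columns of the row alone. */
    static void put(Map<String, Map<String, String>> meta, String row,
                    Map<String, String> columns) {
        meta.computeIfAbsent(row, k -> new TreeMap<>()).putAll(columns);
    }

    /** First mutation: delete every row currently stored for the table. */
    static void deleteAll(Map<String, Map<String, String>> meta) {
        meta.clear();
    }

    /** Second mutation: write fresh rows built only from the snapshot. */
    static void rewrite(Map<String, Map<String, String>> meta,
                        Map<String, Map<String, String>> snapshotRows) {
        snapshotRows.forEach((row, cols) -> put(meta, row, cols));
    }

    /** Catalog after the split: parent marked as split, plus two daughters. */
    static Map<String, Map<String, String>> currentMeta() {
        Map<String, Map<String, String>> meta = new TreeMap<>();
        put(meta, "parent", Map.of("regioninfo", "split=true",
                                   "splitA", "daughterA", "splitB", "daughterB"));
        put(meta, "daughterA", Map.of("regioninfo", "split=false"));
        put(meta, "daughterB", Map.of("regioninfo", "split=false"));
        return meta;
    }

    /** Rows as recorded at snapshot time: a single ordinary region. */
    static Map<String, Map<String, String>> snapshotRows() {
        Map<String, Map<String, String>> rows = new TreeMap<>();
        rows.put("parent", Map.of("regioninfo", "split=false"));
        return rows;
    }

    public static void main(String[] args) {
        // Rewriting in place: leftover columns and daughter rows survive.
        Map<String, Map<String, String>> reused = currentMeta();
        rewrite(reused, snapshotRows());
        System.out.println("rewrite only:        " + reused);

        // Deleting first guarantees the catalog holds only snapshot state.
        Map<String, Map<String, String>> clean = currentMeta();
        deleteAll(clean);
        rewrite(clean, snapshotRows());
        System.out.println("delete then rewrite: " + clean);
    }
}
```

In this model, rewriting alone leaves the daughter rows and the parent's split-related columns behind, while deleting first leaves a catalog that contains only what the snapshot recorded.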

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12599529/HBASE-9303-trunk-v1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        +1 site. The mvn site goal succeeds with this patch.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/6849//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/6849//console

        This message is automatically generated.

        Jonathan Hsieh added a comment -

        v2 is lovely. +1.

        Lars Hofhansl added a comment -

        Looks good. Need this in 0.94 as well.
        (It's a bit disconcerting that we have missed this up to now)

        Hudson added a comment -

        SUCCESS: Integrated in HBase-0.94-security #270 (See https://builds.apache.org/job/HBase-0.94-security/270/)
        HBASE-9303 Snapshot restore of table which splits after snapshot was taken encounters 'Region is not online' (mbertozzi: rev 1517021)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/snapshot/RestoreSnapshotHandler.java
        Hudson added a comment -

        FAILURE: Integrated in HBase-0.94 #1124 (See https://builds.apache.org/job/HBase-0.94/1124/)
        HBASE-9303 Snapshot restore of table which splits after snapshot was taken encounters 'Region is not online' (mbertozzi: rev 1517021)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/snapshot/RestoreSnapshotHandler.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #695 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/695/)
        HBASE-9303 Snapshot restore of table which splits after snapshot was taken encounters 'Region is not online' (mbertozzi: rev 1517022)

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/RestoreSnapshotHandler.java
        Hudson added a comment -

        FAILURE: Integrated in hbase-0.95 #490 (See https://builds.apache.org/job/hbase-0.95/490/)
        HBASE-9303 Snapshot restore of table which splits after snapshot was taken encounters 'Region is not online' (mbertozzi: rev 1517020)

        • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/RestoreSnapshotHandler.java
        Hudson added a comment -

        FAILURE: Integrated in hbase-0.95-on-hadoop2 #271 (See https://builds.apache.org/job/hbase-0.95-on-hadoop2/271/)
        HBASE-9303 Snapshot restore of table which splits after snapshot was taken encounters 'Region is not online' (mbertozzi: rev 1517020)

        • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/RestoreSnapshotHandler.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-TRUNK #4431 (See https://builds.apache.org/job/HBase-TRUNK/4431/)
        HBASE-9303 Snapshot restore of table which splits after snapshot was taken encounters 'Region is not online' (mbertozzi: rev 1517022)

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/RestoreSnapshotHandler.java

          People

          • Assignee: Matteo Bertozzi
          • Reporter: Ted Yu
          • Votes: 0
          • Watchers: 6
