HBASE-10556

Possible data loss due to non-handled DroppedSnapshotException for user-triggered flush from client/shell

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.96.2, 0.98.1, 0.99.0
    • Component/s: regionserver
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      While reviewing code during the investigation of HBASE-10499, we found that a user-triggered flush can lose data because DroppedSnapshotException is not handled.

      Data loss can happen as follows:

      1. A flush for some region is triggered via HBaseAdmin or the shell.
      2. The request reaches the regionserver and eventually HRegion.internalFlushcache is called. It fails while persisting the memstore's snapshot to an hfile, DroppedSnapshotException is thrown, and the snapshot is left uncleared.
      3. DroppedSnapshotException is not handled in HRegion; it is just wrapped in a ServiceException and returned to the client.
      4. After a while, some new writes are handled and put in the current memstore, then a new flush is triggered for the region because memstoreSize exceeds the flush threshold.
      5. This second (new) flush succeeds. For the HStore that failed in the previous user-triggered flush, the leftover non-empty snapshot is used rather than a new snapshot made from the current memstore, but HLog's latest sequenceId is used for the resulting hfiles. The sequenceId attached to the hfiles says that all edits with sequenceId <= it have been persisted, which is not true for the edits still in the current memstore.
      6. Now the regionserver hosting this region dies.
      7. During the replay phase of failover, the edits that were still in the memstore and not actually persisted in hfiles when the previous regionserver died are ignored, since comparison with the hfiles' latest sequenceId deems them already persisted. These edits are lost.
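
      The sequence above can be reproduced with a toy model. This is only a sketch under stated assumptions: FlushLossSketch and all of its fields and methods are hypothetical names invented for illustration, a plain IllegalStateException stands in for DroppedSnapshotException, and edits are reduced to their sequence ids. It is not HBase code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a single store; hypothetical, not HBase code. */
class FlushLossSketch {
    final List<Long> memstore = new ArrayList<>();   // sequence ids of edits not yet snapshotted
    List<Long> snapshot = new ArrayList<>();         // snapshot waiting to be persisted
    final List<Long> persisted = new ArrayList<>();  // edits actually written to "hfiles"
    final List<Long> wal = new ArrayList<>();        // every edit appended to the log
    long maxFlushedSeqId = -1;                       // sequence id recorded in the newest hfile
    long nextSeqId = 1;

    void put() {
        long id = nextSeqId++;
        wal.add(id);
        memstore.add(id);
    }

    /** Buggy flush: reuses a leftover snapshot but records the log's latest sequence id. */
    void flush(boolean failWhilePersisting) {
        if (snapshot.isEmpty()) {
            snapshot = new ArrayList<>(memstore);
            memstore.clear();
        }
        long flushSeqId = nextSeqId - 1; // latest sequence id in the log
        if (failWhilePersisting) {
            // Stand-in for DroppedSnapshotException; the snapshot is left uncleared.
            throw new IllegalStateException("simulated snapshot persistence failure");
        }
        persisted.addAll(snapshot);
        snapshot = new ArrayList<>();
        maxFlushedSeqId = flushSeqId;
    }

    /** Replay after a crash skips edits with sequence id <= maxFlushedSeqId. */
    List<Long> lostAfterCrash() {
        List<Long> recovered = new ArrayList<>(persisted);
        for (long id : wal) {
            if (id > maxFlushedSeqId) {
                recovered.add(id);
            }
        }
        List<Long> lost = new ArrayList<>(wal);
        lost.removeAll(recovered);
        return lost;
    }

    static List<Long> demo() {
        FlushLossSketch store = new FlushLossSketch();
        store.put();
        store.put();                      // edits 1 and 2
        try {
            store.flush(true);            // step 2: user-triggered flush fails
        } catch (IllegalStateException ignored) {
            // step 3: the failure is only reported to the caller
        }
        store.put();
        store.put();                      // step 4: edits 3 and 4 arrive
        store.flush(false);               // step 5: second flush persists the old snapshot
        return store.lostAfterCrash();    // steps 6 and 7: crash, then replay
    }
}
```

      In this model, demo() returns the edits written in step 4: they were never persisted, yet replay skips them because the recorded sequence id already covers them.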

      For the second flush, we also can't discard the leftover snapshot and make a new one from the current memstore, because then the data in the leftover snapshot would be lost. We should abort the regionserver immediately and rely on failover to replay the log for data safety.

      DroppedSnapshotException is correctly handled in MemStoreFlusher for internally triggered flushes (those generated by flush-size / rollWriter / periodicFlusher). But a user-triggered flush is processed directly by HRegionServer->HRegion without putting a flush entry into flushQueue, and hence is not handled by MemStoreFlusher.
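
      A minimal sketch of the proposed behaviour, assuming the user-triggered path can catch the exception and abort the server the same way the internal path does. Every type here (DroppedSnapshotSketchException, FlushableRegionSketch, UserFlushHandlerSketch) is a hypothetical stand-in defined so the example compiles on its own; it does not reproduce the attached patch or the real HRegionServer API.

```java
import java.io.IOException;

/** Stand-in for HBase's DroppedSnapshotException so the sketch is self-contained. */
class DroppedSnapshotSketchException extends IOException {
    DroppedSnapshotSketchException(String message) {
        super(message);
    }
}

/** Hypothetical region abstraction: a flush may fail while persisting the snapshot. */
interface FlushableRegionSketch {
    void flushcache() throws IOException;
}

/** Hypothetical handler for a flush request coming from a client or the shell. */
class UserFlushHandlerSketch {
    boolean aborted = false;
    String abortReason = null;

    /** In a real server this would shut the process down; here it only records the call. */
    void abort(String reason, Throwable cause) {
        aborted = true;
        abortReason = reason;
    }

    void flushRegion(FlushableRegionSketch region) throws IOException {
        try {
            region.flushcache();
        } catch (DroppedSnapshotSketchException e) {
            // The leftover snapshot can no longer be trusted: abort so that
            // failover replays the log instead of letting a later flush
            // record a misleading sequence id.
            abort("Replay of log required; forcing server shutdown", e);
            throw e; // the client still sees the failure
        }
    }
}
```

      Other IOExceptions pass through untouched in this sketch; only the dropped-snapshot case forces an abort, since that is the case that leaves a stale snapshot behind.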

        Activity

        Enis Soztutar added a comment -

        Closing this issue after 0.99.0 release.

        Hudson added a comment -

        FAILURE: Integrated in HBase-TRUNK-on-Hadoop-1.1 #99 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-1.1/99/)
        HBASE-10556 Possible data loss due to non-handled DroppedSnapshotException for user-triggered flush from client/shell (stack: rev 1571501)

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        Hudson added a comment -

        FAILURE: Integrated in hbase-0.96-hadoop2 #215 (See https://builds.apache.org/job/hbase-0.96-hadoop2/215/)
        HBASE-10556 Possible data loss due to non-handled DroppedSnapshotException for user-triggered flush from client/shell (stack: rev 1571503)

        • /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        Hudson added a comment -

        FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #170 (See https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/170/)
        HBASE-10556 Possible data loss due to non-handled DroppedSnapshotException for user-triggered flush from client/shell (stack: rev 1571502)

        • /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        Hudson added a comment -

        FAILURE: Integrated in hbase-0.96 #312 (See https://builds.apache.org/job/hbase-0.96/312/)
        HBASE-10556 Possible data loss due to non-handled DroppedSnapshotException for user-triggered flush from client/shell (stack: rev 1571503)

        • /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-0.98 #182 (See https://builds.apache.org/job/HBase-0.98/182/)
        HBASE-10556 Possible data loss due to non-handled DroppedSnapshotException for user-triggered flush from client/shell (stack: rev 1571502)

        • /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-TRUNK #4950 (See https://builds.apache.org/job/HBase-TRUNK/4950/)
        HBASE-10556 Possible data loss due to non-handled DroppedSnapshotException for user-triggered flush from client/shell (stack: rev 1571501)

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        stack added a comment -

        Committed to 0.96-0.99. Thanks for the patch Honghua Feng

        Honghua Feng added a comment -

        Ping again...

        Honghua Feng added a comment -

        Thanks Ted Yu for the review, and ping Lars Hofhansl, stack and Andrew Purtell for review and another +1, thanks

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12629332/HBASE-10556-trunk_v1.patch
        against trunk revision .
        ATTACHMENT ID: 12629332

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

        +1 hadoop1.1. The patch compiles against the hadoop 1.1 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        +1 site. The mvn site goal succeeds with this patch.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/8729//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8729//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8729//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8729//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8729//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8729//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8729//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8729//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8729//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8729//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/8729//console

        This message is automatically generated.

        Ted Yu added a comment -

        +1 if tests pass.

        Honghua Feng added a comment -

        This bug should exist for all branches

        Honghua Feng added a comment -

        patch attached


          People

          • Assignee:
            Honghua Feng
            Reporter:
            Honghua Feng
          • Votes:
            0
            Watchers:
            9
