HBase / HBASE-6050

HLogSplitter renaming recovered.edits races with CJ removing the parent directory, making HBCK think the cluster is inconsistent.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.92.1, 0.94.0
    • Fix Version/s: 0.94.1, 0.95.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The scenario is as follows:
      -> A region is being split.
      -> The master has not yet processed the split.
      -> The region server goes down.
      -> The split log manager starts splitting the logs and creates recovered.edits in the splitlog path.
      -> CJ (CatalogJanitor) starts, deletes the entry from META, and completes the deletion of the region dir.
      -> In HLogSplitter, as the final step, we rename recovered.edits so that it comes under the regiondir.
      There, if the regiondir does not exist, we create it and then add the recovered.edits.

      Because of this, HBCK treats it as an orphan region: the regiondir exists but has no regioninfo.
      The cluster is actually fine, but the report is misleading.

              } else {
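                Path dstdir = dst.getParent();
                // mkdirs creates dstdir together with any missing parents, so a
                // region dir already deleted by CJ is recreated here.
                if (!fs.exists(dstdir)) {
                  if (!fs.mkdirs(dstdir)) LOG.warn("mkdir failed on " + dstdir);
                }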
              }
              fs.rename(src, dst);
              LOG.debug(" moved " + src + " => " + dst);
            } else {
              LOG.debug("Could not move recovered edits from " + src +
                  " as it doesn't exist");
            }
          }
          archiveLogs(null, corruptedLogs, processedLogs,
              oldLogDir, fs, conf);
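
The scenario above can be replayed on a local filesystem. The following is only a minimal sketch: it uses java.nio.file instead of the Hadoop FileSystem API, and the class name, directory layout, file names, and the `.regioninfo` check are assumptions made for illustration, not HBase or HBCK code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical demo class; it only mimics the sequence of filesystem operations.
class OrphanRegionDemo {

    /** Stand-in for the HBCK view: a region dir without .regioninfo looks orphaned. */
    static boolean looksOrphaned(Path regionDir) {
        return Files.isDirectory(regionDir)
            && !Files.exists(regionDir.resolve(".regioninfo"));
    }

    /** Replays the race in a temp directory and reports whether an orphan-looking dir remains. */
    static boolean replayRace() {
        try {
            Path root = Files.createTempDirectory("hbase-6050-demo");
            Path regionDir = root.resolve("table").resolve("parentregion");
            Files.createDirectories(regionDir);
            Files.createFile(regionDir.resolve(".regioninfo"));

            // Log splitting writes the edits under a separate splitlog path.
            Path src = root.resolve("splitlog").resolve("0000001");
            Files.createDirectories(src.getParent());
            Files.write(src, new byte[] {1, 2, 3});

            // CJ removes the already-split parent region directory.
            deleteRecursively(regionDir);

            // Final step of the splitter, as in the snippet above: create the
            // missing destination directory (with all parents), then rename.
            Path dst = regionDir.resolve("recovered.edits").resolve("0000001");
            if (!Files.exists(dst.getParent())) {
                Files.createDirectories(dst.getParent());
            }
            Files.move(src, dst);

            boolean orphaned = looksOrphaned(regionDir);
            deleteRecursively(root);
            return orphaned;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes a directory tree, children before parents. */
    static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) {
        System.out.println("orphan-looking region dir left behind: " + replayRace());
    }
}
```

The point of the sketch is the ordering: the directory creation runs after the delete, so the region directory comes back containing only recovered.edits.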
      
      1. HBASE-6050.patch
        0.9 kB
        ramkrishna.s.vasudevan

        Activity

        ramkrishna.s.vasudevan added a comment -

        Why are we trying to create the dstdir? What is the reason for it?
        Should the fix be applied here, or on the HBCK side so that it does not report an inconsistency?
        But if we make this change in HBCK, it is not clear how the recovered.edits file that was created would ever be deleted, because the master will never try to open this region.

        stack added a comment -

        Good one Ram.

        So, we are talking about the parent region?

        It does seem wrong that we would recreate a parent region dir in the distributed log splitter.

        How about we remove that dir creation code? I can see our making the recovered.edits dir because it won't always be there but creating all of its parent dirs is not right. My guess is that the mkdirs was done because it was just easier than verifying parent dir present.

        If parent dir not present, log the fact that there is no target region into which to put the edits and move on I'd say.
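
A guard along the lines suggested here might look like the sketch below. It is an illustration only: it uses java.nio.file rather than the Hadoop FileSystem API, the class and method names are hypothetical, and it assumes the layout <regiondir>/recovered.edits/<file>. It is not the committed patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; names and layout are assumptions for illustration.
class GuardedEditsMove {

    /**
     * Moves src to dst only if the region directory still exists.
     * Assumes dst = <regionDir>/recovered.edits/<file>.
     * Returns true if the file was moved, false if the move was skipped.
     */
    static boolean moveIfRegionExists(Path src, Path dst) {
        Path editsDir = dst.getParent();
        Path regionDir = editsDir.getParent();
        if (!Files.isDirectory(regionDir)) {
            // No target region: log it and move on, without recreating the dir.
            System.err.println("Could not move recovered edits from " + src
                + " to destination " + regionDir + " as it doesn't exist.");
            return false;
        }
        try {
            if (!Files.isDirectory(editsDir)) {
                // createDirectory makes a single level only, so it fails rather
                // than recreating the region dir if that vanished meanwhile.
                Files.createDirectory(editsDir);
            }
            Files.move(src, dst);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Creating only the recovered.edits level keeps the convenience of making that directory on demand while never bringing back a region directory that has been removed.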

        ramkrishna.s.vasudevan added a comment -

        "So, we are talking about the parent region?"

        Yes, it is the parent region.

        "If parent dir not present, log the fact that there is no target region into which to put the edits and move on I'd say"

        Yes, if the destination does not exist we can move on, and we will consider the log splitting process successful.
        But I think the file created in the splitlog folder by distributed log splitting will then never be cleared. I need to check the code on that. I will come up with a patch for this tomorrow.

        ramkrishna.s.vasudevan added a comment -

        Trunk patch. Please provide your comments.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12528231/HBASE-6050.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 33 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.replication.TestReplication
        org.apache.hadoop.hbase.replication.TestMultiSlaveReplication
        org.apache.hadoop.hbase.replication.TestMasterReplication

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1942//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1942//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1942//console

        This message is automatically generated.

        ramkrishna.s.vasudevan added a comment -

        The replication-related test cases have also been failing in the previous few QA builds,
        so this patch did not introduce those failures.

        ramkrishna.s.vasudevan added a comment -

        Please share your comments on this patch.

        ramkrishna.s.vasudevan added a comment -

        Please share your comments on this patch. If it is OK, I can prepare patches for the other versions as well.

        Ted Yu added a comment -

        Patch looks good.
        Minor:
        Please insert spaces around regionDir:

        +            " to destination " +regionDir+ " as it doesn't exist.");
        
        ramkrishna.s.vasudevan added a comment -

        Thanks Ted. Will prepare patches for 0.92 and 0.94 and commit them later today in the evening if there is no objection.

        Jonathan Hsieh added a comment -

        Just for clarification: these edits are actually replayed to the daughter regions, and these recovered.edits files are kept around for something (the CJ?) to eventually clean up?

        ramkrishna.s.vasudevan added a comment -

        @Jon
        In our case the split had completed, but the RS went down due to a ZK issue, which is why the master was not able to respond to the region split completion. Because the RS went down, recovered.edits creation came into play.
        Ideally CJ just cleans up the entire region directory, because the parent is in split state and offlined. Also, since the split completed in this case, we are sure that the data has been flushed to store files. The daughter regions have their own region directories.
        Did I answer your question?

        ramkrishna.s.vasudevan added a comment -

        I will commit this tomorrow morning.

        ramkrishna.s.vasudevan added a comment -

        Committed to trunk, 0.94 and 0.92.
        Thanks for the review, Ted and Jon.
        Thanks, Stack, for your idea.
        P.S. I also committed a small addendum for HBASE-6002 to 0.92 only, as both changes are in HLogSplitter.

        Hudson added a comment -

        Integrated in HBase-TRUNK #2925 (See https://builds.apache.org/job/HBase-TRUNK/2925/)
        HBASE-6050 HLogSplitter renaming recovered.edits and CJ removing the parent directory races, making the HBCK to think cluster is inconsistent. (Ram) (Revision 1342937)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java
        Hudson added a comment -

        Integrated in HBase-0.94 #221 (See https://builds.apache.org/job/HBase-0.94/221/)
        HBASE-6050 HLogSplitter renaming recovered.edits and CJ removing the parent directory races, making the HBCK to think cluster is inconsistent. (Ram) (Revision 1342934)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java
        Hudson added a comment -

        Integrated in HBase-0.92 #424 (See https://builds.apache.org/job/HBase-0.92/424/)
        HBASE-6050 HLogSplitter renaming recovered.edits and CJ removing the parent directory races, making the HBCK to think cluster is inconsistent. and a small addendum for HBASE-6002 (Ram) (Revision 1342935)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java
        Hudson added a comment -

        Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #18 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/18/)
        HBASE-6050 HLogSplitter renaming recovered.edits and CJ removing the parent directory races, making the HBCK to think cluster is inconsistent. (Ram) (Revision 1342937)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java
        Hudson added a comment -

        Integrated in HBase-0.94-security #33 (See https://builds.apache.org/job/HBase-0.94-security/33/)
        HBASE-6050 HLogSplitter renaming recovered.edits and CJ removing the parent directory races, making the HBCK to think cluster is inconsistent. (Ram) (Revision 1342934)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java
        Hudson added a comment -

        Integrated in HBase-0.92-security #109 (See https://builds.apache.org/job/HBase-0.92-security/109/)
        HBASE-6050 HLogSplitter renaming recovered.edits and CJ removing the parent directory races, making the HBCK to think cluster is inconsistent. and a small addendum for HBASE-6002 (Ram) (Revision 1342935)

        Result = SUCCESS
        ramkrishna :
        Files :

        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java

          People

          • Assignee: ramkrishna.s.vasudevan
          • Reporter: ramkrishna.s.vasudevan
          • Votes: 0
          • Watchers: 9
