Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.90.0
    • Fix Version/s: 0.90.0
    • Component/s: master
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

       The HLog.splitLog method got really long, complex, and hard to verify for correctness.
       I started to refactor it and also ported the changes from HBASE-2337 that deal with premature deletion of log files in case of errors. Further improvements will be possible; however, the scope of this issue is to clean up the code and make it behave correctly (i.e. not lose any edits).

      Added a suite of unit tests that might be ported to 0.20 as well.

       Added a setting (hbase.skip.errors - feel free to suggest a better name) that, when set to false, makes the process less tolerant of failures and corrupted files: if a log file is corrupted, or an error prevents the process from consistently splitting the log, the entire operation is aborted to avoid losing any edits. When hbase.skip.errors is on, any corrupted file is partially parsed and then moved to the corrupted-logs archive (see HBASE-2337).

       As in HBASE-2337, the splitLog method will first split all the logs and only then proceed to archive them. If a split log file (oldlogfile.log) left over from an earlier splitLog attempt is found in a region directory, it is deleted - this is safe since the original log files are not moved until the splitLog process completes.
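
       For illustration only, a minimal sketch of toggling the flag from code is shown below. The property name is the one proposed above (review comments below suggest renaming it), and the class name is made up:

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.hbase.HBaseConfiguration;

         public class SkipErrorsConfigSketch {
           public static void main(String[] args) {
             Configuration conf = HBaseConfiguration.create();
             // false = strict: if a log file is corrupted, or an error prevents a
             // consistent split, abort the whole splitLog operation rather than
             // risk dropping edits.
             // true  = tolerant: parse what can be parsed and move the corrupted
             // file to the corrupted logs archive.
             conf.setBoolean("hbase.skip.errors", false);
           }
         }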

      Attachments

    1. HBASE-2437-v5.patch
        65 kB
        Cosmin Lehene
      2. 2437-v4.patch
        49 kB
        stack
      3. 2437-v3.txt
        49 kB
        stack
      4. 2437-v2.txt
        61 kB
        stack
      5. 2437.txt
        49 kB
        stack
      6. HBASE-2437_for_HBase-0.21_with_unit_tests_for_HDFS-0.21.patch
        42 kB
        Cosmin Lehene


          Activity

          Nicolas Spiegelberg added a comment -

          Hey, late peer review, I know. I had a couple things I saw during the peer review that I would like to get resolution on...

          1. FSUtils.recoverFileLease() - does the InterruptedException handler need to set Thread.currentThread().interrupt()? What about users actually trying to kill the master?
          2. parseHLog() - the comment says that len==0 can still happen with append support? At least for 0.20, that's only if the file wasn't closed. However, we just did that immediately before in recoverFileLease. don't mind the code, but comment should change or be clarified
          3. writeEditsBatchToRegions() - same thing with InterruptedException. maybe want to have something higher up the call stack notice isInterrupted() and display a Log.info() message. At the very least, we definitely don't want to delete the log directory in splitLog() if the user interrupts these threads and skipErrors==true
          4. archiveLogs() - by default, we are archiving successfully split logs ala 'processedLogs'? I'm not sure we want to do that by default. I think people are mainly interested in problematic logs that couldn't survive the transition. Having this as an optional toggle is okay, but a naive user wouldn't know he has these trash items.
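
           A minimal sketch of the interrupt-status handling that points 1 and 3 above ask for; method names are illustrative, not the actual FSUtils/HLog code:

             // Instead of swallowing InterruptedException, restore the interrupt flag
             // so callers higher up the stack (e.g. a master being shut down) can
             // notice it and stop before destructive steps such as deleting the
             // log directory.
             static void sleepWhileRecoveringLease(long waitMs) {
               try {
                 Thread.sleep(waitMs);
               } catch (InterruptedException e) {
                 // Re-assert the interrupt so isInterrupted() checks upstream see it.
                 Thread.currentThread().interrupt();
               }
             }

             static boolean shouldAbortSplit() {
               // Checked by the caller before deleting the log directory.
               return Thread.currentThread().isInterrupted();
             }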

          stack added a comment -

          Committed. Thanks for the nice patch Cosmin.

          stack added a comment -

          OK, after talking to Cosmin, going to make an issue to deal with the zombie testLogCannotBeWrittenOnceParsed issue. Cosmin thinks it may be an hdfs issue (HBASE-2645). Lets not let it get in the way of this commit and carry on investigation elsewhere. For now I'll comment out the test.

          stack added a comment -

          @Cosmin: Here is how to add your guava dependency:

          pynchon-8:trunk stack$ svn diff pom.xml
          Index: pom.xml
          ===================================================================
          --- pom.xml     (revision 950141)
          +++ pom.xml     (working copy)
          @@ -448,6 +448,7 @@
               <slf4j.version>1.5.8</slf4j.version>
               <stax-api>1.0.1</stax-api>
               <thrift.version>0.2.0</thrift.version>
          +    <guava.version>r03</guava.version>
             </properties>
           
             <dependencyManagement>
          @@ -697,10 +698,15 @@
                 <version>${commons-math.version}</version>
                 <scope>test</scope>
               </dependency>
          -        <dependency>
          +     <dependency>
                 <groupId>org.apache.hadoop</groupId>
                 <artifactId>hadoop-test</artifactId>
               </dependency>
          +     <dependency>
          +       <groupId>com.google.guava</groupId>
          +       <artifactId>guava</artifactId>
          +       <version>${guava.version}</version>
          +    </dependency>
             </dependencies>
           
             <!--
          
          stack added a comment -

          @Cosmin I was trying to commit but testLogCannotBeWrittenOnceParsed fails for me every time too. Can you fix? This is only thing in way of a commit. I've made issues for whats still outstanding in the review over at review.hbase.org

          Cosmin Lehene added a comment -

           I've noticed testLogCannotBeWrittenOnceParsed sometimes fails. I reduced the "zombie" sleep time to 1 and it fails most of the time.
          It seems the "zombie" thread can write and sync (+syncFs) after the log recovery.

          Cosmin Lehene added a comment -

          Changed to reflect http://review.hbase.org/r/74/
           I left some TODOs; see the review comments.

          HBase Review Board added a comment -

          Message from: "Cosmin Lehene" <clehene@adobe.com>

          HBase Review Board added a comment -

          Message from: "Todd Lipcon" <todd@cloudera.com>

          Oh, it's much much improved! Thanks! I still think a "high level overview" would be good to see.

          why .regionserver.? I'd say just hbase.hlog.split.batch.size or something

          if this applies only to hlog splitting, maybe hbase.hlog.split.skip.errors

          Which do you mean by "writers" here? I'd support factoring this function out into an HdfsUtil class somewhere.

          I agree if it has no records (I think - do we syncfs after writing the sequencefile header?). But there's the case where inside SequenceFile we call create, but never actually write any bytes. This is still worth recovering.

          In general I think a corrupt tail means we should drop that record (incomplete written record) but not shut down. This is only true if it's the tail record, though.

          Can't find it now... does my above comment make sense?

          This seems really voodoo.. if anything we're probably masking a real bug by doing this. Can you write a unit test which shows this problem (even if it takes 30 minutes to run, would be good to have in our arsenal)

          The case where this happens is if you crash in the middle of appending a long edit. Consider the case where a single edit might have 1MB of data (large rows). We can easily crash in the middle of transferring it, before we call sync on the edit. In this case, the client never received an "ack" for the write, so we can feel free to throw it away (this isn't data loss, it's correct operation).
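
           A self-contained toy, not the patch code, illustrating the "drop only the unacked tail" idea, with plain length-prefixed records standing in for HLog entries:

             import java.io.*;
             import java.util.ArrayList;
             import java.util.List;

             // The last record is cut off mid-write, as if the writer crashed before
             // sync. Everything that parses is kept; EOF on the tail is treated as an
             // unacked partial write and dropped.
             public class TailDropSketch {
               public static void main(String[] args) throws IOException {
                 ByteArrayOutputStream buf = new ByteArrayOutputStream();
                 DataOutputStream out = new DataOutputStream(buf);
                 out.writeUTF("edit-1");
                 out.writeUTF("edit-2");
                 out.writeUTF("edit-3");
                 byte[] log = buf.toByteArray();
                 byte[] torn = new byte[log.length - 3];   // simulate a torn tail
                 System.arraycopy(log, 0, torn, 0, torn.length);

                 List<String> recovered = new ArrayList<String>();
                 DataInputStream in = new DataInputStream(new ByteArrayInputStream(torn));
                 try {
                   while (true) {
                     recovered.add(in.readUTF());
                   }
                 } catch (EOFException expectedAtTail) {
                   // The torn record was never acked, so dropping it is correct
                   // operation, not data loss. Corruption anywhere else should still
                   // be handled according to the skip-errors setting.
                 }
                 System.out.println(recovered);            // prints [edit-1, edit-2]
               }
             }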

          • Todd

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/74/#review40
          -----------------------------------------------------------

          On 2010-05-21 12:11:59, stack wrote:

          HBase Review Board added a comment -

          Message from: "Cosmin Lehene" <clehene@adobe.com>

           Correction: Todd suggested submitting a Callable<Void> to the executor thread and then doing a Future.get() and catch.

          changed it
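
           For context, the suggested pattern looks roughly like the sketch below (class name and task body are made up):

             import java.io.IOException;
             import java.util.ArrayList;
             import java.util.List;
             import java.util.concurrent.*;

             // Submit each writer task as a Callable<Void>; Future.get() rethrows any
             // exception thrown inside a worker, so a failed writer can no longer die
             // silently in the pool.
             public class WriterFuturesSketch {
               public static void main(String[] args) throws Exception {
                 ExecutorService pool = Executors.newFixedThreadPool(3);
                 List<Future<Void>> futures = new ArrayList<Future<Void>>();
                 for (int i = 0; i < 3; i++) {
                   futures.add(pool.submit(new Callable<Void>() {
                     public Void call() throws IOException {
                       // ... write one region's edits; throw IOException on failure ...
                       return null;
                     }
                   }));
                 }
                 pool.shutdown();
                 for (Future<Void> f : futures) {
                   try {
                     f.get();
                   } catch (ExecutionException e) {
                     throw new IOException("log writer failed", e.getCause());
                   }
                 }
               }
             }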

          • Cosmin

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/74/#review40
          -----------------------------------------------------------

          On 2010-05-21 12:11:59, stack wrote:

          HBase Review Board added a comment -

          Message from: "Cosmin Lehene" <clehene@adobe.com>

          Let's have a separate issue for that

          done - for both reader and writer

          done - both reader and writer impl

           The original version of this function is what prompted me to start refactoring in the first place. I'll add the description, but if it's still confusing it might need more work.

           I know, and I tried to get a better name when I created it. Can you suggest something better? I can't come up with a name that's short and descriptive enough.

          perhaps hbase.regionserver.hlog.splitlog.batch.size?

          done

          done

          fixed

          done

          Fixed

          done

          done

          sync() used to call syncFs(). It looks like HBASE-2544 changed things a bit, but it doesn't only add the SequenceFile sync marker.

          I added this after I've seen inconsistent results when running splitLog on bigger hlogs. Try copying a log from the cluster locally and run splitLog from the command line a few times without flushing it after each append. I used to get inconsistent results between runs and calling sync fixed it.

          There's this "//TODO: test the split of a large (lots of regions > 500 file). In my tests it seems without hflush" in the TestHLogSplit.

           We could do some testing to figure out why log entries would be lost when running locally.

          What would be a better way to flush the writer?

          done

          fixed

          I don't know what it's supposed to mean either

          fixed

          I'd like to be able to investigate the trailing garbage. I don't think this should ever happen (do you see any scenarios?). If it did we might lose data. We used to fix NameNode edits for fsImage by adding a missing byte to a corrupted entry.

          I'd like to reflect more on this, maybe see other opinions.

          I'd rather have these differences dealt at the lowest level (writers) and abstracted than spread across code.
          What do you think?

          done
          I'll need help setting guava as a maven dependency.

          createNewSplitter is fine.
          It's a Callable, changed to submit in order to check for the result in case one of the writers failed (see comment below)

           more aspects here:
           I think the reported size will be >0 after recovery, even if the file has no records. I was asking if we should add logic to check whether it's the last log.
           EOF for a non-zero-length file with a non-zero record count means the file is corrupted.

          see above comment

          what's the other JIRA? see my above comments.

          my previous comment got lost somehow.
          Todd suggested submitting a Callable<Void> to executor thread.
           I wonder if we could use getCompletedTaskCount. The documentation says it's an estimate; however, it's only an estimate during runtime and it seems to be correct after shutdown finishes (I looked in the source as well).

          Another option would be ExecutorCompletionService which seems to be suited for this kind of job.
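
           For comparison, a rough sketch of the ExecutorCompletionService option (task bodies are placeholders): take() blocks until the next task finishes, so no estimate such as getCompletedTaskCount() is needed.

             import java.util.concurrent.*;

             public class CompletionServiceSketch {
               public static void main(String[] args) throws Exception {
                 ExecutorService pool = Executors.newFixedThreadPool(3);
                 CompletionService<Void> writers = new ExecutorCompletionService<Void>(pool);
                 int submitted = 3;
                 for (int i = 0; i < submitted; i++) {
                   writers.submit(new Callable<Void>() {
                     public Void call() {
                       // ... split/write work for one log ...
                       return null;
                     }
                   });
                 }
                 for (int i = 0; i < submitted; i++) {
                   writers.take().get();   // waits for each completion, rethrows failures
                 }
                 pool.shutdown();
               }
             }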

          • Cosmin

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/74/#review40
          -----------------------------------------------------------

          On 2010-05-21 12:11:59, stack wrote:

          HBase Review Board added a comment -

          Message from: "Cosmin Lehene" <clehene@adobe.com>

          Working on it. Some of the stuff might need more input

          • Cosmin

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/74/#review33
          -----------------------------------------------------------

          On 2010-05-21 12:11:59, stack wrote:

          Cosmin Lehene added a comment -

          Thanks for review Todd!
          I started working on it. I'll post the changes/suggestions in the review

          Todd Lipcon added a comment -

           Review comments at http://review.hbase.org/r/74/#review40

          stack added a comment -

          OK. Posted to review.hbase.org.

          stack added a comment -

          All tests pass now. Also moved the split log to regionserver.wal package from master package where it seems to better belong. Posted this patch to review.hbase.org.

          stack added a comment -

          This changed it:

          r806746 | cdouglas | 2009-08-21 15:50:10 -0700 (Fri, 21 Aug 2009) | 4 lines
          
          HDFS-538. Per the contract elucidated in HADOOP-6201, throw
          FileNotFoundException from FileSystem::listStatus rather than returning
          null. Contributed by Jakob Homan.
          
          Todd Lipcon added a comment -

           fs.listStatus return null if file does not exist in 0.20 hdfs but throws FNFE in hadoop 0.21 (Need to verify this is indeed the case). This is going to cause fun when we go about the work making it so same hbase runs on both hadoop 0.20 and 0.21.

          If this is true it should be considered a bug in the 0.21 release, as the FileSystem API is supposed to be compatible between releases. Let me ping Tom.
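
           In the meantime, calling code could tolerate both behaviours with a small wrapper; FileSystem.listStatus and FileNotFoundException are the real APIs, the helper itself is hypothetical:

             import java.io.FileNotFoundException;
             import java.io.IOException;
             import org.apache.hadoop.fs.FileStatus;
             import org.apache.hadoop.fs.FileSystem;
             import org.apache.hadoop.fs.Path;

             public class FsCompat {
               // 0.20 returns null for a missing directory, 0.21 throws
               // FileNotFoundException; normalise both to an empty array.
               static FileStatus[] listStatusOrEmpty(FileSystem fs, Path dir) throws IOException {
                 try {
                   FileStatus[] files = fs.listStatus(dir);
                   return files == null ? new FileStatus[0] : files;
                 } catch (FileNotFoundException fnfe) {
                   return new FileStatus[0];
                 }
               }
             }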

          stack added a comment -

          This patch doesn't have the changes that were made by earlier patches adding startminidfscluster to HBaseTestingUtility. That was added as part of cleanup of HBaseTestingUtility over in HBASE-2587.

          This v3 is now failing just three of the ten or so new tests that Cosmin added. I'm going through them. In part, tests fail because hdfs-0.20 is different to hdfs-0.21 it seems; e.g. fs.listStatus return null if file does not exist in 0.20 hdfs but throws FNFE in hadoop 0.21 (Need to verify this is indeed the case). This is going to cause fun when we go about the work making it so same hbase runs on both hadoop 0.20 and 0.21.

           This patch adds some dirty reflection to allow setting a lower soft limit on file leases in the namenode - a method that is public in hadoop 0.21 but not accessible in hdfs 0.20 - that's necessary for a bunch of Cosmin's new tests.

          stack added a comment -

          Working on the split log tests. They depend on a feature that is in 0.21 that is not in 0.20 hdfs... being able to set namenode soft lease in unit tests. Still working on this.

          stack added a comment -

           Version of Cosmin's patch that will apply to TRUNK. It compiles, but that's all I've done just yet. I need to do a bit more review and make sure tests pass, then I'll put it up on review.hbase.org.

          Cosmin Lehene added a comment -

          Ouch, it seems a lot of stuff has moved in trunk. Can't apply patch either.

          Cosmin Lehene added a comment -

           Hey, my flight has just been canceled. I'll get back on the 19th - I'm in Chicago.
           This patch is functionally complete. It's been running on our cluster for a while, but it could use some more testing. I was just hoping to get the review and make the final changes. Moving it into its own class sounds right; I also thought about that. So I'd appreciate it if someone applied the patch (hope it still applies) and tried it.

          stack added a comment -

          If he's not back by morrow, I'll have a go at it. I've spent some time both in Cosmin's patch and in hlog as part of this forward port.

          Todd Lipcon added a comment -

          +1 on splitting into a new class. I'm adding some stuff into log splitting logic for HBASE-2231 as well, and it's just messy.

          Does anyone know when Cosmin's coming back? If he's not coming back soon, can we have someone finish off this patch for him this week?

          stack added a comment -

           Sounds good to me Clint. Maybe as a separate issue? I think Cosmin is off around the wilds of the mid-west at the moment on holidays. Hopefully he's not lost forever and we get him back soon.

          Clint Morgan added a comment -

          I've just been looking at HLog splitting for transactional hbase. I'd like to be able to reuse the same mechanism except the HLog keys change a bit (subclass them), and the log directories are different.

          To allow these extensions, I was factoring the log splitting into its own class (HLogSplitter). And making the methods instance rather than static. This has the additional benefit of isolating this rather tricky splitting concern from HLog, which I think makes it read better IMO.

          Does this sound reasonable? Would you like to do it as part of this patch? Otherwise, I'll wait until this is applied and propose this approach in another issue.
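
           Very roughly, the shape being proposed might look like the skeleton below; every name is a placeholder, not a committed class:

             import java.io.IOException;
             import org.apache.hadoop.conf.Configuration;
             import org.apache.hadoop.fs.FileSystem;
             import org.apache.hadoop.fs.Path;

             // Splitting moves from static methods on HLog to an instance, so a
             // transactional subclass can override the pieces that differ (key
             // parsing, directory layout) without copying the whole algorithm.
             public class HLogSplitterSketch {
               protected final Configuration conf;
               protected final FileSystem fs;

               public HLogSplitterSketch(Configuration conf, FileSystem fs) {
                 this.conf = conf;
                 this.fs = fs;
               }

               // Hook a subclass can override for a different directory layout.
               protected Path regionDirForEdit(Path rootDir, byte[] regionName) {
                 return new Path(rootDir, new String(regionName));
               }

               public void splitLog(Path srcDir, Path oldLogDir) throws IOException {
                 // 1. recover leases and read every log file
                 // 2. group edits per region and write them out
                 // 3. only then archive or delete the originals
               }
             }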

          Cosmin Lehene added a comment -

          I'm out of office. I'll get back on the 18th.

          stack added a comment -

           @Cosmin Any chance of an update? Now is the time to get this in w/ its fancy tests. Thanks.

          Alex Newman added a comment -

           I can't seem to get this to apply to trunk. Is there an update, or should I do it?

          stack added a comment -

           @Cosmin Patch is big. I need to take another look. I'd love to get this patch into 0.20.5. It's great.

          Cosmin Lehene added a comment -

          @Stack, thanks for the review!

          I'll incorporate the changes and answer the questions

          Regarding HBASE-2471. It doesn't check for the destination directory. I can incorporate it. The unit tests need to be changed to pre-create the region directories.

          stack added a comment -

          @Cosmin You seen HBASE-2471? Related?

          stack added a comment -

          Patch looks great.

          + Please change the name of this file as part of your refactoring "oldlogfile.log"
          + "logs are left in place is something goes wrong" .. should they be moved aside?
          + HBaseTestingUtility has been splintered into smaller pieces since you made this patch so these additions of yours fit well with that general direction.
          + I love all the tests. I like name of this thread: ZombieLastLogWriterRegionServer
          + How does this test, testLogCannotBeWrittenOnceParsed, work? The ZombieLastLogWriterRegionServer can only write one more edit at most but seems like splitlogs proceeds from the first through to the last as it currently does. Why couldn't the old Zombie writer add a bunch of edits to the last file while all other files are being split?

          Cosmin Lehene added a comment -

           The patch is not final (it's intended for trunk), but I'd appreciate a code review.

          some of the changes:

          • splitLog was refactored - the logic can be followed easier now
          • logs are left in place is something goes wrong
          • if split is interrupted, or crashes, the second split will start from zero (having all original log files), hence will delete any oldlogfile.log found under the regionserver if any.
          • protect from zombie HRS that writes some more to hlog after split started (using recoverLog)
          • protect from deleting a log file that was created by a zombie HRS after split has started.
          • skip.errors=true means whenever something goes wrong and might lose edits we abort leaving logs in place
          • skip.errors=false tolerate some errors: if a corrupted hlog file is encountered, read what you can and continue, then archive the corrupted log file.
          • deal with empty log files
          • etc.
          • added unit test for the above mentioned
          • unit test class has tools to generate log files, leave them open, corrupt them, etc.

          The unit (or rather integration) tests are designed for hdfs-0.21, but could be adapted with small changes.

           I initially did the refactoring trying to avoid the recoverLog method (which opens the file for append, then closes it, to make sure the file is closed) because it took too long to wait for the lease. However, if a regionserver that was considered dead (a zombie) keeps writing to those files, the only way to work around that so we won't lose edits is to make sure the file is closed. (Trying to rename the file before splitting it will still allow a writer thread to keep writing for a few seconds after the rename.) I created testLogCannotBeWrittenOnceParsed for this.
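
           The recoverLog trick described above (open the file for append, then close it, so the previous writer's lease is broken and a zombie can no longer add edits) is roughly the sketch below; the retry/backoff while the soft lease expires is omitted and the helper name is made up:

             import java.io.IOException;
             import org.apache.hadoop.fs.FSDataOutputStream;
             import org.apache.hadoop.fs.FileSystem;
             import org.apache.hadoop.fs.Path;

             public class LeaseRecoverySketch {
               // append() cannot succeed until the old writer's lease is recovered, so
               // once it returns and the stream is closed the file is safe to split.
               static void recoverLog(FileSystem fs, Path log) throws IOException {
                 FSDataOutputStream out = fs.append(log);
                 out.close();
               }
             }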

          In unit tests I set the lease period for a file to 100ms in the setUp method to avoid waiting 60 seconds in the unit tests.
          getDFSCluster().getNamesystem().leaseManager.setLeasePeriod(100, 50000);

          Apparently on hdfs-0.20 getNameSystem is not available.

          Also I use hflush() in unit test to write data to a log file and then leave it open. If not flushed and left open the changes might not be seen by the reader. hflush() could be avoided if the open file scenarios could be ignored.

          Cosmin Lehene added a comment -

           Todd, I'm sending the patch today. It still needs a little work, but the refactoring is done and could use a review, so we could coordinate.

          Todd Lipcon added a comment -

          How's this going, Cosmin? I'm itching to fix a couple bugs in error handling during log splitting and don't want to duplicate work.

          Cosmin Lehene added a comment -

          Andrew, I think HBASE-2337 addresses most problems already. I think we could reuse the unit tests I'm writing for 0.20 branch and fix if necessary.

          Cosmin Lehene added a comment -

           I haven't used recoverLog because it was failing in my unit tests. So I don't check the size at all and just catch EOF when opening the reader.

          stack added a comment -

          Does fix for HBASE-2442 belong in here Cosmin?

          Andrew Purtell added a comment -

           Cosmin, thanks for taking this on. We're aiming for 0.20.5 to be the branch's answer to data durability and correctness issues. So that starts with HDFS-200 and related issues, but all issues which affect the probability of data loss should be rolled in as well. What do you think about targeting your changes for that release also?

          Cosmin Lehene added a comment -

          This issue will incorporate some of the changes in HBASE-2337


            People

            • Assignee:
              Cosmin Lehene
              Reporter:
              Cosmin Lehene
            • Votes:
              0
              Watchers:
              4


                Time Tracking

                 Original Estimate: 120h
                 Remaining Estimate: 120h
                 Time Spent: Not Specified
