HBASE-5782: Edits can be appended out of seqid order since HBASE-4487

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.94.0
    • Fix Version/s: 0.94.0
    • Component/s: wal
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Create a table with 1,000 splits; after the region assignment, kill the regionserver which contains the META table.

      Here, a few regions are missing after the log splitting and region assignment. The HBCK report shows that multiple region holes were created.

      The same scenario was verified multiple times in 0.92.1 with no issues.

      Attachments

      1. hbase-5782.txt
        14 kB
        Todd Lipcon
      2. HBASE-5782.patch
        0.6 kB
        ramkrishna.s.vasudevan
      3. 5782-v3.txt
        4 kB
        Lars Hofhansl
      4. 5782-sketch.txt
        11 kB
        Todd Lipcon
      5. 5782-lars-v2.txt
        3 kB
        Lars Hofhansl
      6. 5782.unittest.txt
        1 kB
        stack
      7. 5782.unfinished-stack.txt
        6 kB
        stack
      8. 5782.txt
        3 kB
        Lars Hofhansl

        Issue Links

          Activity

          Hudson added a comment -

          Integrated in HBase-TRUNK-security #175 (See https://builds.apache.org/job/HBase-TRUNK-security/175/)
          HBASE-5782 Edits can be appended out of seqid order since HBASE-4487 (Revision 1327673)

          Result = FAILURE
          larsh :
          Files :

          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java
          Lars Hofhansl added a comment -

          Yes.

          Ted Yu added a comment -

          Do we need to log another issue for trunk which would finish Todd's work ?

          Hudson added a comment -

          Integrated in HBase-0.94 #129 (See https://builds.apache.org/job/HBase-0.94/129/)
          HBASE-5782 Edits can be appended out of seqid order since HBASE-4487 (Revision 1327672)

          Result = FAILURE
          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java
          Hudson added a comment -

          Integrated in HBase-TRUNK #2783 (See https://builds.apache.org/job/HBase-TRUNK/2783/)
          HBASE-5782 Edits can be appended out of seqid order since HBASE-4487 (Revision 1327673)

          Result = FAILURE
          larsh :
          Files :

          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java
          Hudson added a comment -

          Integrated in HBase-0.94-security #15 (See https://builds.apache.org/job/HBase-0.94-security/15/)
          HBASE-5782 Edits can be appended out of seqid order since HBASE-4487 (Revision 1327672)

          Result = SUCCESS
          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java
          stack added a comment -

          Sorry. Dumb. The tool calls System.exit. Let me fix it in another issue.

          Lars Hofhansl added a comment -

          Note that TestHLog still does not run anything locally (neither in 0.94 nor trunk); there seems to be something off with that specific test.

          Lars Hofhansl added a comment -

          Committed to 0.94 and 0.96 (including test)

          stack added a comment -

          You will need to pull in HLogPerformanceEvaluation. Copy it whole (don't just apply the HBASE-5792 patch, because it got modified a few times after commit). You could also just commit the unit test to trunk and not to 0.94; that should be fine as long as we hold to committing patches to trunk first.

          Lars Hofhansl added a comment -

          For some reason I cannot run TestHLog locally. I always get:

          Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
          

          I can run all other tests that I tried (including some others marked with @LargeTests). Verified again with HLogPerformanceEvaluation manually.
          Should we put HBASE-5792 in 0.94 as well, so that we use this test?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12523078/5782.unittest.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 6 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.wal.TestHLog

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1562//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1562//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1562//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12523074/5782-v3.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 6 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.master.TestSplitLogManager

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1560//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1560//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1560//console

          This message is automatically generated.

          stack added a comment -

          A unit test. Currently fails.

          Lars Hofhansl added a comment - edited

          This is what I would commit to 0.94 (plus a test that either Stack or I would write).

          Lars Hofhansl added a comment -

          I can work on a test, unless you'd like to, Stack, or maybe you have started on it anyway.

          Lars Hofhansl added a comment -

          Yes, we need a test for this. What scared me most about this bug was that no test caught it, and this is one of the core parts of HBase.

          This test would basically just call code from HLogPerformanceEvaluation, right?

          Ted Yu added a comment -

          +1 on such a test which would prevent regression.

          stack added a comment -

          Want me to make a test that runs a simple three threads with just a few edits ... say 1k ... and then verifies all are in order and all edits were written, so we notice regression?
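
          A minimal sketch of the shape such a test could take (illustrative only: WriteAheadLog is a stand-in interface here, not the actual HLog API, and the assertions mirror what -verify checks):

            import java.util.List;

            public class SeqidOrderingTest {
              /** Stand-in for the WAL under test; hypothetical, not HBase's HLog. */
              interface WriteAheadLog {
                long append(byte[] edit) throws Exception;     // returns the assigned seqid
                List<Long> readBackSeqids() throws Exception;  // seqids in file order
              }

              static void verifyOrdering(final WriteAheadLog log) throws Exception {
                final int THREADS = 3, EDITS = 1000;           // "three threads ... say 1k"
                Thread[] writers = new Thread[THREADS];
                for (int i = 0; i < THREADS; i++) {
                  writers[i] = new Thread(() -> {
                    for (int j = 0; j < EDITS; j++) {
                      try { log.append(new byte[10]); } catch (Exception e) { throw new RuntimeException(e); }
                    }
                  });
                  writers[i].start();
                }
                for (Thread t : writers) t.join();
                List<Long> seqids = log.readBackSeqids();
                if (seqids.size() != THREADS * EDITS) throw new AssertionError("lost edits");
                for (int i = 1; i < seqids.size(); i++) {      // strictly increasing == in order
                  if (seqids.get(i) <= seqids.get(i - 1)) throw new AssertionError("out of order at " + i);
                }
              }
            }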

          ramkrishna.s.vasudevan added a comment -

          I'm +1 on Lars' patch.

          Lars Hofhansl added a comment - edited

          I'm +1 on committing my patch for 0.94.0. We can then either revisit for 0.94.1 or 0.96.

          stack added a comment -

          @Lars As to your patch being 'slower' with fewer threads, I think you can't do such a comparison. W/o your patch, we are broken.

          stack added a comment -

          I tried to reproduce what JD is seeing on a cluster using the same sized keys and values, but Lars' patch completes before Todd's. My test run may be too small. I did thread dumps during the Lars and Todd runs. Both seem to be down in sync mostly, down here in 'org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.waitForAckedSeqno(DFSClient.java:3789)', otherwise hung up on sync points around WAL append/sync.

          Let's go w/ the Lars patch because of its minimal changes. As per Todd, let's file an issue to clean up this stuff with his patch as the seed. From J-D's work, any greased lightning we can apply around HLog append makes for a big difference in overall write throughput.

          Lars Hofhansl added a comment -

          Did some more tests with HLogPerformanceEvaluation (and my patch):

          10 threads, 100000 iterations: with patch: 42s, without patch: 41s
          5 threads, 200000 iterations: with patch: 46s, without patch: 44s
          2 threads, 200000 iterations: with patch: 46s, without patch: 44s

          So for fewer threads my patch is slightly slower.

          So... What do we do?

          Jean-Daniel Cryans added a comment -

          For my part I tested on a cluster with 1kb values, huge batches, and I don't see any improvement with Lars' patch. Todd's peaks 20% higher.

          Lars Hofhansl added a comment -

          Thanks Stack!

          I'll run a test with fewer threads too, just to make sure. The fact that both of our patches are faster probably means that we did a lot of unnecessary sync'ing before...? A 20% performance increase is pretty damn awesome.

          stack added a comment -

          OK on Lars' patch into 0.94.

          stack added a comment -

          I made the hlog perf tool work on HDFS and ran some basic tests. Both Todd's and Lars' patches seem faster than what we have currently.

          Running w/o a fix on HDFS w/ current trunk, I have to disable verify because it fails (verify happens after we print out the test timings).

          $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLogPerformanceEvaluation -conf /home/stack/hadoop-conf/core-site.xml -path hdfs://sv4r11s38:7000/tmp -threads 100 -roll 10000

          12/04/17 22:58:28 INFO wal.HLogPerformanceEvaluation: Summary: threads=100, iterations=10000 took 100.630s 9937.395ops/s
          12/04/17 23:00:33 INFO wal.HLogPerformanceEvaluation: Summary: threads=100, iterations=10000 took 94.945s 10532.413ops/s

          Todd patch on hdfs:

          $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLogPerformanceEvaluation -conf /home/stack/hadoop-conf/core-site.xml -path hdfs://sv4r11s38:7000/tmp -threads 100 -roll 10000 -verify

          12/04/17 22:53:35 INFO wal.HLogPerformanceEvaluation: Summary: threads=100, iterations=10000 took 81.202s 12314.967ops/s

          Lars patch:

          12/04/17 23:07:08 INFO wal.HLogPerformanceEvaluation: Summary: threads=100, iterations=10000 took 76.800s 13020.833ops/s

          For Todd and Lars, both pass verify which checks that seqids are ordered and that we wrote as much as we think we did.

          Todd Lipcon added a comment -

          I think for 0.94.0 we should commit Lars's, and file another JIRA to clean up / rewrite this stuff to be less circuitous. Could definitely do with a cleanup.

          Lars Hofhansl added a comment -

          @Stack: I ran it too. It works fine. The "interesting" thing I find is that it is faster with the patch! And that scares me.
          I ran with 100 threads, 10000 iterations. With the patch it took 29s, without it took 43s. This is on a 6-core machine with hyper-threading.

          Now, the HLogPerformanceEvaluation does not work with -nosync and -verify, right? (Presumably because no final sync is issued.)

          Wouldn't mind getting some other numbers from you as well if you have some time.

          stack added a comment -

          @Lars But it won't matter, right, since the map we are getting from is not under our new flush lock? I think it's harmless. We will undercount what's been flushed, I believe; we'll not overcount (and so possibly lose data)?

          I added log rolling and tested your patch using HLogPerformanceEvaluation. It 'works' at least. If you want me to compare before and after, just say.

          Lars Hofhansl added a comment -

          Ah, I see. Yes, the doneUpto line should be pulled into the synchronized block.
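
          For illustration, the shape of that change (a simplified fragment; unflushedEntries, flushLock, getPendingWrites, and writer follow the discussion but are not the actual HLog.java code):

            // Racy: doneUpto is read before taking the flush lock, so it may not
            // match the batch of pendingWrites this thread ends up flushing.
            long doneUpto = this.unflushedEntries.get();
            synchronized (flushLock) {
              for (Entry e : getPendingWrites()) {
                writer.append(e);
              }
            }

            // Fixed: read doneUpto under the same lock that drains pendingWrites,
            // so it matches what was actually appended; the slow sync itself
            // still happens outside the lock.
            long doneUpto;
            synchronized (flushLock) {
              doneUpto = this.unflushedEntries.get();
              for (Entry e : getPendingWrites()) {
                writer.append(e);
              }
            }
            writer.sync();
            this.syncedTillHere = Math.max(this.syncedTillHere, doneUpto);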

          stack added a comment -

          "We won't write more into the log (once we take the pendingWrites they are gone)"

          Is that so? We don't get the pendingWrites until we are under the flush lock, but we've taken doneUpTo before we go under the lock.

          Lars Hofhansl added a comment -

          @Stack: the doneUpTo is fine, because it is only used to set syncedTillHere. We won't write more into the log (once we take the pendingWrites they are gone), but we might sync too much unnecessarily.
          Will do the (performance) testing now. I don't have a cluster at my disposal atm, so I'll do it with a local HDFS instance.

          If we'd rather finish Todd's for 0.94 that'd be nice.

          stack added a comment -

          Can we try and make Todd's work? It does some nice cleanup.

          stack added a comment -

          Looking at Lars' patch.

          On your 1. and 2. above: apparently the append is also expensive, according to Dhruba. Just saying.

          Also on... "...might lead to a sync being issued multiple times when only one was necessary (it seems the same race condition existed before)."

          Yes, this we have always had.

          I'd say kill this stuff... it looks like rubbish to me:

          +      syncBatchSize.addAndGet(doneUpto - this.syncedTillHere);
          

          It's not read by anyone, the math looks like it can go wonky, and when it is read, it's set back to zero, which is probably unexpected. Kill it, I'd say.

          I think this is ok:

          +      this.syncedTillHere = Math.max(this.syncedTillHere, doneUpto);
          

          but this is racy

                 long doneUpto = this.unflushedEntries.get();
          

          It could be low in number; i.e., we could be putting more edits into HDFS than the current value of unflushedEntries if we read after an edit has been added to the queue but before the above is updated. Is that OK? It's OK if this is a little sloppy, especially if it under-reports?

          On tactics for 0.94: sure on doing this for 0.94, though I like Todd's fix better. The verification tool will help you figure out if this slows stuff much and if we are writing out of order. Let me know if you want me to run it for you. Let me add in log rolling too, as per Todd's suggestion.

          Ted Yu added a comment -

          HLogPerformanceEvaluation verification is needed as well.

          Todd Lipcon added a comment -

          Seems reasonable to me. Did yours pass tests, etc?

          Lars Hofhansl added a comment -

          Anybody opposed to doing my patch for 0.94 and a rewrite in trunk? Todd? Stack? Ted?

          Jean-Daniel Cryans added a comment -

          "The pendingWrites are appended strictly in order, so there is a very short race that might lead to a sync being issued multiple times when only one was necessary (it seems the same race condition existed before)."

          A short race is better than what we currently do in 0.92 (and before), where everything syncs everything.

          Lars Hofhansl added a comment -

          Yep. It's crucial that syncedTillHere is updated last.
          The pendingWrites are appended strictly in order, so there is a very short race that might lead to a sync being issued multiple times when only one was necessary (it seems the same race condition existed before).
          So I think it is safe and low risk.

          The question now is: do this for 0.94 and then a more elaborate rewrite in trunk, or do a riskier rewrite for 0.94?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12522975/5782-lars-v2.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1551//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1551//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1551//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Don't think you're smoking something. It looks reasonable, though I think there are still races around the updates of doneUpTo, syncedTillHere, etc. These races may be "safe" in that they only result in too-low values (and thus extra needless syncs), but I'm not 100% sure of it. Maybe we can extend the verification test case to run as a stress test, including rolls, and run it overnight or something?

          Lars Hofhansl added a comment -

          In the interest of keeping this simple, here's another simple patch.

          This patch assumes that:

          1. It is only the appends that need to happen in order.
          2. The sync is the time-consuming operation.

          So a special lock is held only to enforce the ordering of the appends (appends are single-threaded), and syncs can happen in parallel without this lock held; see the sketch below.

          This should give us the maximum concurrency possible, while keeping the change small and palatable for 0.94.

          Please let me know whether I am smoking something.
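
          A rough sketch of that locking discipline (a fragment with simplified, illustrative names; not the attached patch itself):

            private final Object appendLock = new Object();
            private final AtomicLong seqnum = new AtomicLong(0);
            private volatile long syncedTillHere = 0;

            public long append(Edit edit) throws IOException {
              synchronized (appendLock) {              // assumption 1: only appends must be
                long txid = seqnum.incrementAndGet();  // ordered, so only they are serialized
                writer.append(txid, edit);
                return txid;
              }
            }

            public void sync(long txid) throws IOException {
              if (txid <= syncedTillHere) return;      // someone already synced past us
              writer.sync();                           // assumption 2: the sync is the slow
                                                       // part, so it runs without appendLock
              synchronized (this) {                    // keep the high-water mark monotonic
                syncedTillHere = Math.max(syncedTillHere, txid);
              }
            }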

          Show
          Lars Hofhansl added a comment - In the interest of keeping this simple, here's another simple patch. This patch assumes that: It is only the appends that need to happen in order. The sync is the time consuming operation. So a special lock is only held for enforce the ordering of the appends (appends are single threaded) and syncs can happen in parallel without this lock held. This should give us the maximum concurrency possible, while keeping the change small and palletable for 0.94. Please let me know whether I am smoking something.
          Hide
          Todd Lipcon added a comment -

          Looks like some test timeouts... will investigate tomorrow.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12522913/hbase-5782.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1549//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1549//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1549//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          New rev - this one should even compile!

          I tested a few of the hlog tests locally and they seem to pass. Also ran the standalone hlog tool with a bunch of threads and -verify. Throughput is identical, and errors are gone.

          I'm nervous about error handling, though... seems like if the syncer thread had an error, another thread might wait forever.

          I'd really like to rip this apart and make it all componentized and unit-testable, but I tried to keep this patch minimal.

          stack added a comment -

          @Ram Read over the HLog comments. They've got stuff on why we want sequence ids in order and where we depend on their being ordered; at least, those are notes on how we used to think. I was wondering about ordering today too. If we didn't have to have order, then it would make stuff like running a regionserver with N WALs a bit easier, and we don't try to guarantee sequence order when replicating. But I'm wary of undoing order without our giving the issue a bunch of thought first (your patch above makes me nervous).

          On the patch, Todd's seems way superior to me. His is more radical, removing what seems to be a confusing sequenceid double, and it's more clear what's going on.

          Oh, and thanks to you fellas for finding this one. It's a good one.

          ramkrishna.s.vasudevan added a comment -

          @Lars
          Correct me if I am wrong.
          Currently the edits are ordered at least in memory, so that the store file that is persisted has the proper sequence number.
          But it is when applying them to the WAL that the order changes, and the same needs to be taken care of while replaying.
          Is there any special consideration that requires my edits to be ordered even in the WAL file? What is the design consideration behind this?
          Maybe I am missing something here.
          @Stack
          This patch is similar to the one I had in mind: synchronize getPendingWrites and the 'writer.append(e);'.

          Lars Hofhansl added a comment -

          @Ram: It looks like your patch should work for this scenario, but I'd be generally wary about edits not being in order in the WAL.

          ramkrishna.s.vasudevan added a comment -

          Thanks to all for working on this.
          @JD
          I agree that the LogSyncer thread is not the real problem; it can happen even without it.

          My initial thought was also to introduce some locks, but I finally decided the expected parallelism could not be obtained. That is why I went with the fix while replaying the recovered edits. I will go through the current patch also.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12522890/5782.unfinished-stack.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.wal.TestLogRollingNoCluster

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1548//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1548//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1548//console

          This message is automatically generated.

          stack added a comment -

          I started in on a patch, but I prefer Todd's. Mine just put the getting of pending writes and their appending together under the same sync block, but left the sync outside of the synchronization. Unfinished.

          stack added a comment -

          Not sure I follow, but I do know this patch is more ambitious than what I was at....

          + You remove the 'other' sequence numbering system, unflushedEntries? That looks good.
          + Are asserts on by default? We disabled them a while back, I believe? You run w/ asserts? (Yeah, that's a good thing to test; should you use your guava test instead?)
          + It's ugly that we call it hlogFlush but internally we do appends (that's not your change)
          + I agree that the reset of the pending writes linked list needs to be done under the synchronization held by hlogFlush
          + I like how you do pushback of edits if we fail... in hlogFlush.
          + On this thing:

          +      // TODO: restore metric syncBatchSize.addAndGet(doneUpto - this.syncedTillHere);
          

          It's not used anywhere and the math looked dodgy... then when you read it, it gets set to zero, so I'm not so sure it is of any use.

          Lars Hofhansl added a comment -

          The sketched patch looks good, making sure that we sync a batch of writes before the previous batch is done.

          I was thinking about something similar, but still allowing multiple threads to write, just that a thread cannot start writing before the previous batch is confirmed sync'ed. I guess we actually wouldn't get more parallelism out of that than with your approach.

          Todd Lipcon added a comment -

          Half-done patch of a proposed solution... got to run at the moment, but I will come back later with something more complete. The idea is to have only one "seqnum" in use, and make sure only a single thread pushes from the pending edits queue into a seqfile writer at a time; roughly the shape sketched below.
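
          Roughly, the shape being described (an illustrative fragment; Entry, Edit, and writer are simplified stand-ins, not the attached patch):

            private final AtomicLong seqnum = new AtomicLong(0);   // the one seqnum in use
            private final List<Entry> pendingWrites = new LinkedList<>();
            private final Object pushLock = new Object();

            long appendNoSync(Edit edit) {
              synchronized (pendingWrites) {              // seqid assignment and enqueue are
                long txid = seqnum.incrementAndGet();     // atomic, so queue order == seqid order
                pendingWrites.add(new Entry(txid, edit));
                return txid;
              }
            }

            void flushPending() throws IOException {
              synchronized (pushLock) {                   // a single thread pushes into the
                List<Entry> batch;                        // writer at a time; draining inside
                synchronized (pendingWrites) {            // pushLock means a later batch can
                  batch = new ArrayList<>(pendingWrites); // never jump ahead of an earlier one
                  pendingWrites.clear();
                }
                for (Entry e : batch) {
                  writer.append(e);
                }
              }
            }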

          Jean-Daniel Cryans added a comment -

          "It seems even without HLog.appendNoSync() it is possible that one thread flushes an entire batch of pending writes before a thread that started earlier can get to it."

          We didn't have pending writes before; they were inside the sequence file writer, and we append under a lock. Managing those pending writes is what's giving us trouble.

          Lars Hofhansl added a comment -

          Is this really only a problem because of HBASE-4528 and HBASE-4487?
          It seems even without HLog.appendNoSync() it is possible that one thread flushes an entire batch of pending writes before a thread that started earlier can get to it.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12522867/5782.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1545//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1545//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1545//console

          This message is automatically generated.

          Jean-Daniel Cryans added a comment -

          Changing the summary to something more relevant.

          Jean-Daniel Cryans added a comment -

          "Don't hate me, just throwing this out there."

          LOL

          My current concerns:

          • Performance, and I'm not the only one concerned.
          • OOME: what happens if you enable deferred log flush and HDFS is slowing down? That's actually an issue with HBASE-4487, since the LinkedList of pending writes is unbounded, whereas before, simply appending to the file would slow you down.
          • This patch will make it so that threads can sync data from threads that came in later. You need to check txid <= this.syncedTillHere again once you are past the synchronized(syncLock), and return if it was taken care of while you were blocking; see the sketch after this list.
          • Deadlocks: you never know when adding new locks.
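
          Illustratively, the re-check in the third point looks like this (a fragment using names from the discussion; flushPendingAndSync is a hypothetical helper, and this is not the actual patch):

            public void sync(long txid) throws IOException {
              if (txid <= this.syncedTillHere) return;   // already durable, nothing to do
              synchronized (syncLock) {
                // Re-check after acquiring the lock: a thread that held syncLock may
                // have synced our edits (and more) while we were blocked waiting.
                if (txid <= this.syncedTillHere) return;
                long doneUpto = flushPendingAndSync();   // hypothetical helper: append the
                                                         // pending writes, hflush, and return
                                                         // the highest txid made durable
                this.syncedTillHere = Math.max(this.syncedTillHere, doneUpto);
              }
            }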
          Ted Yu added a comment -

          Interesting patch.
          How do we measure / compare the following combinations:
          1. HLog.appendNoSync() used with one sync thread doing flush
          2. HLog.appendNoSync() not used, multiple sync threads doing flush in parallel

          stack added a comment -

          The test passes w/ your patch in place, Lars. Let me try a perf compare.

          Lars Hofhansl added a comment -

          Simple patch to ensure only one thread flushes the log.
          Don't hate me, just throwing this out there.

          Lars Hofhansl added a comment -

          The short-term choices we have are:

          1. Revert HBASE-4528, HBASE-4487, and HBASE-5541 (are there others?)
          2. Partially revert HBASE-2467 (or devise other ways to have strictly one thread flushing an HLog).

          Maybe Todd, as the author of HBASE-2467, could chime in... Todd?

          stack added a comment -

          I just committed a tool over on HBASE-5792. It tests WALs. If you pass the -verify flag, you'll see that even w/ just three threads, sequence ids are out of order. It could be useful for verifying whatever fix we have here.

          Ted Yu added a comment -

          HLog.appendNoSync() is used by HRegion.{append|doMiniBatchPut|mutateRowsWithLocks}. Those methods would be affected when HLog.appendNoSync() is removed.
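
          For context, a self-contained model of the appendNoSync pattern those methods rely on (simplified names, not actual HBase code): queue each edit without syncing, then let one sync call cover the whole batch.

          import java.util.ArrayList;
          import java.util.List;

          class BatchedLog {
            private final List<String> pendingWrites = new ArrayList<String>();
            private long lastTxid = 0;

            synchronized long appendNoSync(String edit) {
              pendingWrites.add(edit);
              return ++lastTxid;        // a txid is assigned per edit, in order
            }

            synchronized void sync(long txid) {
              // flush everything up to txid in one shot: one sync per batch
              // instead of one sync per edit
              pendingWrites.clear();
            }

            void doMiniBatchPut(List<String> edits) {
              long txid = 0;
              for (String e : edits) {
                txid = appendNoSync(e);
              }
              sync(txid);               // a single sync covers the whole mini-batch
            }
          }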

          Ted Yu added a comment -

          I'd prefer a solution that doesn't add a lock to patch something that's broken.

          I agree.

          I suggest the following actions:
          1. pull back HBASE-4487 in 0.94 and trunk
          2. a) spend major effort on HBASE-5699 (multiple WALs per region server)
          2. b) make HBASE-4487 semantically correct

          2.a and 2.b can proceed in parallel. But I think HBASE-5699 is the long-term solution.

          Jean-Daniel Cryans added a comment -

          In order to get 0.94.0 out the door, can we pull back HBASE-4487 in 0.94 and pursue the locking approach in trunk (or separate branch) ?

          +1, we might want to review HBASE-4282 too as it seems to do something similar with the transaction ids.

          So the problem is that logSyncerThread keeps the edits in order but the syncer then applies the pending batches potentially out of order?

          It's sad that the pending edits live in LogSyncer; that thread is really just supposed to call sync... but yes, they are added there in order and then it's a free-for-all in syncer. Adding a synchronized block there could solve the issue, but in the end what that does is move the locking from appending (pre-HBASE-4487) to syncing, plus add a ton of new complexity to HLog. I'd prefer a solution that doesn't add a lock to patch something that's broken.

          Lars Hofhansl added a comment -

          So the problem is that logSyncerThread keeps the edits in order but the syncer then applies the pending batches potentially out of order?

          We might just need a "sync lock" to prevent two threads from syncing at the same time. This is different from the update lock, which also prevents writing any edits.
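
          A minimal sketch of that two-lock idea, with a hypothetical stand-in class (not the actual HLog fields): the update lock keeps appends ordered, while a separate sync lock ensures only one thread drains and flushes pending writes at a time.

          class TwoLockWal {
            private final Object updateLock = new Object(); // guards seqid assignment + append order
            private final Object syncLock = new Object();   // guards drain + flush

            void append(byte[] edit) {
              synchronized (updateLock) {
                // assign the seqid and queue the edit, strictly in order
              }
            }

            void sync() {
              synchronized (syncLock) {
                // drain the pending writes and push them to HDFS as one ordered
                // batch, then hflush; no other thread can interleave its batch here
              }
            }
          }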

          Ted Yu added a comment -

          If this is right, then we should pull back HBASE-4487 or add more locks

          Adding more locks would take more time to validate / test.

          In order to get 0.94.0 out the door, can we pull back HBASE-4487 in 0.94 and pursue the locking approach in trunk (or separate branch) ?

          Jean-Daniel Cryans added a comment -

          Looking more into this, I think HBASE-4487 is the real issue. I think I can also prove that you can get the issue even with a disabled LogSyncer.

          t1 does appendNoSync of k1
          t1 does syncer up to getPendingWrites
          t2 does appendNoSync of k2
          t2 does syncer up to the end

          In the log you'd see k2 then k1, so what's really wrong to me is this:

          // Done in parallel for all writer threads, thanks to HDFS-895
          List<Entry> pending = logSyncerThread.getPendingWrites();
          

          Although accessing the pending writes is done under synchronization, they can be applied in any order.

          Furthermore, logSyncerThread.hlogFlush can also append entries to the WAL in any order. For example, if both t1 and t2 have multiple edits they could end up intermingled in the WAL simply by doing hlogFlush at the same time.
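
          A self-contained illustration of that interleaving (plain Java, not HBase code): two threads flushing their own ordered batches into a shared log without a common lock can produce a non-monotonic combined sequence.

          import java.util.List;
          import java.util.concurrent.CopyOnWriteArrayList;

          public class InterleavedFlushDemo {
            static final List<Long> wal = new CopyOnWriteArrayList<Long>();

            static void flushBatch(long[] seqIds) {
              for (long id : seqIds) {
                wal.add(id);        // no ordering is enforced across threads
                Thread.yield();     // encourage interleaving
              }
            }

            public static void main(String[] args) throws InterruptedException {
              Thread t1 = new Thread(new Runnable() {
                public void run() { flushBatch(new long[] {1, 2, 3}); }
              });
              Thread t2 = new Thread(new Runnable() {
                public void run() { flushBatch(new long[] {4, 5, 6}); }
              });
              t1.start(); t2.start();
              t1.join(); t2.join();
              System.out.println(wal); // e.g. [1, 4, 2, 5, 3, 6]
            }
          }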

          If LogSyncer were really the issue, then HRegion.put and HRegion.delete would need to be disabled too, since they don't use appendNoSync and just sync everything.

          How this used to work is that threads could only append to the WAL under the updateLock, and that happened at the same time as the doWrite which creates the key. The call to sync could be done by any number of threads at the same time.
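
          A minimal, self-contained model of that pre-HBASE-4487 guarantee (simplified names, not actual HBase code): seqid assignment and the append share one lock, so entries reach the writer in seqid order, while sync remains free to run from many threads.

          import java.util.ArrayList;
          import java.util.List;
          import java.util.concurrent.atomic.AtomicLong;

          class OrderedWal {
            private final Object updateLock = new Object();
            private final AtomicLong logSeqNum = new AtomicLong(0);
            private final List<Long> writer = new ArrayList<Long>(); // stands in for the WAL writer

            void append(String edit) {
              synchronized (updateLock) {
                long seqNum = logSeqNum.incrementAndGet(); // obtainSeqNum()
                writer.add(seqNum);                        // doWrite(...) under the same lock
              }
            }

            void sync() {
              // hflush to HDFS; safe to call from any number of threads (HDFS-895)
            }
          }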

          If this is right, then we should pull back HBASE-4487 or add more locks.

          We should also change this Jira's title once we get a better understanding of the problem because it's not a region assignment problem.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12522794/HBASE-5782.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.io.hfile.TestForceCacheImportantBlocks

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1537//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1537//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1537//console

          This message is automatically generated.

          Jean-Daniel Cryans added a comment -

          Based on one of our requirements, I was thinking that the log syncer thread should be configurable: either use it or not.

          As long as it's enabled by default I'm good with that.

          ramkrishna.s.vasudevan added a comment -

          This patch passes TestHLog and TestHRegion. I am not able to run the full test suite as I am at home.
          The idea here is that edits up to the seq id that has already been persisted can be safely ignored.
          Now, from this issue we are sure that the edits can appear in a different sequence, not necessarily in ascending order.

          So all the edits with a seq id less than the seq id recorded in the store files are skipped, and all other edits are considered. Please provide your comments.

          ramkrishna.s.vasudevan added a comment -

          @Zhihong
          Yes, HBASE-5699 can provide high write throughput. But even there, syncing has to be done for every WAL we write, and we should ensure that we do not sync every time while holding locks.
          So ideally this problem comes out of parallel syncs (which may happen even if we have > 1 WAL). Correct me if I am wrong.

          Ted Yu added a comment -

          One option is to not use appendNoSync() in doMiniBatchPut(). For high write throughput, I think HBASE-5699 should be the long-term solution.

          ramkrishna.s.vasudevan added a comment -

          We can easily reproduce this problem. I think it applies to any edit that is synced by the log syncer thread.
          So we must have some synchronization mechanism here to ensure this problem does not happen.

          Based on one of our requirements, I was thinking that the log syncer thread should be configurable: either use it or not.
          Please provide your suggestions.

          ramkrishna.s.vasudevan added a comment -

          The problem here is

          HLogKey logKey = makeKey(encodedRegionName, tableName, seqNum, now, clusterId);
          doWrite(info, logKey, edits, htd);
          

          We form a key here which is unique based on the seqNum. Since we do appendNoSync(), no sync is called in the append flow. But there is always the LogSyncer thread, which periodically syncs.
          All the edits are added to

          synchronized void append(Entry e) throws IOException {
                pendingWrites.add(e);
              }
          

          Here the pendingWrites are obtained and then flushed one by one.

          pending = logSyncerThread.getPendingWrites();
                  // First flush all the pending writes to HDFS. Then 
                  // issue the sync to HDFS. If sync is successful, then update
                  // syncedTillHere to indicate that transactions till this
                  // number has been successfully synced.
                  logSyncerThread.hlogFlush(tempWriter, pending);
          

          Here there is no synchronization, and hence the pending writes are not written in sequence. Because they are not in sequence, the recovered.edits file is also created out of sequence in some cases.

          if (key.getLogSeqNum() <= currentEditSeqId) {
            LOG.debug("******** key is " + key.getLogSeqNum()
                + "current edit is " + currentEditSeqId);
            skippedEdits++;
            continue;
          }
          

          I just added the above log line in replayRecoveredEdits, and I can see:

          ******** key is 175current edit is 176
          ******** key is 174current edit is 176
          ******** key is 186current edit is 187
          ******** key is 188current edit is 190
          ******* key is 189current edit is 190
          ******** key is 204current edit is 205
          ******** key is 206current edit is 207
          ******** key is 217current edit is 218
          ******** key is 216current edit is 218
          ******** key is 255current edit is 256
          

          This leads to data loss: for example, key 174 appears in the file after key 176, so it is skipped during replay even though it may never have been flushed to a store file. Correct me if I am wrong.

          ramkrishna.s.vasudevan added a comment -

          Found the problem and am working on a solution. In fact, this problem could easily have introduced data loss in a simple RS-going-down scenario.

          Lars Hofhansl added a comment -

          Thanks Ram!

          ramkrishna.s.vasudevan added a comment -

          @Lars
          Tomorrow I can try writing a test with a similar scenario.

          (Although that might be difficult)

          I think it should be easier than that. Hopefully I can come up with some initial analysis tomorrow.

          Lars Hofhansl added a comment -

          Assigning to 0.94.0 for now

          Lars Hofhansl added a comment -

          And is there anything interesting in the log files?

          Lars Hofhansl added a comment -

          You tried multiple times in 0.92.1, and no problem?
          Any chance to capture this in a test? (Although that might be difficult)

          Gopinathan A added a comment -

          Initially I also suspected that. I am able to reproduce the same issue after disabling WAL compression.

          Lars Hofhansl added a comment -

          Did you have HBASE-5778 enabled?

          ramkrishna.s.vasudevan added a comment -

          @Lars
          I was checking this issue. I think it has to be fixed before the RC? What do you say?
          Not sure of the root cause yet.


            People

            • Assignee:
              Lars Hofhansl
            • Reporter:
              Gopinathan A
            • Votes:
              0
            • Watchers:
              4
