Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.98.0, 0.96.1, 0.99.0
    • Component/s: regionserver
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Testing 0.96.1rc1.

      With one process incrementing a single column of a single row in a table, we flush or kill/kill -9 the hosting region server and data is lost. Flush and kill are likely the same problem (a kill triggers a flush); kill -9 may or may not have the same root cause.

      5 nodes
      hadoop 2.1.0 (a pre cdh5b1 hdfs).
      hbase 0.96.1 rc1

      Test: 250000 increments on a single row and single col with various numbers of client threads (IncrementBlaster). Verify we have a count of 250000 after the run (IncrementVerifier).
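      A minimal sketch of the kind of loop each IncrementBlaster thread drives (assuming the 0.96 client API; the table, family, and qualifier names here are placeholders, not the rig's actual ones):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.util.Bytes;

        public class SingleCellIncrement {
          public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "test");            // placeholder table name
            byte[] row = Bytes.toBytes("row");
            byte[] fam = Bytes.toBytes("f");
            byte[] qual = Bytes.toBytes("c");
            for (int i = 0; i < 3125; i++) {                    // 250000 increments / 80 client threads
              table.incrementColumnValue(row, fam, qual, 1L);   // atomic server-side increment of one cell
            }
            table.close();
          }
        }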

      Run 1: No fault injection. 5 runs. count = 250000 on all runs. Correctness verified. 1638 inc/s throughput.
      Run 2: flush of the table with the incrementing row. count = 246875 != 250000. Correctness failed. 1517 inc/s throughput.
      Run 3: kill of the rs hosting the incremented row. count = 243750 != 250000. Correctness failed. 1451 inc/s throughput.
      Run 4: one kill -9 of the rs hosting the incremented row. count = 246878 != 250000. Correctness failed. 1395 inc/s (including recovery).

      1. hbase-10079-addendum.patch
        0.5 kB
        Jonathan Hsieh
      2. hbase-10079.v2.patch
        3 kB
        Jonathan Hsieh
      3. 10079.v1.patch
        0.7 kB
        Nicolas Liochon

        Issue Links

          Activity

          Jonathan Hsieh added a comment -

          Here's a link to the test programs I used to pull out this bug. It needs to be polished and turned into an IT test as well as a perf test, probably in a separate issue.
          https://github.com/jmhsieh/hbase/tree/increval

          Jonathan Hsieh added a comment -

          In 0.96.0:

          • flush: Not able to reproduce data loss
          • with kill: Not able to reproduce data loss. Had an overcount of 1.
          • with kill -9: Not able to reproduce data loss. Had an overcount of 1.

          The overcount of 1 is likely a different bug that I think I'll let slide. Likely the client thought the operation failed and retried, but it actually made it to the log, and increments are not idempotent.

          So the bug is somewhere between 0.96.0 and 0.96.1rc1.

          Sergey Shelukhin added a comment -

          Does the writer check for exceptions? Can you try disabling nonces, to see if there could be collisions (although I would expect the client to receive exceptions in such cases)?

          Sergey Shelukhin added a comment -

          hbase.regionserver.nonces.enabled is the server config setting. Although, during replay, the updates are never blocked if nonces collide.

          Jonathan Hsieh added a comment - - edited

          Does the increval rig at the github link in the first comment check for exceptions? No, it was a quick and dirty program to duplicate a customer issue.

          I'm in the process of adding flushes to the TestAtomicOperation unit tests that Lars Hofhansl mentioned on the mailing list. I'll be able to bisect to find the bug that way.

          Jonathan Hsieh added a comment - - edited

          HBASE-6195 was the issue that fixed the problem in the 0.94 and 0.95 branches (at the time). It added a test to TestHRegion called testParallelIncrementWithMemStoreFlush that tests the situation.

          Jonathan Hsieh added a comment -

          TestHRegion#testParallelIncrementWithMemStoreFlush passes on the 0.96 tip. The test actually waits for all the increments to be done before flushing (instead of flushing while other increments are happening), so my bet is that it doesn't actually test the race condition.

          Jonathan Hsieh added a comment -

          I tweaked the test and wasn't able to duplicate the problem at the unit test level. I'm looking into reverting a few patches touching the memstore/flush area and testing on the cluster (HBASE-9963 and HBASE-10014 seem like candidates) to see if they caused the problem.

          Jonathan Hsieh added a comment -

          It seems like reverting either HBASE-9963 or HBASE-10014 gets rid of the "jagged" losses due to flushes. However, when testing on the tip of 0.96 with the reverts I seem to be losing some threads as they initialize, because of some sort of race.

          I'm going to try from the exact point where 0.96.1rc1 was cut to see if it is in a happy place, and will investigate the HTable initialization problem afterwards.

          Nicolas Liochon added a comment -

          I guess the error is in HBASE-9963. It seems there is an issue in HStore#StoreFlusherImpl#prepare: there is no lock there.

          Jonathan Hsieh added a comment -

          Actually, the current tip of 0.96 (HBASE-9485) doesn't seem to have the flush problem, but it does have the HTable initialization problem.

          Nicolas Liochon added a comment -

          That's strange. We should lock, and we don't do it in trunk or 0.96 head...

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12617058/10079.v1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          -1 site. The patch appears to cause mvn site goal to fail.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//console

          This message is automatically generated.

          stack added a comment -

          Patch is good. Nice work, Jon. Makes sense that this missing lock was exposed by hbase-9963. Pity we didn't catch it in tests previously. Any chance of a test?

          Sergey Shelukhin added a comment -

          +1

          Jonathan Hsieh added a comment -

          Here's the dropped "threads" stack dump – each one of these corresponds to a thread that didn't run.

          Exception in thread "Thread-58" java.lang.IllegalStateException: test was supposed to be in the cache
                  at org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:337)
                  at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:385)
                  at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:165)
                  at org.apache.hadoop.hbase.client.HTableFactory.createHTableInterface(HTableFactory.java:39)
                  at org.apache.hadoop.hbase.client.HTablePool.createHTable(HTablePool.java:271)
                  at org.apache.hadoop.hbase.client.HTablePool.findOrCreateTable(HTablePool.java:201)
                  at org.apache.hadoop.hbase.client.HTablePool.getTable(HTablePool.java:180)
                  at IncrementBlaster$1.run(IncrementBlaster.java:131)
          
          Jonathan Hsieh added a comment - - edited

          I'm having a hard time recreating the jagged counts. I tried reverting patches, and testing before and after the patch nkeywal provided. I think the flush problem was a red herring: I was biased by the customer problem I was recently working on.

          When I changed my tests to do 100000 increments the pattern I saw really jumped out. Looking at the original numbers from this morning I see the same pattern present with the 250000 increments.

          80 threads, 250000 increments == 3125 increments / thread.
          count = 246875 != 250000 (flush) // one thread failed to start.
          count = 243750 != 250000 (kill) // two threads failed to start.
          count = 246878 != 250000 (kill -9) // one thread failed to start, and 3 increments succeeded on the server but were retried because the client never got an ack due to the kill -9.

          The last one threw me off because it wasn't regular, but I think the explanation I have makes sense.

          I'm looking into seeing if my test code is bad (is there TableName documentation I ignored that says the race in the stacktrace is my fault?) or if we need to add some synchronization to this createTableNameIfNecessary method.

          Jonathan Hsieh added a comment -

          Hm.. HBASE-6580 deprecates HTablePool and happened when I wasn't paying attention.

          Jonathan Hsieh added a comment -

          Removed HTablePool code and still got a race.

          Exception in thread "Thread-1" java.lang.IllegalStateException: test was supposed to be in the cache
                  at org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:337)
                  at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:412)
                  at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:150)
                  at IncrementBlaster$1.run(IncrementBlaster.java:130)
          

          This table cache is the root cause of the race. The testing program has n threads that wait at a rendezvous point before creating independent HTable instances with the same name. It is unreasonable for separate HTable constructors that just happen to open the same table to race like this. The fix should be in the TableName cache.
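          A minimal sketch of that rendezvous pattern (names and counts here are illustrative assumptions; the real rig is the IncrementBlaster linked in the first comment):

            import java.util.concurrent.CountDownLatch;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.hbase.HBaseConfiguration;
            import org.apache.hadoop.hbase.client.HTable;

            public class TableNameRaceRepro {
              public static void main(String[] args) throws Exception {
                final Configuration conf = HBaseConfiguration.create();
                final CountDownLatch start = new CountDownLatch(1);
                for (int i = 0; i < 80; i++) {
                  new Thread(new Runnable() {
                    public void run() {
                      try {
                        start.await();                           // rendezvous: every thread is released at once
                        HTable table = new HTable(conf, "test"); // concurrent TableName.valueOf("test") lookups
                        // ... drive increments against the table here ...
                        table.close();
                      } catch (Exception e) {
                        e.printStackTrace();                     // unlike the quick-and-dirty rig, surface failures
                      }
                    }
                  }).start();
                }
                start.countDown();                               // fire the rendezvous
              }
            }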

          Jonathan Hsieh added a comment -

          Nicolas Liochon HBASE-9976 introduces the TableName cache which is the root cause.

          Andrew Purtell added a comment -

          Doesn't this need to be in 0.98.0 also?

          Lars Hofhansl added a comment - - edited

          Let's add some flushes to the Increment part of TestAtomicOperation. It should have found this issue.

          Edit: NM. Looks like this is not a flusher issue.

          Jonathan Hsieh added a comment - - edited

          Andrew Purtell Yes. Also added 0.99.

          Lars Hofhansl I'm pretty sure that TestAtomicOperation is good, and the other test mentioned in the comments covers the case that I thought had regressed.

          This is definitely a race in TableName's caching mechanism. Our friend Todd Lipcon is fairly certain it is a ByteBuffer usage problem – relative gets on BBs mutate them. I'm trying to come up with a trace that shows the race (I think it has to do with Bytes.equals(byte[], BB), since that method actually mutates the BB and then restores its state afterwards).
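          As a quick illustration of the relative-get behavior in question (plain java.nio, not HBase code):

            import java.nio.ByteBuffer;

            public class ConsumedBufferDemo {
              public static void main(String[] args) {
                ByteBuffer bb = ByteBuffer.wrap("test".getBytes());
                byte[] out = new byte[bb.remaining()];
                bb.get(out);                          // relative get: advances position to the limit
                System.out.println(bb.remaining());   // prints 0 -- the buffer is now "consumed"
                // Any later comparison that reads bb with relative gets sees no data unless the
                // position is reset, or a duplicate()/absolute get(int) is used instead.
              }
            }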

          Jonathan Hsieh added a comment - - edited

          This also jibes with it not showing up in 0.96.0 and showing up in 0.96.1rc1 – HBASE-9976 was committed recently, between the 0.96.0 release and the release candidate.

          Jonathan Hsieh added a comment -
            /**
             * Check that the object does not exist already. There are two reasons for creating the objects
             * only once:
             * 1) With 100K regions, the table names take ~20MB.
             * 2) Equals becomes much faster as it's resolved with a reference and an int comparison.
             */
          01  private static TableName createTableNameIfNecessary(ByteBuffer bns, ByteBuffer qns) {
          02    for (TableName tn : tableCache) {
          03      if (Bytes.equals(tn.getQualifier(), qns) && Bytes.equals(tn.getNamespace(), bns)) {
          04        return tn;
          05      }
          06    }
          07
          08    TableName newTable = new TableName(bns, qns);
          09    if (tableCache.add(newTable)) {  // Adds the specified element if it is not already present
          10      return newTable;
          11    } else {
          12      // Someone else added it. Let's find it.
          13      for (TableName tn : tableCache) {
          14        if (Bytes.equals(tn.getQualifier(), qns) && Bytes.equals(tn.getNamespace(), bns)) {
          15          return tn;
          16        }
          17      }
          18    }
          19
          20    throw new IllegalStateException(newTable + " was supposed to be in the cache");
          21  }
          

          Here's the race:

          We have two concurrent calls to createTableNameIfNecessary for the same namespace (which gets wrapped and becomes bns) and table qualifier (which gets wrapped and becomes qns) – ns=default and tn=test in my rig's case.

          Thread one executes to line 08. bns and qns are consumed by the gets in the TableName(BB, BB) constructor.
          Thread two executes to line 08. bns and qns are consumed by the gets in the TableName(BB, BB) constructor.
          Thread two gets true at line 09 and returns newTable at line 10.
          Thread one gets false at line 09, since thread two's TableName made it in, and continues executing at line 12.
          At line 14, thread one's first Bytes.equals call compares the byte[] tn.getQualifier() against qns (which is a consumed BB and thus has no more data to get). This will essentially always fail.
          Thread one loops through, falls out, and then throws the IllegalStateException.

          So anytime we get to line 14, we'll fail.

          The solution is to make sure the constructor dups bns and qns before extracting the byte[]s. Patch coming.
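          A rough sketch of the "dup before extracting" idea (illustration only; the committed change is in hbase-10079.v2.patch):

            // Inside a TableName(ByteBuffer namespace, ByteBuffer qualifier)-style constructor:
            ByteBuffer nsDup = namespace.duplicate();   // independent position/limit, shared content
            ByteBuffer qDup = qualifier.duplicate();
            byte[] ns = new byte[nsDup.remaining()];
            nsDup.get(ns);                              // consumes only the duplicate
            byte[] q = new byte[qDup.remaining()];
            qDup.get(q);
            // The caller's bns/qns keep their original positions, so the "someone else added it"
            // re-scan at line 14 can still compare them against the cached entries.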

          Jonathan Hsieh added a comment -

          v2 fixes the ByteBuffer usage bugs that cause the race.

          Jonathan Hsieh added a comment - - edited

          This was useful. Specifically the "don't consume the buffer" section was applicable.
          http://worldmodscode.wordpress.com/2012/12/14/the-java-bytebuffer-a-crash-course/

          stack added a comment -

          +1

          Above comment on the race made for a good read. Thanks.

          Jonathan Hsieh added a comment -

          Rig came back clean. Committing to 0.96/0.98/0.99. Not relevant to 0.94.

          Thanks for those who took a look.

          Andrew Purtell added a comment -

          Belated +1, and that test sounds like a great candidate for a new integration test.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12617114/hbase-10079.v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          -1 site. The patch appears to cause mvn site goal to fail.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/8061//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8061//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8061//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8061//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8061//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8061//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8061//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8061//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8061//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8061//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/8061//console

          This message is automatically generated.

          Hudson added a comment -

          SUCCESS: Integrated in HBase-TRUNK #4712 (See https://builds.apache.org/job/HBase-TRUNK/4712/)
          HBASE-10079 Race in TableName cache (jmhsieh: rev 1548020)

          • /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/TableName.java
          • /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java
          Hudson added a comment -

          SUCCESS: Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #864 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/864/)
          HBASE-10079 Race in TableName cache (jmhsieh: rev 1548020)

          • /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/TableName.java
          • /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java
          Nicolas Liochon added a comment -

          Thanks a lot for fixing my code, Jon. I created HBASE-10087 for the memstore part.

          Hudson added a comment -

          SUCCESS: Integrated in hbase-0.96-hadoop2 #141 (See https://builds.apache.org/job/hbase-0.96-hadoop2/141/)
          HBASE-10079 Race in TableName cache (jmhsieh: rev 1548018)

          • /hbase/branches/0.96/hbase-common/src/main/java/org/apache/hadoop/hbase/TableName.java
          • /hbase/branches/0.96/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java
          Hudson added a comment -

          SUCCESS: Integrated in hbase-0.96 #214 (See https://builds.apache.org/job/hbase-0.96/214/)
          HBASE-10079 Race in TableName cache (jmhsieh: rev 1548018)

          • /hbase/branches/0.96/hbase-common/src/main/java/org/apache/hadoop/hbase/TableName.java
          • /hbase/branches/0.96/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java
          Jonathan Hsieh added a comment -

          Addendum to fix the introduced javadoc violation.

          Nick Dimiduk added a comment -

          A little late to the party: nice one, Jonathan Hsieh.

          Todd Lipcon added a comment -

          Is Bytes.equals used with ByteBuffer arguments in any hot paths? You've added a new allocation here which may be costly if so. Perhaps using the version of get() which takes an index is better in that code path?
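          An illustration of the index-based alternative being suggested (a sketch, not the shipped code):

            // Compare a byte[] to a ByteBuffer without allocating or moving the buffer's position.
            static boolean equalsNoConsume(byte[] left, java.nio.ByteBuffer right) {
              if (left.length != right.remaining()) {
                return false;
              }
              int base = right.position();
              for (int i = 0; i < left.length; i++) {
                if (left[i] != right.get(base + i)) {   // absolute get(int) leaves position untouched
                  return false;
                }
              }
              return true;
            }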

          Nicolas Liochon added a comment -

          I had a lot of issues when I implemented this, which led me to over-optimize and then to create this bug. The new version creates fewer objects than the 0.96.0 one in the end, so I believe it's OK for 0.96.1. We will need more profiling sessions, unfortunately; I doubt it will appear on the radar soon...

          Jonathan Hsieh added a comment -

          It was intro'ed with the TableName cache and is only used there. It's also a stack-only allocation and should be JIT-able.



          Todd Lipcon added a comment -

          If only escape analysis actually worked

          Hudson added a comment -

          SUCCESS: Integrated in HBase-TRUNK #4713 (See https://builds.apache.org/job/HBase-TRUNK/4713/)
          HBASE-10079 ADDENDUM javadoc fix (jmhsieh: rev 1548228)

          • /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java
          Hudson added a comment -

          SUCCESS: Integrated in hbase-0.96 #215 (See https://builds.apache.org/job/hbase-0.96/215/)
          HBASE-10079 ADDENDUM javadoc fix (jmhsieh: rev 1548229)

          • /hbase/branches/0.96/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java
          Hudson added a comment -

          FAILURE: Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #865 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/865/)
          HBASE-10079 ADDENDUM javadoc fix (jmhsieh: rev 1548228)

          • /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java
          Hudson added a comment -

          SUCCESS: Integrated in hbase-0.96-hadoop2 #142 (See https://builds.apache.org/job/hbase-0.96-hadoop2/142/)
          HBASE-10079 ADDENDUM javadoc fix (jmhsieh: rev 1548229)

          • /hbase/branches/0.96/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java
          stack added a comment -

          Released in 0.96.1. Issue closed.

          Cosmin Lehene added a comment -

          Shouldn't the affects version be 0.96.0?

          Nicolas Liochon added a comment - - edited

          It does not affect 0.96.0. It was introduced during development of 0.96.1 and was fixed in that version before being delivered.

          Cosmin Lehene added a comment -

          Thanks [~liochon], I removed the affects version.


            People

            • Assignee:
              Jonathan Hsieh
            • Reporter:
              Jonathan Hsieh
            • Votes:
              0
            • Watchers:
              18

              Dates

              • Created:
                Updated:
                Resolved:

                Development