HBase
  HBASE-2248

Provide new non-copy mechanism to assure atomic reads in get and scan

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.3
    • Fix Version/s: 0.20.4
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      This patch changes the Get code path to instead be a Scan of one row. This means that inserting cells out of timestamp order should work now (tests to verify will follow as part of HBASE-2294), but also that a delete at an explicit timestamp now overshadows the affected cell EVEN if that cell is put after the delete (the old Get code path did an early-out, so subsequent puts would not see the delete).

      Description

      HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.

      After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans. Some of our data represents a time series. The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series. These are often scans of 0-100 rows at a time. To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete. Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted, which traverses the entire map of key values to copy it.
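
      For reference, a minimal standalone sketch (not HBase code; the map contents and sizes are illustrative) of why the per-scan copy is expensive: the ConcurrentSkipListMap copy constructor, like clone(), goes through buildFromSorted and rebuilds a node for every entry, so the cost of opening a scanner grows linearly with the number of KeyValues in the memstore.

      {code:java}
      import java.util.concurrent.ConcurrentSkipListMap;

      public class MemstoreCloneCost {
        public static void main(String[] args) {
          ConcurrentSkipListMap<Long, byte[]> kvset = new ConcurrentSkipListMap<Long, byte[]>();
          for (long i = 0; i < 1000000; i++) {
            kvset.put(i, new byte[0]); // stand-in for ~1M KeyValues in a loaded memstore
          }
          long start = System.nanoTime();
          // The copy constructor calls buildFromSorted internally: the values are shared,
          // but a new node is built for every entry, so the walk is O(n).
          ConcurrentSkipListMap<Long, byte[]> copy = new ConcurrentSkipListMap<Long, byte[]>(kvset);
          long elapsedMs = (System.nanoTime() - start) / 1000000;
          System.out.println("copied " + copy.size() + " entries in " + elapsedMs + " ms");
        }
      }
      {code}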

      1. ASF.LICENSE.NOT.GRANTED--HBASE-2248-no-row-locks.txt
        4 kB
        ryan rawson
      2. ASF.LICENSE.NOT.GRANTED--HBASE-2248-rr-final1.txt
        106 kB
        ryan rawson
      3. ASF.LICENSE.NOT.GRANTED--HBASE-2248-rr-pre-durability4.txt
        126 kB
        ryan rawson
      4. ASF.LICENSE.NOT.GRANTED--put_call_graph.png
        127 kB
        Andrew Purtell
      5. ASF.LICENSE.NOT.GRANTED--profile.png
        183 kB
        Andrew Purtell
      6. HBASE-2248-GetsAsScans3.patch
        190 kB
        stack
      7. hbase-2248.txt
        5 kB
        Todd Lipcon
      8. readownwrites-lost.2.patch
        3 kB
        Todd Lipcon
      9. readownwrites-lost.patch
        3 kB
        Todd Lipcon
      10. HBASE-2248.patch
        11 kB
        stack
      11. HBASE-2248-demonstrate-previous-impl-bugs.patch
        14 kB
        stack
      12. Screen shot 2010-02-23 at 10.33.38 AM.png
        70 kB
        stack
      13. hbase-2248.gc
        67 kB
        Dave Latham
      14. threads.txt
        12 kB
        Dave Latham

        Issue Links

          Activity

          Dave Latham added a comment -

          Here's some example threads from a dump.

          Dave Latham added a comment -

          After doing a flush on the table, the scans are about 100x faster.

          Dan Washusen added a comment -

           I notice the performance evaluation flushes the table after each test completes; as a result, none of the read tests take the memstore into account. Maybe the PerformanceEvaluation class could be changed to make the flush optional?

          Dan Washusen added a comment -

          HBASE-2249 will allow changes to the MemStore to be performance tested.

          Dan Washusen added a comment -

          K, with PerformanceEvaluation updates running "hbase org.apache.hadoop.hbase.PerformanceEvaluation --rows=1000 scanRange100 10" each scan takes on average 9ms to return a max of 100 rows (random data means they don't usually return 100 rows, average seemed to be around 70 rows).

           The setup for that test is as follows:
          1 master
          4 region servers (12GB heap)

          1 million rows set up using:
          hbase org.apache.hadoop.hbase.PerformanceEvaluation randomWrite 1

          There were four regions all on one host. Each region had roughly 40MB in the MemStore...

          Jean-Daniel Cryans added a comment -

          1 million rows set up using:

          With randomWrite you don't write 1M rows (more like ~700,000 IIRC) so that explains why your scans aren't always of 100 rows.

          Dan Washusen added a comment -

          @JD: that would explain it...

          With --nomapred (10 client threads in a single VM) each scan took 120-140ms...

           Also, in the randomSeekScan test each scan seems VERY slow. Each scan takes about 15 seconds...? The scanRange100 uses a startRow and stopRow to get 100 rows back (well, 70 rows). The randomSeekScan uses a "scan.setFilter(new WhileMatchFilter(new PageFilter(120)));". What's up with that?

          Oh, also, those tests are on the latest 0.20 branch (not on the 0.20.3 release)...

          Dan Washusen added a comment -

          Cloning the MemStore based on the scan.startRow and scan.stopRow drops the scan times from ~9ms per scan to ~3ms per scan on the above hardware...
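
           Presumably the experiment had roughly this shape (a sketch only, not the tested change; the byte[]-keyed map and the cloneRange helper are stand-ins): restrict the copy to the slice of the kvset that overlaps the scan range via ConcurrentSkipListMap#subMap, so only that slice gets rebuilt.

           {code:java}
           import java.util.concurrent.ConcurrentNavigableMap;
           import java.util.concurrent.ConcurrentSkipListMap;

           public class RangeRestrictedClone {
             // Copies only [startRow, stopRow) instead of the whole kvset. The walk is still
             // linear, but bounded by the scan range rather than the memstore size.
             // Assumes kvset was built with a byte[] comparator.
             static ConcurrentSkipListMap<byte[], byte[]> cloneRange(
                 ConcurrentSkipListMap<byte[], byte[]> kvset, byte[] startRow, byte[] stopRow) {
               ConcurrentNavigableMap<byte[], byte[]> slice = kvset.subMap(startRow, true, stopRow, false);
               ConcurrentSkipListMap<byte[], byte[]> copy =
                   new ConcurrentSkipListMap<byte[], byte[]>(kvset.comparator());
               copy.putAll(slice);
               return copy;
             }
           }
           {code}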

          Todd Lipcon added a comment -

          Can anyone shed light on why HBASE-2037 introduced this clone in the first place? Seems like a totally braindead thing for performance.

          ryan rawson added a comment -

          I had a look at the implementation of clone, and it is really not appropriate for what we are doing.

           I would like to open up discussions to revert the original patch. I would argue there have been too many lurking issues, and the additional functionality, while useful, doesn't justify crippling performance.

          Dan Washusen added a comment -

          @Todd: I didn't author the change but it relates to the tests added with the change.

           @Ryan: The tests added to PE as a result of HBASE-2249 seem to indicate that even with a fully loaded MemStore it takes 9ms to complete a scan for ~100 rows with 10 concurrent client VMs hitting a single region server. That seems to contradict the 1-2 seconds seen by Dave. The thread dump does seem to indicate the clone, but maybe something else is coming into play as well? Maybe the additional 4KB memory allocation is bringing GC into it?

          ryan rawson added a comment -

          could you please tell me where your 4k of memory quote is coming from?

           The clone() is a deep/shallow clone. The KeyValues aren't being cloned, but in every other way the clone is a deep clone - it copies all the nodes! That could be literally a million nodes! The number of nodes depends on your data size... a 64MB memstore can accommodate 1.3m values if your KeyValue size is ~50 bytes. Or even larger: if you start kicking in the memstore multiplier during a pending snapshot, you could have 4m+ nodes in a snapshot and an oversized kvset. Clone is not really viable; it needs to be rolled back. Furthermore it doesn't provide atomic protection anyway.

          Dan Washusen added a comment -

          Very good point...

           Even if the clone took the scan start and stop rows into account, there is still the possibility that only one or neither of them has been provided...

          Yoram Kulbak added a comment -

          Ryan:
          The 4K quote is my mistake, based on a non-typical HBASE usage (small memstore, large KVs).
           Cloning is definitely bad. Its only benefit is that it allows the scan to be isolated from on-going writes; HRegion#newScannerLock takes care of writes not coming in while the scanner is created, so 0.20.3, unlike 0.20.2, does provide protection from 'partial puts' if this was what you're implying by 'atomic protection'. There is also a test added to TestHRegion which verifies that.

          I'm not sure that rollback is a viable option:
           The 0.20.2 Memstore was using the ConcurrentSkipListMap#tailMap for every row. tailMap incurs an O(log n) overhead when called on a ConcurrentSkipListMap, so the total overhead of scanning the whole memstore may, in some cases, be very close to the overhead of a complete sort of the KVs in memstore.
          The 0.20.2 MemStore and MemStoreScanner are also functionally incorrect since

          • The scanner may observe a 'partial put' (not atomically protected)
          • The scanner scans incorrectly when a snapshot exists

           Since we observed a considerable 'single scan' performance improvement using the new MemStore implementation, could the performance hit stem from increased GC overhead on multiple concurrent scans?
           Note that with 0.20.2 we observed that MemStoreScanner is running slower than StoreFileScanner.

          Is it possible to avoid both 'partial puts' and cloning by 'timestamping' memstore records? e.g. each new KV in memstore gets a 'memstore timestamp' and when a scanner is created it grabs the current timestamp so that it knows to ignore KVs which entered the store after its creation? Should probably use a counter and not currentTimeMillis to ensure a clear-cut.

          ------------
          About those ~50 byte KVs, according to my calcs:
          KeyLength: 4 bytes
          ValueLength: 4 bytes
          rowLength: 2 bytes
          FamilyLength: 1 byte
          TimeStamp: 8 bytes
          Type: 1 byte

          There are 20 bytes of overhead to start with.
          Adding an average of 10 bytes for the column and qualifier brings it to 40 bytes.
          This leaves 10 bytes (out of 50) for the row + value. Meaning 80% of the storage is overhead.
          My point is that if ~50b KVs are the common use-case some optimization needs to be made to the way things are stored.
          Perhaps you meant 50b for row+value?
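
           As a quick cross-check of that arithmetic, a small illustrative helper (not HBase code; it only counts the serialized layout listed above and ignores Java object and skip-list node overhead):

           {code:java}
           public class KeyValueSizeEstimate {
             // Serialized layout: keylen(4) + vallen(4) + rowlen(2) + row + famlen(1)
             // + family + qualifier + timestamp(8) + type(1) + value
             static int serializedSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
               int fixedOverhead = 4 + 4 + 2 + 1 + 8 + 1; // the 20 bytes listed above
               return fixedOverhead + rowLen + familyLen + qualifierLen + valueLen;
             }

             public static void main(String[] args) {
               // e.g. a tall-narrow row: 12-byte row, 1-byte family, 1-byte qualifier, 3-byte value
               System.out.println(serializedSize(12, 1, 1, 3)); // prints 37; 20 of it is fixed overhead
             }
           }
           {code}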

          Dave Latham added a comment -

          Thanks, Dan, and others for looking into this issue. The table where we were seeing these slow scans was definitely a tall, narrow table. Each row has one cell, the column family and qualifier are each one byte. The row varies, but is typically 8-20 bytes, and the value is usually 4 bytes or less. Most common is probably row - 12 bytes, col fam - 1 byte, qualifier 1 byte, value - 3 bytes, giving 17 bytes plus overhead.

           As I was trying to understand the discrepancy between the PE results you mentioned and what I've observed, I looked into PerformanceEvaluation. It looks like the timer only starts after the scanner is constructed, which means that the MemStore clone isn't being timed as part of the test, so that would probably explain why the test seems fast. Just reasoning, it seems hard to believe that ConcurrentSkipListMap.buildFromSorted could complete a million iterations that fast.

          stack added a comment -

           bq. Can anyone shed light on why HBASE-2037 introduced this clone in the first place? Seems like a totally braindead thing for performance.

          Mea Culpa. I should have caught this in review, the non-scalable, expensive full-copy. Dumb.

           I also should have run PE to catch degradation in performance before release, though in this case, according to Dan, as PE is now, we'd not have caught the slowed-down memstore since we flush after each PE run and since the short-scan test is new with no history (Long time ago I wrote up a how-to-release: http://wiki.apache.org/hadoop/Hbase/HowToRelease. It says PE is required but I think I've not followed this recipe in a good while now).

           bq. The 0.20.2 Memstore was using the ConcurrentSkipListMap#tailMap for every row. tailMap incurs an O(log n) overhead when called on a ConcurrentSkipListMap so the total overhead of scanning the whole memstore in some cases, may be very close to the overhead of a complete sort of the KVs in memstore.

           In the old implementation, we used to also make a copy of a row, every time we called next, to protect against the case where the snapshot was removed out from under us.

           bq. The scanner scans incorrectly when a snapshot exists

          Why was this again?

           bq. ... increased GC overhead on multiple concurrent scans

          Dave, can you enable GC logging? Even if this is the case, it needs to be addressed.

           bq. Is it possible to avoid both 'partial puts' and cloning by 'timestamping' memstore records? e.g. each new KV in memstore gets a 'memstore timestamp' and when a scanner is created it grabs the current timestamp so that it knows to ignore KVs which entered the store after its creation? Should probably use a counter and not currentTimeMillis to ensure a clear-cut.

          How would we snapshot such a thing?

           We could add another ts/counter to KV. We could do an AND on the type setting a bit if extra ts is present. We then write out the KV as old style, dropping extra ts when we flush to hfile, or we just dump it all out. System would need to be able to work with old-style KVs. Comparator would be adjusted to accommodate new KV. We'd do a tailset each time we made a scanner? This would be a big change. We should probably bump rpc version and require a restart of hbase cluster on upgrade.

          Dave Latham added a comment -

          I've got gc logging enabled. Here's a snapshot of the regionserver for a few minutes during which I ran this test 5 or 6 times and generated 360 short scans. Let me know if there's any other GC info that would be useful.

          stack added a comment -

          There is a bunch of YG GC'ing going on... Might slow things some but not by much. I've attached a screen shot.

          Dan Washusen added a comment -

           @Dave: Could you have a look at HBASE-2249 and confirm that the call to HTable.getScanner(...) is now being timed?

          Dave Latham added a comment -

          @Dan: Took a read over the patch, though it seemed to be based in a different dir and didn't want to apply nicely. From what I can see the ScanTest still does getScanner in testSetup before the timer is begun. This may be fine, if the point of this test is to measure scan performance per-row and not setup/teardown time. It just explains why the ScanTest doesn't exhibit this issue. It does look like other tests, such as the RandomSeekScanTest and the new RandomScanWithRangeTest do test setup/teardown time as part of each "testRow" and should exhibit this issue, if run.

          ryan rawson added a comment -

           Done properly, a timestamp-oriented fix to version the memstore should not require any RPC version bump; it's all internal.

          Dan Washusen added a comment -

          @Dave:

          Correct you are. I've added comments on HBASE-2249 as a result of your comments here...

           It's worth noting that in the case of ScanTest the cost of setting up the ResultScanner is almost non-existent compared to the cost of scanning over the majority of the table. The ScanTest takes 23 seconds in total according to the log output (including opening the scanner etc).

          Dave, the numbers I posted above (9ms) were from the RandomScanWithRangeTest. As you mention, these tests include the cost of opening the scanner. I was under the impression that this was closer to your use case (e.g. specify both a scan.startRow and scan.stopRow which returns a small number of rows)...?

          Yoram Kulbak added a comment -

          I did the following sanity check: I rolled back memstore to just before HBASE-2037 was applied [last commit on 21 Oct 2009].
          [ To get things going I had to put back the MemStore#numKeyValues method and change the MemStore#clearSnapshot argument to SortedSet ]

          I then ran TestHRegion and two tests failed:

          • testFlushCacheWhileScanning - demonstrates the incorrect scans while a snapshot exists issue
          • testWritesWhileScanning - demonstrates 'partial puts' being visible to the scanner
            I also tried running TestMemStore but all the tests there have passed. I didn't try running the whole suite.

           It took me a while to figure out what exactly goes wrong when a snapshot exists. The short (and vague) explanation is that the scanner may return keys in a non-ordered manner, meaning a KV with a higher row may be returned before a KV with a lower row, because the result list which aggregates results from both snapshot and kvset doesn't guarantee the KVs are added in sorted order. I think there's a way to add a simple test to TestMemStore which will demonstrate that.

          stack added a comment -

           Patch that restores memstore to how it was. With this in place, run the memstore unit tests to see how the old implementation was broken.

          ryan rawson added a comment -

           I have a prototype implementation of how to fix the atomic read without using locking or copying. I'll put up a patch within a few days. It's a little subtle, but put simply it uses sequential "timestamps" to internally version the memstore so readers know to ignore half-written rows.

          stack added a comment -

          Here is an attempt. Tests pass. Posting for review. Need to do load tests yet.

          "- Added a (transient) int updateId to KeyValue

          • Memstore populates it on Adds and Deletes
           • When a MemstoreScanner is created it grabs the current id (actually increments it to make sure no KV has that same id) and ignores records from kvset having an id greater than the one grabbed. Snapshots are scanned in full since they're not updated during the scanner's lifetime, hence there's no risk of partial updates being visible. There may be an issue with deletes becoming partly visible in this scheme, I'll check that later."
          Todd Lipcon added a comment -

           There may be an issue with deletes becoming partly visible in this scheme

          I would think so - deletes in the memstore don't use tombstones, do they? Similarly for updates - if you update a row, its internal ts will update and the scanner will no longer see the old version either.

          ryan rawson added a comment -

          deletes use tombstones, but the current GET code might need... adjustment to make it work. I'm working on a base fix which I will post soon and I'll also check the get implementation.

          stack added a comment -

          Yeah, if client adds new edit w/ exact same ts and the comparator used by memstore does not take sequenceid into consideration, we'll have issues Todd identifies. Perhaps change the Comparator used by MemStore to consider sequenceid? Also missing from patch is enforcement of the fact that on flush, the flush file has deletes that apply to older files only – not to current flush file content.

          ryan rawson added a comment -

          I'm working on this, there is a general approach hammered out and code to be written.

          The approach is like so:

          • on read from memstore, for each row, we grab the 'read number' and ignore any keyvalues in the structure newer (ie: > value)
           • on put to hregion/memstore, we start a write 'tx' and get a write-number, and put keyvalues with said write-number. when we are finished, that write-number is 'committed' which causes the read number to be advanced most of the time. under concurrent writes we have a little queue and slower puts may slightly hold up puts that come before it.

          this will need to be extensively tested to see how the performance profile changes. it will allow us to remove the newScannerLock.
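
           A minimal sketch of the read-number/write-number scheme described in the list above (class and method names here are illustrative, not the actual patch): writers take monotonically increasing numbers, completion advances a read point in queue order, and a scanner skips any KeyValue tagged with a number above the read point it captured at creation.

           {code:java}
           import java.util.LinkedList;
           import java.util.concurrent.atomic.AtomicLong;

           public class MemstoreReadWriteNumbers {
             private final AtomicLong memstoreWrite = new AtomicLong(0); // last write number handed out
             private volatile long memstoreRead = 0;                     // highest fully committed write
             private final LinkedList<WriteEntry> writeQueue = new LinkedList<WriteEntry>();

             public static class WriteEntry {
               final long writeNumber;
               boolean completed = false;
               WriteEntry(long n) { this.writeNumber = n; }
             }

             /** Start of a put/delete: every KeyValue of this update is tagged with this number. */
             public synchronized WriteEntry beginMemstoreInsert() {
               WriteEntry e = new WriteEntry(memstoreWrite.incrementAndGet());
               writeQueue.add(e);
               return e;
             }

             /** Once the KeyValues are in the kvset, advance the read point past every leading,
              *  completed entry. A slow earlier writer holds back the read point for later ones. */
             public synchronized void completeMemstoreInsert(WriteEntry e) {
               e.completed = true;
               while (!writeQueue.isEmpty() && writeQueue.peek().completed) {
                 memstoreRead = writeQueue.poll().writeNumber;
               }
             }

             /** Captured once per scanner; the scanner ignores KVs with a larger write number. */
             public long readPoint() {
               return memstoreRead;
             }
           }
           {code}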

          ryan rawson added a comment -

          Ok here is my proposal to fix this, hopefully once and for all.

          The only thing that isn't covered is deletes:

           • removing keyvalues won't ever be atomic
          • we could stop deleting key values, but the get code would have to be checked
            • the flush would also need to prune out deleted key values to keep the delete invariant of 'get' going on.
          ryan rawson added a comment -

          my patch passes all the new tests added by HBASE-2037 which focus on parallelism while doing scans.

          Jonathan Gray added a comment -

          Might be time to turn gets into scans so we don't have a second read code path.

          Yoram Kulbak added a comment -

           Turning gets into scans will cause some minor functional changes. See for example the differences between gets and scans exposed in TestClient#testDeletes. IMHO eliminating the functional differences between gets and scans will be a change for the better, but perhaps there are existing users which rely on those subtle differences.

          Todd Lipcon added a comment -

           IMHO eliminating the functional differences between gets and scans will be a change for the better but perhaps there are existing users which rely on those subtle differences

           +1 for eliminating the differences. If people are relying on broken behavior, they should fix their applications. HBase is not 1.0; let's pick sanity over compatibility.

          Todd Lipcon added a comment -

          Hey Ryan

          I looked over this patch a bit this afternoon. It's clever but I think it can result in loss of read-your-own-writes consistency for a single client. Consider this scenario:

           Action                            | Read # | Write # | memstoreRead | memstoreWrite
           Client A begins a put on row R    |        | 1       | 0            | 1
           Client B begins a put on row S    |        | 2       | 0            | 2
           Client B finishes a put on row S  |        |         | 0            | 2
           Client B initiates a get on row S | 0      |         | 0            | 2

          So, since client A's put #1 is still ongoing on a separate row, client B is unable to read version #2 of its row.

          I think dropping consistency below read-your-own-writes is bad, even though it's rare that the above situation would occur. Under high throughput I think it's possible to occur, and it could be very very bad if people are relying on this level of consistency to implement transactions, etc.

          One possible solution is that completeMemstoreInsert can spin until memstoreRead >= e.getWriteNumber(). Given that it only has to wait for other concurrent writers to finish, a spin on memstoreRead.get() should only go a few cycles and actually be reasonably efficient.

          I'll think a bit about whether there are other possible solutions.
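
           A sketch of that suggestion, building on the read/write-number sketch a few comments up (still illustrative, not the committed patch): completeMemstoreInsert does not return until the read point has caught up to the caller's own write number, so a client whose put has been acknowledged can always read its own write.

           {code:java}
           // Variant of completeMemstoreInsert from the earlier sketch, with the spin-wait added.
           public void completeMemstoreInsert(WriteEntry e) {
             synchronized (this) {
               e.completed = true;
               while (!writeQueue.isEmpty() && writeQueue.peek().completed) {
                 memstoreRead = writeQueue.poll().writeNumber;
               }
             }
             // Spin until every earlier in-flight writer has finished and the read point
             // passes this write; it only waits on concurrent writers, typically a few cycles.
             while (memstoreRead < e.writeNumber) {
               Thread.yield();
             }
           }
           {code}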

          Todd Lipcon added a comment -

          Here's a test case patch (on top of yours) which should illustrate the issue. It fails every time for me on a dual core box:

          Didnt read own writes expected:<395> but was:<394>
          junit.framework.AssertionFailedError: Didnt read own writes expected:<395> but was:<394>
          at org.apache.hadoop.hbase.regionserver.TestMemStore$ReadOwnWritesTester.internalRun(TestMemStore.java:293)
          at org.apache.hadoop.hbase.regionserver.TestMemStore$ReadOwnWritesTester.run(TestMemStore.java:268)

          Todd Lipcon added a comment -

          Here's a slightly better test patch, much more sure to fail.

          (this test could easily be written without multiple threads, but as an illustration of the client's view of the consistency, the threads are useful)

          ryan rawson added a comment -

           I think your suggestion is a good one; the race condition is really small, and holding up a client for just a few more microseconds should be reasonable. Once we restructure to not put log appends between memstore puts, we are literally talking about the speed of adding a few dozen entries in an array. There is no data copy involved, since the KeyValue was already read in during RPC time, and we are talking about inserting small objects into a data structure.

          I originally thought of being speedy about returning, but read your own writes does make this be an issue. I'll add in your suggestions and put this test in as well.

          Thanks for the great test!

          Todd Lipcon added a comment -

          Here's a patch on top of Ryan's which implements the spin-wait. The concurrency test for read-own-writes now passes.

          stack added a comment -

           @Ryan, your next patch picks up Todd's spin-wait I believe?

          stack added a comment -

          Here is a different take for review and input on how to solve this issue.

           Get is now implemented using Scan. I deleted lots of get-related classes/code including the QueryMatcher. Deletes are no longer removing KVs from memstore. The change so that on flush we filter deleted KVs is not done in this patch – can be done in another issue. Maybe we don't want to filter deleted KVs on flush but rather on minor compactions, for instance (The axiom that a file holds only deletes that pertain to values held in storefiles that follow may not be necessary when gets are implemented using scan?).

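           Roughly, the new read path has this shape (a sketch only, not the patch itself; it leans on the Scan-from-Get constructor listed under Scan.java below, and omits locking and error handling):

           {code:java}
           // Inside an HRegion-like context; assumes the usual org.apache.hadoop.hbase imports
           // (Get, Scan, KeyValue, InternalScanner) plus java.io.IOException and java.util.*.
           public List<KeyValue> get(Get get) throws IOException {
             Scan scan = new Scan(get);            // same row, families, columns, time range, filter
             List<KeyValue> results = new ArrayList<KeyValue>();
             InternalScanner scanner = getScanner(scan);
             try {
               scanner.next(results);              // a single next() suffices: start row == stop row
             } finally {
               scanner.close();
             }
             return results;
           }
           {code}
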
          Things left to do:

          • Performance test
          • More accurate heap size calculation for HRegion
          • Discuss where/when deletes should be partially applied

          Here is more detail on what this change includes:

          M src/contrib/indexed/src/java/org/apache/hadoop/hbase/regionserver/IdxRegion.java
          minor tweak due to Memstore#getScanners signature change

          M src/java/org/apache/hadoop/hbase/HConstants.java
          Appended EMPTY_KEY_VALUE_UPDATE_ID to stand for an unset update id

          M src/java/org/apache/hadoop/hbase/KeyValue.java
          Added a transient int updateId + accessors + heap size adjustment
          Added a createLastOnRow method (similar to create first on row) and
          made sure the comparator treats this case symmetrically

          M src/java/org/apache/hadoop/hbase/client/Scan.java
          Added a constructor which accepts a Get and creates a matching scan +
          a convenience method isGetScan

          M src/java/org/apache/hadoop/hbase/regionserver/ColumnTracker.java
          Modified references to QueryMatcher to refer to ScanQueryMatcher

          D src/java/org/apache/hadoop/hbase/regionserver/DeleteCompare.java
          M src/java/org/apache/hadoop/hbase/regionserver/ExplicitColumnTracker.java
          QueryMatcher -> ScanQueryMatcher

          M src/java/org/apache/hadoop/hbase/regionserver/GetClosestRowBeforeTracker.java
          QueryMatcher -> ScanQueryMatcher

          D src/java/org/apache/hadoop/hbase/regionserver/GetDeleteTracker.java
          M src/java/org/apache/hadoop/hbase/regionserver/HRegion.java
           Added a member of type RegionUpdateTracker which is initialized both
           in the constructor and on every flush + heap size adjustment
          Scans are now paused while flushers prepare (e.g. snapshots are taken)
          Old gets replaced with new get implementation (which uses scans)
          #getClosestRowBefore is now using HRegion#get instead of Store#get
           #delete(Delete,Integer,boolean) no longer acquires a newScannerLock
           and also tracks update ids using the update tracker
           #delete(byte[],List,boolean) changed from protected to package-private since it's
           used as an internal HRegion subroutine and accessed a few times by
           tests. It also no longer acquires the update lock
           #put no longer acquires newScannerLock; also modified to track update ids
          RegionScanner stop-row logic was adjusted to support get scans. Also,
           RegionUpdateTracker#UpdateIdValidator is now acquired and passed down
          to store scanners

          M src/java/org/apache/hadoop/hbase/regionserver/KeyValueSkipListSet.java
          no longer Cloneable

          M src/java/org/apache/hadoop/hbase/regionserver/MemStore.java
          deleted lots of unneeded logic, mainly around deletes (very much
           simplified) and gets (no longer needed)

          M src/java/org/apache/hadoop/hbase/regionserver/MemStoreScanner.java
           Modified to consider UpdateIdValidator for kvset KeyValues. Snapshot
           KVs are reset to an undefined update id for Store#updateColumnValue to
           remain backward compatible

          D src/java/org/apache/hadoop/hbase/regionserver/QueryMatcher.java
          A src/java/org/apache/hadoop/hbase/regionserver/RegionUpdateTracker.java
           Tracks updates to HRegions. See javadoc.

          M src/java/org/apache/hadoop/hbase/regionserver/ScanDeleteTracker.java
          Fixed to throw an exception as comment suggests

          M src/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java
          Merged with the deleted QueryMatcher. Added a slight variant for get
          scans to use 'lastInRows'

          M src/java/org/apache/hadoop/hbase/regionserver/ScanWildcardColumnTracker.java
          QueryMatcher -> ScanQueryMatcher

          M src/java/org/apache/hadoop/hbase/regionserver/Store.java
          #getScanner now accepts an UpdateIdValidator

          #get deleted
          #updateColumnValue modified to use scans and not memstore#getWithCode
          (which was deleted)
          D src/java/org/apache/hadoop/hbase/regionserver/StoreFileGetScan.java
          M src/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
          QueryMatcher.MatchCode -> ScanQueryMatcher.MatchCode
          passing around of the UpdateIdValidator

          D src/java/org/apache/hadoop/hbase/regionserver/WildcardColumnTracker.java
          M src/test/org/apache/hadoop/hbase/TestKeyValue.java
          M src/test/org/apache/hadoop/hbase/client/TestClient.java
          M src/test/org/apache/hadoop/hbase/io/TestHeapSize.java
          D src/test/org/apache/hadoop/hbase/regionserver/TestDeleteCompare.java
          M src/test/org/apache/hadoop/hbase/regionserver/TestExplicitColumnTracker.java
          D src/test/org/apache/hadoop/hbase/regionserver/TestGetDeleteTracker.java
          M src/test/org/apache/hadoop/hbase/regionserver/TestHRegion.java
          M src/test/org/apache/hadoop/hbase/regionserver/TestMemStore.java
          D src/test/org/apache/hadoop/hbase/regionserver/TestQueryMatcher.java
          A src/test/org/apache/hadoop/hbase/regionserver/TestRegionUpdateTracker.java
          M src/test/org/apache/hadoop/hbase/regionserver/TestScanWildcardColumnTracker.java
          M src/test/org/apache/hadoop/hbase/regionserver/TestStore.java
          D src/test/org/apache/hadoop/hbase/regionserver/TestWildcardColumnTracker.java

          ryan rawson added a comment -

          Yes I have asked Todd and rolled up his patch. I have identified a small
          race condition in scanning today and I'll fix it soon and likely post on
          Monday.

          On Mar 12, 2010 12:25 AM, "stack (JIRA)" <jira@apache.org> wrote:

          [
          https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844367#action_12844367]

          stack commented on HBASE-2248:
          ------------------------------

          @Ryan, your next patch picks up Todd's spin-wait I believe?

          Todd Lipcon added a comment -

          I'm upgrading this to blocker - this patch fixes a ton of deadlock scenarios (see HBASE-2322)

          Todd Lipcon added a comment -

          Has this JIRA outgrown its scope? Should the reasonably small fix that Ryan and I did go in first, and then we can do the get->scan conversion separately?

          ryan rawson added a comment -

          Here is my patch to address the issue.
          Some notes:

          • we use a single forward pushing read point to accomplish atomicity. Right now with appends and memstore puts mixed in, this causes writes to appear to be serialized. There is no sync block which prevents multiple puts from being done at the same time though. When we restructure WAL append to happen in one go, and memstore put to happen after, this issue will go away.
          • this patch does atomicity at a multi-family level. This means that if a write is going across multiple families, and we do a concurrent scan (AND if a concurrent flush also happens) we will only read the previously completely written row. No partial multi-family row reads.
          • Deletes are also atomic. By no longer removing KeyValues from memstore we make it so. Also adjusted Gets to use 1 row Scans, and put in a TODO for a bloomfilter hook.
          • During a scan, we will ride over a snapshot and see the new data. We will also reset the read point between every row, so a scan will see new values as they are inserted after its current read point. If a row is updated after a scan has already seen it, it will of course not see that value. (A rough sketch of the read-point mechanics follows this list.)

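          A minimal sketch of the forward-pushing read point described above, assuming a simplified model rather than the patch's actual ReadWriteConsistencyControl (the names, the queue, and the busy-wait at the end are illustrative only):

          import java.util.LinkedList;
          import java.util.concurrent.atomic.AtomicLong;

          class ReadPointSketch {
            static class WriteEntry {
              final long writeNumber;
              volatile boolean completed = false;
              WriteEntry(long n) { this.writeNumber = n; }
            }

            private final AtomicLong memstoreRead = new AtomicLong(0);   // highest write visible to new scanners
            private long memstoreWrite = 0;                              // last assigned write number
            private final LinkedList<WriteEntry> writeQueue = new LinkedList<WriteEntry>();

            WriteEntry beginMemstoreInsert() {
              synchronized (writeQueue) {
                WriteEntry e = new WriteEntry(++memstoreWrite);
                writeQueue.add(e);
                return e;
              }
            }

            void completeMemstoreInsert(WriteEntry e) {
              synchronized (writeQueue) {
                e.completed = true;
                // Push the read point forward over the completed prefix of the queue.
                while (!writeQueue.isEmpty() && writeQueue.getFirst().completed) {
                  memstoreRead.set(writeQueue.removeFirst().writeNumber);
                }
              }
              // Spin until this write is visible to new readers (read-your-own-writes).
              while (memstoreRead.get() < e.writeNumber) {
                Thread.yield();
              }
            }

            long readPoint() {   // scanners ignore KeyValues tagged with a higher write number
              return memstoreRead.get();
            }
          }
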
          Some thanks:

          • Todd for pointing out a case where read-your-own-writes might not happen
          • Hints from the HBASE-2248-GetsAsScans patch for doing single row scans.
          Todd Lipcon added a comment -

          Hey Ryan.

          Will try to take a look at your patch early this week. Regarding the new atomicity properties, can you please comment in HBASE-2294? I want to make sure that, if we add these properties, (a) they are really user requirements and (b) we have documented them somewhere. If (a) isn't true, we should document that, in case we find efficiency improvements we can make by dropping some of them.

          ryan rawson added a comment -

          My patch doesn't add any new properties; it just makes the existing ones efficient and removes locks. I already enumerated a number of properties we have and would like to have in HBASE-2294.

          Andrew Purtell added a comment -

          Some feedback based on initial tests:

          • Heap sizes need to be adjusted for KeyValue (add a long), HRegion (add a reference), and MemStore (add a reference).
          • Indexed contrib is unhappy:
            • o.a.h.h.TestIdxHBaseCluster fails.
            • o.a.h.h.regionserver.TestHRegionWithIdxRegion OOMEs after 376 put iterations.

          The OOME is concerning, as it might be catching a memory leak introduced in the change set.

          I'm going to put it up on EC2 and throw PE at it tomorrow.

          Yoram Kulbak added a comment -

          bq. I'm going to put it up on EC2 and throw PE at it tomorrow.

          What are you going to compare it against?
          May I suggest comparing against both the branch without the patch and the branch with GetsAsScans3 applied?

          stack added a comment -

          Yeah, I just tried to run test suite and ran into at least the TestHeapSize failures.

          On a test up on cluster, something is up. It's not deadlocked but it's only making slow progress. Will look in the morning.

          On the patch:

          + "aka DNC" ... whats DNC? (Democratic National Committee?)
          + In KV, it has "+ * @deprecated" Usually deprecated points helpfully to what should be used instead. What should folks use instead of createFirstOnRow override?
          + 1 on this comment of yours " // TODO the family and qualifier should be compared separately"
          + So, on flush of the MemStore, we don't need to clean out items that MemStore Deletes effect? We now let go of the old axiom that Deletes in storefiles only apply to storefiles that follow and not to the current storefile?
          + I love all the stuff removed.

          More review later.

          What do we see as implications of removal of the special Get-code path?

          + Is it true that now, you can do inserts where timestamps are out of order? (If no deletes?) If so, don't we need unit tests to prove this assertion?
          + What about performance? Though the new Get-Scan does storefile accesses in parallel, if > N storefiles, if looking for latest version only, we'll be slower (at least until we add BFs?).

          ryan rawson added a comment -

          There is some profiling to be done to figure out what the problems might be. I think running some Unit tests under a profiler will be illuminating.

          DNC = do not care, it's a hardware engineering thing

          The problem with Get vs Scan is that the old code was not correct, so favoring faster incorrect code is something I thought we agreed we would not do. But yes, we no longer get that assumed performance improvement.

          Todd Lipcon added a comment -

          HBASE-2265 should really help with culling access to older regions (without the more complicated bloom filter solution)

          Andrew Purtell added a comment -

          Commit of HBASE-2283 invalidates the patch on this issue.

          ryan rawson added a comment -

          I'll update my patch to accommodate this commit.

          it doesn't "invalidate" the patch - there is still a window of opportunity to see partial row updates.

          Andrew Purtell added a comment -

          Yeah, ok, imprecise word choice. Thanks for rebasing.

          ryan rawson added a comment -

          I'll try to update this today or this weekend!

          stack added a comment -

          +1 on updating the patch. I just tried to do it and it's a little involved so I left it to the expert.

          I've been doing a rough benchmark of 0.20.2 hbase so I can measure roughly how this patch affects coarse performance (I didn't bother measuring 0.20.3; it must be slower than 0.20.2).

          ryan rawson added a comment -

          Here is my update to my patch, this time I am using iterators to scan the memstore and snapshot. There are a number of fixes to all sorts of fun race conditions, etc.

          The best news: this is the fastest memstore scanner HBase has seen. It is about 15x faster than the 0.20.3 version based on the microbenchmark included in the patch. The old code takes about 400-500ms to scan 250k KeyValues in memstore, and this new patch takes 25-30ms.

          I haven't run all the tests yet, but it passes the core TestMemStore and TestHRegion which contain all the hard tests that have concurrency.
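
          A rough, self-contained illustration of why an iterator-based memstore scan beats the old copy-then-scan approach (this is not the microbenchmark included in the patch; the entry count, value size, and any timings it prints are illustrative assumptions):

          import java.util.Map;
          import java.util.concurrent.ConcurrentSkipListMap;

          public class MemstoreScanBench {
            public static void main(String[] args) {
              ConcurrentSkipListMap<Long, byte[]> kvset = new ConcurrentSkipListMap<Long, byte[]>();
              for (long i = 0; i < 250000; i++) {
                kvset.put(i, new byte[32]);
              }

              // Old approach: copy the map up front (the SortedMap copy constructor walks
              // every entry, like buildFromSorted), then scan the private copy.
              long t0 = System.nanoTime();
              ConcurrentSkipListMap<Long, byte[]> copy = new ConcurrentSkipListMap<Long, byte[]>(kvset);
              long copied = 0;
              for (Map.Entry<Long, byte[]> e : copy.entrySet()) {
                copied++;
              }
              long t1 = System.nanoTime();

              // New approach: walk the live map with its weakly consistent iterator; a read
              // point (not modelled here) decides which entries each scanner may return.
              long iterated = 0;
              for (Map.Entry<Long, byte[]> e : kvset.entrySet()) {
                iterated++;
              }
              long t2 = System.nanoTime();

              System.out.println("copy+scan: " + (t1 - t0) / 1000000 + " ms (" + copied + " kvs), "
                  + "iterator scan: " + (t2 - t1) / 1000000 + " ms (" + iterated + " kvs)");
            }
          }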

          ryan rawson added a comment -

          a version that compiles and passes TestHeapSize

          stack added a comment -

          Can we have a version for the 0.20_pre_durability branch, Ryan? There are a bunch of failures, all in HRegion. Some I can sort of make sense of but others would take me a while to figure out. You know the code, so it would probably take you a short amount of time?

          stack added a comment -

          This is a big patch. Can we have a bit more detail than what is given above on what it does to help w/ review?

          Here's some comments so far:

          In KV, this looks like a fix to the comparator:

          +      if (rcolumnlength == 0 && rtype == Type.Minimum.getCode()) {
          +        return -1;
          +      }
          

          If we had a getScan datamember flag – true if this scan is a Get scan – we could set it if the constructor that takes a Get is invoked and avoid comparing start and end rows. If the flag is not set, go ahead and do the compare.
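
          Something along these lines, as a sketch of the suggestion rather than the committed code (class and method names here are placeholders):

          import java.util.Arrays;
          import java.util.Comparator;

          public class GetScanSketch {
            private final byte[] startRow;
            private final byte[] stopRow;
            private boolean getScan = false;   // set only by the Get-style constructor

            public GetScanSketch(byte[] startRow, byte[] stopRow) {
              this.startRow = startRow;
              this.stopRow = stopRow;
            }

            /** A "get scan": start and stop row are the same single row. */
            public GetScanSketch(byte[] getRow) {
              this(getRow, getRow);
              this.getScan = true;
            }

            public boolean isGetScan() { return getScan; }

            /** Stop-row check a region scanner might use: for a get scan there is only
             *  one row, so no start/stop comparison is needed. */
            boolean isStopRow(byte[] currentRow, Comparator<byte[]> rowComparator) {
              if (getScan) {
                return !Arrays.equals(currentRow, startRow);   // anything past the single row stops us
              }
              return stopRow.length > 0 && rowComparator.compare(currentRow, stopRow) >= 0;
            }
          }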

          Want to remove this?

          +//    if (LOG.isDebugEnabled()) {
          +//      LOG.debug("compareResult=" + compareResult + " " + Bytes.toString(data, offset, length));
          +//    }
          

          This has to be public?

          +  public ReadWriteConsistencyControl getRWCC() {
          

          Is this unused?

          +  @SuppressWarnings({"UnusedDeclaration"})
             public final static byte [] REGIONINFO_FILE_BYTES =
               Bytes.toBytes(REGIONINFO_FILE);
          

          Remove it then?

          Same here:

          +  @SuppressWarnings({"UnusedDeclaration"})
             public long getRegionId() {
          

          There are a bunch of them.

          I'm about 1/5th done. So far patch looks great.

          Andrew Purtell added a comment -

          When I run alpha3 on top of 0.20 head I see delete test failures in:

          Testcase: testWeirdCacheBehaviour took 81.599 sec
          Testcase: testFilterAcrossMutlipleRegions took 47.8 sec
          Testcase: testSuperSimple took 18.15 sec
          Testcase: testFilters took 15.781 sec
          Testcase: testSimpleMissing took 16.448 sec
          Testcase: testSingleRowMultipleFamily took 16.686 sec
          Testcase: testNull took 21.443 sec
          Testcase: testVersions took 18.169 sec
          Testcase: testVersionLimits took 15.861 sec
          Testcase: testDeletes took 15.837 sec
          Caused an ERROR
          null
          java.lang.NullPointerException
          at org.apache.hadoop.hbase.client.TestClient.testDeletes(TestClient.java:1608)

          Testcase: testJIRAs took 46.161 sec

          Need more detail?

          stack added a comment -

          Why do this?

          -      List<KeyValue> results = new ArrayList<KeyValue>();
          -      store.get(get, null, results);
          -      return new Result(results);
          +      get.addFamily(family);
          +      return get(get, null);
          

          Is it because of this....up in HRegion:

          +      // The reason why we set it up high is so that each HRegionScanner only
          +      // has a single read point for all its sub-StoreScanners.
          +      ReadWriteConsistencyControl.resetThreadReadPoint(rwcc);
          

          This bit of the patch looks like its breaking the accumulation of qualifiers:

          -            List<KeyValue> result = new ArrayList<KeyValue>(1);
          -            Get g = new Get(kv.getRow());
          -            g.setMaxVersions(count);
          -            NavigableSet<byte []> qualifiers =
          -              new TreeSet<byte []>(Bytes.BYTES_COMPARATOR);
          -            qualifiers.add(qual);
          -            get(store, g, qualifiers, result);
          +            Get get = new Get(kv.getRow());
          +            get.setMaxVersions(count);
          +            get.addColumn(family, qual);
          +
          +            List<KeyValue> result = get(get);
          +
          

          Or is this no longer needed because we go back into Region.get rather than do a Store.get?

          I don't like this if clause style:

          + if (w != null)
          + rwcc.completeMemstoreInsert(w);

          I'd suggest you either wrap it in params or put it all on the one line.

          For example, undo changes like this I'd say:

          -      if(lockid == null) releaseRowLock(lid);
          +      if(lockid == null)
          +        releaseRowLock(lid);
          
          

          I think this needs a comment:

          + private int isScan;

          or maybe where it's assigned, so it's clear why it can't be a boolean though it's named as though it were one... maybe change its name?

          I don't get this:

          +      // TODO call the proper GET API
                 // Get the old value:
                 Get get = new Get(row);
          

          or this:

          +    //noinspection SuspiciousMethodCalls
          

          Why this change Ryan?

          -class KeyValueSkipListSet implements NavigableSet<KeyValue>, Cloneable {
          +class KeyValueSkipListSet implements NavigableSet<KeyValue> {
          

          It's crazy how much code you've removed from MemStore around #195.

          OK, that's enough for now.

          ryan rawson added a comment -

          This patch fixes the testDelete failure that people have been seeing; it should apply to both the 0.20 and 0.20_pre_durability branches.

          stack added a comment -

          Running unit tests, at least this one is failing for me:

          [junit] Test org.apache.hadoop.hbase.TestRegionRebalancing FAILED (timeout)

          stack added a comment -

          I'm trying to run on cluster but it's all hanging on me. It's probably a config mess-up on my part. Trying to figure it out.

          Meantime, this seems to run about 3-4 times slower than release against a standalone hbase:

          $ ./bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 1

          Andrew Purtell added a comment -

          0.20_pre_durability branch plus HBASE-2248-rr-pre-durability2.txt passes rebalancing for me but fails TestIdxHBaseCluster consistently.

          Testcase: testConcurrentReadWrite took 93.349 sec
          	FAILED
          nextCount=0, count=2, finalCount=2000
          junit.framework.AssertionFailedError: nextCount=0, count=2, finalCount=2000
          	at org.apache.hadoop.hbase.TestIdxHBaseCluster.testConcurrentReadWrite(TestIdxHBaseCluster.java:123)
          
          Testcase: testHBaseCluster took 41.074 sec
          

          Before the patch the indexed contrib tests pass on 0.20_pre_durability.

          stack added a comment -

          In the above I'm testing the branch-of-branch. All core tests but the above-noted rebalancing test passed.

          ryan rawson added a comment -

          Unfortunately, the way the branch and the branch-of-branch do things has diverged a lot, specifically the locations of the update lock and the flush request. I rearranged things a bunch and have this new patch. This might help with the PE slowness and other things as well.

          Andrew Purtell added a comment -

          On head of 0.20_pre_durability and patch rr-pre_durability3.txt I see this:

          Testcase: testWritesWhileScanning took 0.155 sec
                    FAILED
          i=36 expected<1000> but was: <0>
          junit.framework.AssertionFailedError: i=36 expected:<1000> but was: <0>
                    at org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileScanning(TestHRegion.java:2014)
          
          stack added a comment -

          I'm about 50% through HBASE-2248-rr-alpha3.txt:

          ReadWriteConsistencyControl is missing a license (I like the name of this class and it's nice and clean looking).

          This could be final:

          + private long writeNumber;

          This class doesn't have to be public:

          + public static class WriteEntry {

          Can you explain the below better please?

          -    // The Get returns the latest value but then does not return the
          -    // oldest, which was never deleted, ts[1]. 
          -    
          +
          +    // It used to be due to the internal implementation of Get, that
          +    // the Get() call would return ts[4] UNLIKE the Scan below. With
          +    // the switch to using Scan for Get this is no longer the case.
               get = new Get(ROW);
               get.addFamily(FAMILIES[0]);
               get.setMaxVersions(Integer.MAX_VALUE);
               result = ht.get(get);
               assertNResult(result, ROW, FAMILIES[0], QUALIFIER, 
          -        new long [] {ts[2], ts[3], ts[4]},
          -        new byte[][] {VALUES[2], VALUES[3], VALUES[4]},
          +        new long [] {ts[1], ts[2], ts[3]},
          +        new byte[][] {VALUES[1], VALUES[2], VALUES[3]},
                   0, 2);
          

          Reading it, it would seem that we should be getting ts[4] because we just added it previously?

          Why do this?

          -    Scan scan = new Scan();
          -    scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
          -      new BinaryComparator(Bytes.toBytes("row0"))));
          +    Scan scan = new Scan(Bytes.toBytes("row0"), Bytes.toBytes("row1"));
          +//    scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
          +//      new BinaryComparator(Bytes.toBytes("row0"))));
          

          Otherwise, patch looks great.

          This patch needs a release note describing how it changes how Get works.

          stack added a comment -

          Testing, I'm hanging when there is lots of concurrency at HLog.append, at the synchronization on the updateLock inside the append method. It's strange. A bunch of threads are BLOCKED at this explicit line. All but one say "waiting to lock". A single thread is BLOCKED but it has 'locked' successfully, as though it should have moved on, yet it shows the same location in the stack trace (same line number) and there doesn't seem to be anything in the block that threads could contend over.

          Trying w/ different JVMs to see if I can get more info.

          stack added a comment -

          So, it's not a lockup; rather, stuff is working but really, really slowly. It seems to be this patch, because going back to a clean hadoop 0.20.2 and the current state of the pre_durability branch, all runs fine again (until we hit an actual deadlock, i.e. the known deadlock issue). I'll spend more time trying to figure it out, but here is how it looks when you thread dump:

          Most threads are 'WAITING', etc. and then a good few are like the below BLOCKED:

          "IPC Server handler 36 on 60020" daemon prio=10 tid=0x273e4400 nid=0x2888 waiting for monitor entry [0x257ad000]
             java.lang.Thread.State: BLOCKED (on object monitor)
                  at org.apache.hadoop.hbase.regionserver.HLog.append(HLog.java:646)
                  - waiting to lock <0x3d5ec6a0> (a java.lang.Object)
          

          Invariably, there is one 'abnormal' BLOCKED thread that is the same as above – excepting thread names etc. – in all respects except that 'waiting to lock' is instead 'locked' – same line number and everything.

          I'll keep digging.

          Andrew Purtell added a comment -

          I see the same results as Stack testing up on EC2, with 1.6.0_14 (64 bit). First thing I do is warm the cluster with PE --nomapred randomWrite N (usually N=10 or 15). Not a deadlock, but writes get really slow fast. Without the high write concurrency (N=1 or 3) it's better.

          Andrew Purtell added a comment -

          We spend 76% of CPU time in ReadWriteConsistencyControl.completeMemstoreInsert. See attached 'profile.png'. Draining the write queue (for 42,175 puts?) explodes 42,166 calls to ReadWriteConsistencyControl.completeMemstoreInsert into 121,118,233 calls to AtomicLong.get and 121,143,574 calls to $WriteEntry.getWriteNumber, each arc represents 50% of the cumulative time there. See attached 'put_call_graph.png'.

          Andrew Purtell added a comment -

          From Ryan via email:

          There is a busy wait loop which attempts to ensure a write completes only when it is visible to others. With the log append as part of the "transaction" this is breaking down. The solution is to either forgo the busy wait loop (probably not a great idea) or restructure the code to do hlog appends first then memstore updates.

          I'll talk to stack tomorrow and we can figure which route is better... Although I'd guess option #2
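
          A simplified outline of the two orderings being weighed (illustrative interfaces only, not the real HLog/MemStore/RWCC signatures): with the WAL sync inside the read-point window, every writer spins behind earlier in-flight syncs, whereas appending the log first keeps the spin to the cheap memstore insert.

          class PutOrderingSketch {
            interface Wal { void appendAndSync(byte[] edit) throws Exception; }
            interface Memstore { void insert(byte[] edit); }
            interface Rwcc {
              Object begin();            // assign a write number, enter the queue
              void complete(Object w);   // advance the read point, then spin until visible
            }

            // Current shape: the WAL sync sits inside the visibility window, so every
            // writer behind us must spin through our disk sync before its edit is visible.
            void putWalInsideWindow(Wal wal, Memstore mem, Rwcc rwcc, byte[] edit) throws Exception {
              Object w = rwcc.begin();
              wal.appendAndSync(edit);   // milliseconds
              mem.insert(edit);          // microseconds
              rwcc.complete(w);          // spins behind every earlier in-flight sync
            }

            // Restructured shape: append (and sync) the WAL first, then run the short
            // begin-insert-complete sequence, so the spin only covers memstore work.
            void putWalFirst(Wal wal, Memstore mem, Rwcc rwcc, byte[] edit) throws Exception {
              wal.appendAndSync(edit);
              Object w = rwcc.begin();
              mem.insert(edit);
              rwcc.complete(w);
            }
          }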

          stack added a comment -

          The release note for this issue needs to include a note on how versioning and deletes change (I was going to say we should add a new issue to add more extensive unit testing of our claim that versions will come out in the right order now regardless of how they are put in, but we already have an issue for that – the ACID spec tests issue).

          ryan rawson added a comment -

          ok here is a patch that addresses all the above issues:

          • spin fixed by restructuring hlog append
          • indexed contrib test failure fixed
          • test failures due to compaction fixed
          • all comments addressed

          To accomplish the indexed hbase fix, I had to introduce a new notion of optional scanner-creation atomicity along with pre-flush-commit work: a sub-class can create an atomic section whereby some work is done (e.g. switching out an index) together with the flush commit (where the snapshot is removed and the hfile is introduced to open scanners), and this section is atomic relative to new scanner creation. This was required to fix race conditions in indexed hbase, which also means that indexed hbase is not as fast as it can be, since it cannot create new scanners during this one critical phase of flush (which includes re-reading scanner blocks, btw).
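
          One way to picture the "atomic relative to new scanner creation" section, as a sketch under assumptions rather than the patch's actual API: a read-write lock where flush commit plus the subclass's pre-commit work hold the write side and new scanner creation holds the read side.

          import java.util.concurrent.locks.ReentrantReadWriteLock;

          class FlushScannerAtomicitySketch {
            private final ReentrantReadWriteLock scannerCreationLock = new ReentrantReadWriteLock();

            interface Scanner {}

            /** New scanners take the read side: many can be created concurrently. */
            Scanner newScanner() {
              scannerCreationLock.readLock().lock();
              try {
                return buildScanner();
              } finally {
                scannerCreationLock.readLock().unlock();
              }
            }

            /** Flush commit takes the write side, so pre-commit work plus the commit are
             *  atomic with respect to scanner creation (not with respect to running scanners). */
            void commitFlush(Runnable subclassPreCommitWork, Runnable swapSnapshotForHFile) {
              scannerCreationLock.writeLock().lock();
              try {
                subclassPreCommitWork.run();   // e.g. an indexed region switches out its index
                swapSnapshotForHFile.run();    // memstore snapshot dropped, new hfile made visible
              } finally {
                scannerCreationLock.writeLock().unlock();
              }
            }

            private Scanner buildScanner() { return new Scanner() {}; }
          }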

          stack added a comment -

          This last patch is looking good. It looks like the spin lock slowing writes is gone, and most of the tests are passing. Will report more in the morning. Will let tests run overnight.

          stack added a comment -

          All tests but this in 'indexed' pass:

          [junit] Test org.apache.hadoop.hbase.regionserver.TestIdxRegionIndexManager FAILED

          My cluster test failed, but for reasons not attributable to this patch. I did not deadlock.

          stack added a comment -

          I'm going to commit this. All tests pass if I remove 'indexed'. Patch looks good. I have cluster issues but unrelated to this patch. Logs for regionservers look good.

          stack added a comment -

          Thanks all who contributed to this issue: Todd, Dan, Yoram and in particular Ryan.

          ryan rawson added a comment -

          Here is the updated version, with fixes taken from the work on 0.20_pre_durability but on plain old 0.20.

          ryan rawson added a comment -

          This removes row locks, which are no longer necessary to ensure atomic reads.

          Jonathan Gray added a comment -

          ryan, can you explain more about removal of row locks? seems like your patch just touches the simple get case that takes a row lock. Are client-exposed row locks completely gone now?

          stack added a comment -

          @Jon No, just the row lock around the Get. The client-side row lock is still in place.

          I just tested it up on cluster and it seems to run fine. Going to commit.

          ryan rawson added a comment -

          [[ Old comment, sent by email on Wed, 14 Apr 2010 14:43:49 -0700 ]]

          The previous idx test break was a simple heap check break, so easy to
          fix in the eventual destination. The core of the indexed stuff seemed
          to work.

          Andrew Purtell added a comment -

          [[ Old comment, sent by email on Wed, 14 Apr 2010 04:52:37 +0000 ]]

          Thanks Ryan! Testing now.

          stack added a comment -

          Hey Ryan, commit this to branch and TRUNK. Testing over in branch-of-branch says it's good.


            People

            • Assignee:
              ryan rawson
              Reporter:
              Dave Latham
            • Votes:
              0
              Watchers:
              8
