HBASE-11425: Cell/DBB end-to-end on the read-path

    Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.99.0
    • Fix Version/s: 2.0.0
    • Component/s: regionserver, Scanners
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Umbrella JIRA to make sure we can have blocks cached in an offheap backed cache. In the entire read path, we can refer to this offheap buffer and avoid onheap copying.
      The high level items I can identify as of now are:
      1. Avoid the array() call on BB in the read path. (This is there in many classes. We can handle it class by class.)
      2. Support Buffer based getter APIs in Cell. In the read path we will create a new Cell backed by a BB. Will be needed in CellComparator, Filter (like SCVF), CPs etc. (A rough sketch of such an API is given after this list.)
      3. Avoid KeyValue.ensureKeyValue() calls in the read path - this makes a byte copy.
      4. Remove all CP hooks (which are already deprecated) which deal with KVs (in the read path).

      Will add subtasks under this.
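
      A rough sketch of what such buffer based getter APIs could look like is below. The interface and method names here are only illustrative, not a committed API.

{code:java}
import java.nio.ByteBuffer;

// Illustrative sketch only: a Cell extension exposing ByteBuffer based accessors so the
// read path can refer to (possibly offheap) block memory without copying it onheap.
interface BufferBackedCell /* extends Cell */ {
  /** Buffer (maybe a slice of a cached, offheap block) holding the row key bytes. */
  ByteBuffer getRowByteBuffer();
  /** Position of the row key inside the returned buffer; avoids creating sliced BBs. */
  int getRowPosition();
  short getRowLength();

  ByteBuffer getValueByteBuffer();
  int getValuePosition();
  int getValueLength();
  // ... similar accessor triplets for family, qualifier and tags.
}
{code}

      A comparator or filter would then read directly from the buffer at the given position instead of calling getXXXArray() and copying.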

      Attachments

      1. Benchmarks_Tests.docx
        46 kB
        Anoop Sam John
      2. BenchmarkTestCode.zip
        10 kB
        Anoop Sam John
      3. gc.png
        31 kB
        stack
      4. GC pics with evictions_4G heap.png
        55 kB
        ramkrishna.s.vasudevan
      5. gets.png
        20 kB
        stack
      6. HBASE-11425.patch
        976 kB
        ramkrishna.s.vasudevan
      7. HBASE-11425-E2E-NotComplete.patch
        837 kB
        Anoop Sam John
      8. heap.png
        25 kB
        stack
      9. load.png
        31 kB
        stack
      10. median.png
        41 kB
        stack
      11. Offheap reads in HBase using BBs_final.pdf
        94 kB
        ramkrishna.s.vasudevan
      12. Offheap reads in HBase using BBs_V2.pdf
        115 kB
        Anoop Sam John
      13. ram.log
        8.46 MB
        stack
      14. Screen Shot 2015-10-16 at 5.13.22 PM.png
        442 kB
        stack

        Issue Links

        1.
        Support KeyValueCodec to encode non KeyValue cells. Sub-task Closed Anoop Sam John
         
        2.
        Support DirectByteBuffer usage in HFileBlock Sub-task Closed Anoop Sam John
         
        3.
        HFileBlock backed by Array of ByteBuffers Sub-task Resolved ramkrishna.s.vasudevan
         
        4.
        Facilitate using ByteBuffer backed Cells in the HFileReader Sub-task Resolved Unassigned
         
        5.
        Ensure Cells and its implementations work with Buffers also Sub-task Resolved Unassigned
         
        6.
        Support DirectByteBuffer usage in DataBlock Encoding area Sub-task Closed Anoop Sam John
         
        7.
        Avoid onheap buffer copying at RPCServer#serResponse Sub-task Resolved Anoop Sam John
         
        8.
        Column trackers and delete trackers should deal with BBs Sub-task Resolved Unassigned
         
        9.
        Prevent block eviction under us if reads are in progress from the BBs Sub-task Resolved ramkrishna.s.vasudevan
         
        10.
        Filters should work with ByteBufferedCell Sub-task Resolved Anoop Sam John
         
        11.
        Support DBB usage in Bloom and HFileIndex area Sub-task Closed Anoop Sam John
         
        12.
        Support BB usage in PrefixTree Sub-task Resolved ramkrishna.s.vasudevan
         
        13.
        Memstore and MemstoreScanner should work with BBs. Sub-task Resolved Unassigned
         
        14.
        Redo the hfile index length optimization so cell-based rather than serialized KV key Sub-task Closed stack
         
        15.
        Unsafe based ByteBuffer Comparator Sub-task Resolved Anoop Sam John
         
        16.
        Create ByteBuffer backed Cell Sub-task Resolved Unassigned
         
        17.
        Add unsafe putXXXUnsafe() BB methods to ByteBufferUtils Sub-task Resolved Unassigned
         
        18.
        Change DBEs to work with new BB based cell Sub-task Resolved Anoop Sam John
         
        19.
        Tags to work with ByteBuffer Sub-task Resolved Anoop Sam John
         
        20.
        Add ByteBufferedCell an extension to Cell Sub-task Resolved Anoop Sam John
         
        21.
        Remove deprecated seek/reseek methods from HFileScanner Sub-task Resolved Anoop Sam John
         
        22.
        Deprecate Filter#filterRowKey(byte[] buffer, int offset, int length) in favor of filterRowKey(Cell firstRowCell) Sub-task Resolved Anoop Sam John
         
        23.
        Deprecate RegionObserver#postScannerFilterRow CP hook with byte[],int,int args in favor of taking Cell arg Sub-task Resolved Anoop Sam John
         
        24.
        Change ColumnTracker and SQM to deal with Cell instead of byte[], int, int Sub-task Resolved Anoop Sam John
         
        25.
        Delayed scanner close in KeyValueHeap and StoreScanner Sub-task Resolved Anoop Sam John
         
        26.
        Change RegionScannerImpl to deal with Cell instead of byte[], int, int Sub-task Resolved Anoop Sam John
         
        27.
        Create MultiByteBuffer an aggregation of ByteBuffers Sub-task Resolved Anoop Sam John
         
        28.
        Close the scanner only after Call#setResponse Sub-task Resolved Anoop Sam John
         
        29.
        Use BufferBackedCell in read path after HBASE-12213 and HBASE-12295 Sub-task Resolved ramkrishna.s.vasudevan
         
        30.
        Change ByteBuff.getXXXStrictlyForward to relative position based reads Sub-task Resolved Anoop Sam John
         
        31.
        ByteBufferUtils#compareTo small optimization Sub-task Resolved Anoop Sam John
         
        32.
        Bloomfilter path to work with Byte buffered cells Sub-task Resolved ramkrishna.s.vasudevan
         
        33.
        Reduce garbage we create Sub-task Resolved Anoop Sam John
         
        34.
        Short circuit last byte check in CellUtil#matchingXXX methods for ByteBufferedCells Sub-task Resolved Anoop Sam John
         
        35.
        Create the fake keys required in the scan path to avoid copy to byte[] Sub-task Resolved ramkrishna.s.vasudevan
         
        36.
        Small optimization in SingleByteBuff Sub-task Resolved Anoop Sam John
         
        37.
        Short circuit checks in ByteBufferUtils compare/equals Sub-task Resolved Unassigned
         
        38.
        Shorten ByteBufferedCell#getXXXPositionInByteBuffer method name Sub-task Resolved Anoop Sam John
         
        39.
        Clear HFileScannerImpl#prevBlocks in between Compaction flow Sub-task Resolved Anoop Sam John
         
        40.
        Ensure write paths work with ByteBufferedCells in case of compaction Sub-task Resolved ramkrishna.s.vasudevan
         
        41.
        Support OffheapKV write in compaction with out copying data on heap Sub-task Resolved Anoop Sam John
         
        42.
        Tightening of the CP contract Sub-task Resolved Anoop Sam John
         
        43.
        Unnecessary lock in ByteBufferArray Sub-task Resolved Anoop Sam John
         
        44.
        References to previous cell in read path should be avoided Sub-task Resolved ramkrishna.s.vasudevan
         
        45.
        LRUBlockCache#returnBlock should try return block to Victim Handler L2 cache. Sub-task Resolved Anoop Sam John
         
        46.
        L1 cache caching shared memory HFile block when blocks promoted from L2 to L1 Sub-task Resolved Anoop Sam John
         

          Activity

          anoop.hbase Anoop Sam John added a comment -

          Now one related Q is whether we go with BR rather than BB for the APIs.

          apurtell Andrew Purtell added a comment -

          Now one related Q is whether we go with BR rather than BB for the APIs.

          +1

          We need BR instead of BB to work around BB API issues: inlining pessimism, range checking, and index compensations that cannot be skipped for performance, and related.
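
          To make that concrete, the kind of thing a BR (ByteRange-style) accessor buys us is an absolute, caller-bounded read, roughly like the sketch below. The class is hypothetical, not the actual ByteRange API.

{code:java}
// Hypothetical sketch of a ByteRange-style accessor: the caller owns the bounds, so a
// read is plain array indexing with no extra limit/position bookkeeping as in
// java.nio.ByteBuffer (only the JVM's implicit array bounds check remains), which
// keeps hot compare loops small enough to inline well.
final class SimpleRange {
  private final byte[] bytes;
  private final int offset;
  private final int length;

  SimpleRange(byte[] bytes, int offset, int length) {
    this.bytes = bytes;
    this.offset = offset;
    this.length = length;
  }

  byte get(int index) {
    return bytes[offset + index];
  }

  int getLength() {
    return length;
  }
}
{code}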

          ram_krish ramkrishna.s.vasudevan added a comment -

          Avoid KeyValue.ensureKeyValue() calls in the read path - this makes a byte copy.

          This is the idea behind changing to Cells in the read path. There are still some places where this has not been achieved. I will add those tasks to this.

          We need BR instead of BB to work around BB API issues

          Yes, +1 on this. HBASE-10772 and HBASE-10773 are related to this. I can link the related tasks to this JIRA to give a clear picture of the subtasks.

          ram_krish ramkrishna.s.vasudevan added a comment -

          All the subtasks under HBASE-7320 are in some way helping this. A few more may be needed here.

          stack stack added a comment -

          Thanks for filing this one and yeah, let's finish up the Cell conversions.

          How are we thinking to back Cells w/ bytes that are in the block cache? Currently we copy the block cache bytes onheap to guard against the blocks being evicted out from under us. We'll do reference counting?

          Any idea of how much slower an offheap merge sort will be doing BB#get (or BR#get)? I'm up for doing a bit of measuring....

          ram_krish ramkrishna.s.vasudevan added a comment -

          Currently we copy the block cache bytes onheap to guard against the blocks being evicted out from under us. We'll do reference counting?

          I think this is one major area where we may have to work. If a block is currently being scanned, set a reference and do not evict it. On what basis should we select the next block that can be evicted? Need to do some more analysis on this.
          Interesting!
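
          One possible shape of that, purely as a sketch (hypothetical names, not a committed design): each cached block carries a reference count that readers bump while they hold Cells over its memory, and the evictor skips blocks whose count is non-zero.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of refcount-guarded eviction for a bucket cache entry.
// A real implementation would also have to make tryEvict atomic with respect to
// concurrent retain() calls; this only shows the idea.
final class CachedBlock {
  private final AtomicInteger refCount = new AtomicInteger(0);

  void retain() {                 // called when a scanner starts reading Cells off this block
    refCount.incrementAndGet();
  }

  void release() {                // called once the scanner is done with the block
    refCount.decrementAndGet();
  }

  boolean tryEvict() {            // evictor may free the backing buckets only when unused
    return refCount.get() == 0;
  }
}
{code}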

          anoop.hbase Anoop Sam John added a comment -

          Yes, we need ref counting. The MemstoreSlab chunk pool has a similar need and handles it with ref counting.

          Any idea of how much slower an offheap merge sort will be doing BB#get (or BR#get)? I'm up for doing a bit of measuring....

          Have not done any comparison test yet. Will do some plain tests doing just key compares (millions of times) reading from DBB and HBB (this will use the unsafe compare). Got into some bug fixes on visibility; will start again next week. It would be great if you can measure, boss.

          anoop.hbase Anoop Sam John added a comment -

          We need BR instead of BB to work around BB API issues: inlining pessimism, range checking, and index compensations that cannot be skipped for performance, and related.

          Yes. So we will have our own HeapBB/DirectBB implementations rather than just wrapping the nio objects.

          anoop.hbase Anoop Sam John added a comment -

          Testing with 2 million Cells with a single cell per row.
          Writing all cells to a BB/DBB and trying a seek to the last KV (to force a compare across all cells in the BB/DBB).
          Seek code is like what we have in ScannerV3#blockSeek.

          With RK length 17 bytes (first 13 bytes are the same), getting almost the same result.
          With RK length 117 bytes (first 113 bytes are the same), the DBB based read shows a ~3% degradation.

          apurtell Andrew Purtell added a comment -

          This is with BB or BR? How about 1kb RKs?

          anoop.hbase Anoop Sam John added a comment -

          BB only. And not actual read perf numbers; just the seek part, and so reads from BB vs DBB.

          How about 1kb RKs?

          Will test more cases.

          apurtell Andrew Purtell added a comment -

          Does it make a difference if using the BR API instead?

          stack stack added a comment -

          Nice Anoop Sam John. Mind posting your little test tool so I can run it locally?

          anoop.hbase Anoop Sam John added a comment -

          Testing with 2 million Cells with a single cell per row.
          Writing all cells to a BB/DBB and trying a seek to the last KV (to force a compare across all cells in the BB/DBB).
          Seek code is like what we have in ScannerV3#blockSeek.
          With RK length 17 bytes (first 13 bytes are the same), getting almost the same result.
          With RK length 117 bytes (first 113 bytes are the same), the DBB based read shows a ~3% degradation.

          Well, in this test the read and compare were from an HBB and a DBB, and those are almost the same.
          In the case of our CellComparator we have an Unsafe based optimization. In my old test this was not in use. With Unsafe based reads from HBB#array() [this is what happens in HFileReaderV2/V3] there is a significant perf diff with DBB. With an RK length of 117 bytes and 2 million cells, seeking to the last cell, the DBB test is 50% slower.

          I am thinking of doing Unsafe based compares for data in DBB as well.

          Just done Unsafe based access from DBB/HBB and then we are in better shape. The above DBB based test is ~12% slower than the old HBB.array() based compares. Will raise a subtask and attach the approach there.
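
          Roughly, the trick is to compare against the backing memory directly: for a heap BB via its array plus Unsafe's array base offset, for a direct BB via its native address. A simplified, byte-at-a-time sketch is below (a real implementation would read 8 bytes per step); this is a hypothetical helper, not the actual ByteBufferUtils code.

{code:java}
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

// Simplified sketch of an Unsafe backed ByteBuffer comparison. Assumes heap buffers
// have an accessible backing array.
final class UnsafeBBCompare {
  private static final Unsafe UNSAFE;
  private static final long BYTE_ARRAY_BASE;
  static {
    try {
      Field f = Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      UNSAFE = (Unsafe) f.get(null);
      BYTE_ARRAY_BASE = UNSAFE.arrayBaseOffset(byte[].class);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  static int compare(ByteBuffer a, int aPos, ByteBuffer b, int bPos, int len) {
    for (int i = 0; i < len; i++) {
      int x = getByte(a, aPos + i) & 0xff;
      int y = getByte(b, bPos + i) & 0xff;
      if (x != y) {
        return x - y;
      }
    }
    return 0;
  }

  private static byte getByte(ByteBuffer buf, int index) {
    if (buf.isDirect()) {
      // Direct BB: read straight from native memory at address + index.
      return UNSAFE.getByte(((sun.nio.ch.DirectBuffer) buf).address() + index);
    }
    // Heap BB: read from the backing array, skipping ByteBuffer's own position/limit checks.
    return UNSAFE.getByte(buf.array(), BYTE_ARRAY_BASE + buf.arrayOffset() + index);
  }
}
{code}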

          ram_krish ramkrishna.s.vasudevan added a comment -

          A document explaining the motive, the design considerations, the reasons for arriving at BB, and the Cell level API changes required for supporting offheap memory in HBase's read path.
          We will be uploading a patch ported to trunk by the end of this week or early next week, along with some perf results.
          Request feedback/comments on the doc and the approach.

          Apache9 Duo Zhang added a comment -

          I'm still a little worried about the ref counting part, as I said before.
          Sometimes it could be a disaster for later developers, because it is easy to miss a decrement but very hard to know that a problem is caused by a missing decrement, and even if we know, it is hard to find where we missed it...
          Let's see the code; maybe we can find a way to handle it cleanly.

          BTW, there seems to be a typo: you mean HBASE-13142 at the end of the document? We do not have HBASE-13412 yet.

          anoop.hbase Anoop Sam John added a comment -

          The ref count increment/decrement happens in one place. You will get more detail when seeing the code.
          Yes, it should be HBASE-13142. Thanks for correcting.
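
          As an illustration of the "one place" idea (sketch only, reusing the hypothetical CachedBlock from the eviction discussion above): the scanner pins the block once when it takes it and releases it in a single finally block, so a missed decrement cannot hide in scattered call sites.

{code:java}
// Sketch: confine the decrement to exactly one place so leaks are easy to audit.
final class BlockReader {
  void scanBlock(CachedBlock block) {
    block.retain();               // single increment when the scanner takes the block
    try {
      // ... create Cells pointing into the block's buffers and process them ...
    } finally {
      block.release();            // single decrement, guaranteed even on exceptions
    }
  }
}
{code}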

          Apache9 Duo Zhang added a comment -

          Anoop Sam John increment/decrement in one place sounds good to me.

          stack stack added a comment -

          Thanks for the writeup. Makes it easier discussing this new dev.

          "Typical used value for max heap size is 32-48 GB."

          This ain't right, is it? Usually we have folks hover just below 32G so can do compressed pointers.

          "Each bucket’s size is fixed to 4KB."

          Should bucket size be same as the hfile block size?

          Can MBB be developed in isolation with tests and refcounting tests apart from main code base? Is that being done?

          High-level, general question: So eviction was easy before. When memory pressure just evict until needed memory is made available. The eviction is now made more complicated because have to check for non-zero refcount? And what if can't find necessary memory? What happens?

          "Note that the LRU Cache does not have this block reference counting happening as that does not deal with BBs and deals with the HFileblock objects directly."

          Why not? We copy from the LRU blocks to Cell arrays? Couldn't Cells go against the LRU blocks directly too? Or I have it wrong?

          I don't see a downside listing that we'll be doubling the objects made when offheap reading. Is that right?

          "Please note that the Cells in the memstore are still KV based (byte [] backed)" ... this is because you are only doing read-path in this JIRA, right? Then again, reading, we have to read from the MemStore so this means that read path can be a mix of onheap and offheap results?

          On adding new methods to Cell, are there 'holes'? We talked about this in the past and it seemed like there could be strange areas in the Cell API if you did certain calls. If you don't know what I am on about, I'll dig up the old discussion (I think it was on mailing list... Ram you asked for input).

          ... or maybe the holes have been plugged by 'Using getXXXArray() would throw UnSupportedOperationException. '? And....
          "This will make so many short living objects creation also. That is why we decided to go with usage of getXXXOffset() and getXXXLength() API usage also along with buffer based APIs"

          So, you might want to underline this point. Its BB but WE are managing the position and length to save on object creation and to bypass BB range checking, etc.

          What does that mean for the 'client'? When you give out a BB, its position, etc., is not to be relied upon. That will be disorientating. Pity you couldn't throw unsupportedexception if they tried use position, etc. So you need BB AND the Cell to get at content. BB for the array and then Cell for the offset and length...

          So, this API is for users on client-side? It is going to confuse them when they have a BB but the position and limit are duds. In client, when would they be doing BB? Never? Client won't be offheaping? If so, could the BB APIs be mixed in to Cell on the server only?

          So, why have the switch at all? The hasArray switch? Why not BB it all the time? It would simplify the read path. Disadvantage would be it'd be extra objects?

          When you say this: "Note that even if the HFileBlock is on heap BB we do not support getXXXArray() APIs. " This is only if hasArray returns false, right?

          Yeah, looks like 2.0.

          Tell us more about the unsafe manipulation of BBs? How's that work?

          Nice writeup.

          vrodionov Vladimir Rodionov added a comment -

          I am quite skeptical about this whole idea ... Here is why:

          An off heap cache can store blocks in compressed form. It means that you won't be able to back an HFileBlock by such a compressed block - you have to decompress it first. From a performance point of view it does not matter whether you do this into a direct BB (new approach) or into a byte array-backed BB (existing).

          Am I missing anything?

          anoop.hbase Anoop Sam John added a comment -

          Having block data in compressed form in the BC is an optional thing. In such a case, yes, we have to decompress first, and at that time it can be into a byte array backed BB. We are not trying to change that.
          The change is for when the data is cached in non compressed form (but it can be in DBE form). Then we avoid the need for a copy. The block can be backed by N offheap buckets, Cells are made out of that, and those cells are then backed by buffers rather than byte[].
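
          Conceptually (just a sketch, not the eventual class), a block spread over N buckets can be exposed as one logical buffer by routing each absolute index to the right fragment:

{code:java}
import java.nio.ByteBuffer;

// Sketch of a "multi byte buffer": one logical view over N (possibly offheap) ByteBuffers,
// so an HFileBlock can sit across several bucket-cache buckets without copying.
// Hypothetical class; fragments are assumed to start at position 0, and a real
// implementation would also offer bulk/primitive reads and a binary search.
final class MultiBuffer {
  private final ByteBuffer[] items;   // fragments, e.g. fixed-size buckets
  private final int[] itemBeginPos;   // cumulative start offset of each fragment

  MultiBuffer(ByteBuffer[] items) {
    this.items = items;
    this.itemBeginPos = new int[items.length + 1];
    int off = 0;
    for (int i = 0; i < items.length; i++) {
      itemBeginPos[i] = off;
      off += items[i].limit();
    }
    itemBeginPos[items.length] = off;
  }

  byte get(int globalIndex) {
    int item = findItem(globalIndex);
    return items[item].get(globalIndex - itemBeginPos[item]);
  }

  private int findItem(int globalIndex) {
    for (int i = 0; i < items.length; i++) {
      if (globalIndex < itemBeginPos[i + 1]) {
        return i;
      }
    }
    throw new IndexOutOfBoundsException(String.valueOf(globalIndex));
  }
}
{code}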

          anoop.hbase Anoop Sam John added a comment -

          This ain't right, is it? Usually we have folks hover just below 32G so can do compressed pointers.

          I think I have seen in mails that some users have 48G also, at least some users who were trying some PoCs (I met them offline). Anyway, can change to ~32G.

          Should bucket size be same as the hfile block size?

          I wanted to come to this topic. This is hard coded now. Can this be made configurable? If this can be a larger value, like the block size, it would be better as per our changes.

          Can MBB be developed in isolation with tests and refcounting tests apart from main code base? Is that being done?

          Yep. When we put up patches, we can make sure to do it this way. Those are in the sub-tasks.

          The eviction is now made more complicated because have to check for non-zero refcount? And what if can't find necessary memory? What happens?

          The eviction tries to evict some unused blocks. If all of them have reads in progress (worst case), the new block cannot be cached. Maybe that should be tried again after a delay?

          Why not? We copy from the LRU blocks to Cell arrays? Couldn't Cells go against the LRU blocks directly too? Or I have it wrong?

          In the LRU we cache the block object itself. It has its own underlying memory. Even if a block with a read in progress is evicted, the memory area it refers to is not freed. The only thing is that after this read, that block will no longer be referenced, and so neither will the block data area. Am I making it clear?

          I don't see a downside listing that we'll be doubling the objects made when offheap reading. Is that right?

          Say in a read we deal with N HFileBlocks; we will have extra MBB objects created for each block. But per cell we won't create any new objects. In comparators etc. we check hasArray() and based on that use the buffer/array based APIs. When creating BB backed cells from an HFileBlock which is backed by an MBB, we try our best to refer to the original BB (an item in the MBB) and not create/duplicate extra BBs. But yes, some extra objects will be there (duplicated BBs). I can give a count based on a test scenario, Stack. Was in the middle of something else and missed doing this.

          have to read from the MemStore so this means that read path can be a mix of onheap and offheap results?

          yes

          or maybe the holes have been plugged by 'Using getXXXArray() would throw UnSupportedOperationException. '? And....

          Yep. If the Cell impl is backed by a BB (on heap/off heap), its getXXXArray APIs will throw UnsupportedOperationException.

          So, you might want to underline this point. Its BB but WE are managing the position and length to save on object creation and to bypass BB range checking, etc

          Yes, correct.

          Client won't be offheaping? If so, could the BB APIs be mixed in to Cell on the server only?

          Something like a ServerCell which extends Cell? Sounds reasonable. We have had some discussion like this also.

          So, why have the switch at all? The hasArray switch? Why not BB it all the time? It would simplify the read path. Disadvantage would be it'd be extra objects?

          Yes, the extra BB wrapper has to be created every time one calls getXXXArray(). It is an extra object creation plus some ops (like limit, pos checks) which happen in the BB classes. That is a bit costly. Had done some unit tests. Ram, do you have the numbers?

          When you say this: "Note that even if the HFileBlock is on heap BB we do not support getXXXArray() APIs. " This is only if hasArray returns false, right?

          Yes, when hasArray returns false. The point is, when the Cell is backed by a buffer we will have hasArray as false (whether DBB or HBB).

          Tell us more about the unsafe manipulation of BBs? How's that work?

          It reads data from the BB bypassing the BB APIs, reading directly from memory. HBASE-12345 has a patch which adds an Unsafe based compare for data in a BB. Similar is added for reading int/long etc. Same as we do for bytes in Bytes.java.

          stack stack added a comment -

          Vladimir Rodionov

          Sortof.

          This effort takes us further along a couple of paths. There is the foreground being able to have most of the data offheap when reading. An ancillary is proving an alternate Cell implementation is possible, one that is other than KeyValue.

          After the lads have the above behind us, we can move to the next interesting challenges. For example, a PrefixTree Cell implementation that keeps the key and value encoded/compressed as we traverse the read path.

          Regards the particular point you raise, yeah, currently we would have to decompress to put this read-path on top of it. Would be cool if decompress could be done with native code before we brought the block into BC.

          TODO

          vrodionov Vladimir Rodionov added a comment -

          The change is when the data is cached in the non compressed form (But can be in the DBE form)

          Anoop Sam John, if the goal of this JIRA is performance improvement, have you estimated the following scenarios:

          1. DBE - ON, block compression - OFF (byte array end-to-end - BA)
          2. DBE - OFF, block compression - OFF (BA)
          3. DBE - ON, block compression - ON (BA)
          4. DBE - OFF, block compression - ON (BA)
          5. DBE - ON, block compression - OFF (byte buffer end-to-end - BB)
          6. DBE - OFF, block compression - OFF (BB)
          7. DBE - ON, block compression - ON (BB)
          8. DBE - OFF, block compression - ON (BB)

          You optimize use case No. 5 only. Do you think it is going to be faster than any of the first four (BA)? People like compression, especially if it does not affect benchmark performance too much and helps them with their application. Essentially, you advise users not to use compression and to use only DBE, but all scan operations and get/multi get operations take a performance hit with DBE enabled. I think No. 4 (DBE - OFF, block compression ON, byte array backed) is going to be faster than No. 5 (DBE - ON, block compression OFF, byte buffer backed) in any benchmark.

          These are my words of caution ... before you start such a large project, make sure that the benefits you are hoping for are really achievable.

          vrodionov Vladimir Rodionov added a comment -

          Would be cool if decompress could be done with native code before we the brought the block into BC.

          bigbase.org does that. Block Cache compression is all native. Unfortunately, I do not have time to continue working on this project now; maybe in the near future.

          vrodionov Vladimir Rodionov added a comment -

          I think No. 4 (DBE - OFF, block compression ON, byte array backed) is going to be faster than No. 5 (DBE - ON, block compression OFF, byte buffer backed) in any benchmark.

          In almost any. Scans with lots of skips will be faster when block compression is off, I think.

          stack stack added a comment -

          I wanted to come to this topic. This is hard coded now. Can this be made configurable? If this can be larger value,like block size, better as per our changes

          Yeah, configurable sounds right but thinking on it more, rare will be the case that the fuzzy hfile block will fit into the BC hard-coded block.

          May be that should be tried after a delay?

          Would this be a new block type? One that is not backed by BC? No on delay. Flag it's happening and move on. It'd be an offheap allocation I suppose, so that will be delay enough (smile).

          Am I making it clear?

          Yes.

          Some thing like a ServerCell which extend Cell?

          We can't give users an API that has you get data from a BB but you need to use the enclosing Cell to figure where to read from and how much. Users will kick us out!

          Yes the extra BB wrapper which has to be created every time one calls getXXXArray().

          Would be interesting to see cost. Would be sweet if only one readpath... but I'd imagine the perf difference will be too great so we'll have to have two.

          It reads data from BB bypassing the BB APIs.

          Add this to doc.

          Thanks.

          stack stack added a comment -

          I like Vladimir Rodionov's grid. If only for illustration of where you are focused, suggest you add it to the doc, Anoop Sam John. Would be good to get some answers on his questions too.

          BigBase open source, Vladimir?

          vrodionov Vladimir Rodionov added a comment -

          BigBase open source Vladimir?

          Yes, but it is not Apache, yet.

          ram_krish ramkrishna.s.vasudevan added a comment -

          Should bucket size be same as the hfile block size?

          Yes, that would be better in many cases; however, the odd block may go beyond the hfile block size.

          Can MBB be developed in isolation with tests and refcounting tests apart from main code base? Is that being done?

          We need some tests for the refcounting part. Apart from that they can be individual tasks as Anoop says.

          Reg. the BB and comparators having two paths: that would be the ideal way as per the profiler reports. That is because all the KVs coming from the HFiles are Buffer backed cells, while the cells in the memstore are byte[] backed. So, as mentioned in the doc, if we try to create only BB based rows, families and qualifiers, we may have to wrap these byte[]s, which is a costlier operation. Also, in the case of fake keys it is always better to create them in byte[] rather than in a BB, because for BBs we have to do an allocation and then copy the contents. All of these are costlier.
          Hence, when we create a fake key and compare it against a key from an HFile, we have two versions of cells, one backed by byte[] and another by a BB. So if/else based comparisons would be better.
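
          Roughly, the comparator branches once per cell on whether the cell is array backed, something along these lines. Cell and Bytes are the existing HBase classes; BufferBackedCell and the compareBuffer* helpers are the hypothetical sketches from earlier in this issue.

{code:java}
// Sketch of the two-path compare: memstore cells are byte[] backed, cells from cached
// HFile blocks are buffer backed, and the hot path avoids wrapping byte[]s into BBs.
int compareRows(Cell left, Cell right) {
  if (left instanceof BufferBackedCell && right instanceof BufferBackedCell) {
    BufferBackedCell l = (BufferBackedCell) left;
    BufferBackedCell r = (BufferBackedCell) right;
    return compareBuffers(l.getRowByteBuffer(), l.getRowPosition(), l.getRowLength(),
        r.getRowByteBuffer(), r.getRowPosition(), r.getRowLength());
  }
  if (left instanceof BufferBackedCell) {
    BufferBackedCell l = (BufferBackedCell) left;
    return compareBufferToArray(l.getRowByteBuffer(), l.getRowPosition(), l.getRowLength(),
        right.getRowArray(), right.getRowOffset(), right.getRowLength());
  }
  if (right instanceof BufferBackedCell) {
    BufferBackedCell r = (BufferBackedCell) right;
    // Flip the sign because the operands are swapped.
    return -compareBufferToArray(r.getRowByteBuffer(), r.getRowPosition(), r.getRowLength(),
        left.getRowArray(), left.getRowOffset(), left.getRowLength());
  }
  return Bytes.compareTo(left.getRowArray(), left.getRowOffset(), left.getRowLength(),
      right.getRowArray(), right.getRowOffset(), right.getRowLength());
}
{code}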

          Reg the Unsafe comparators,
          They are just the same as in byte[] array now.

          So, you might want to underline this point. Its BB but WE are managing the position and length to save on object creation and to bypass BB range checking, etc.

          Yes. That is an important decision we had to make. One objective is to reduce object creation and another is to use the same APIs for offset and length.

          Client won't be offheaping? If so, could the BB APIs be mixed in to Cell on the server only?

          We discussed that ServerCell concept. But I would argue not do that because then the user would have two types of Cells - one on the write path and the other cell on the read path. I would say that would make things more complex and not much ease of use too.

          I will try to make a trunk based patch and upload it for reference.

          ram_krish ramkrishna.s.vasudevan added a comment -

          Adding to what Anoop says, we are not trying to change where we do compression; that part remains the same. It is only after we start using an HFileBlock that we use it in offheap mode, and in particular we avoid the copy that happens in the BucketCache every time a block needs to be used (in this case it is going to be a decompressed block only).
          One point to note here is that in the case of DBE it is an encoded block - we still go on with the encoded block only, and the existing logic for decoding the block still works the same way.
          In the existing code there are two copies that happen here - one from the BucketCache and the other in the DBE algo.
          Now we try to avoid the first one.

          ram_krish ramkrishna.s.vasudevan added a comment -

          Would be good to get some answers on his questions too.

          I could take that IA. Get some tests in this area to be more clear on this.

          If only for illustration of where you are focused, suggest you add to the doc

          Sure. Would also ensure the other points that were specifically discussed get added to the doc.

          anoop.hbase Anoop Sam John added a comment -

          We discussed that ServerCell concept. But I would argue not do that because then the user would have two types of Cells - one on the write path and the other cell on the read path. I would say that would make things more complex and not much ease of use too.

          No, what I mean here is that the Cell interface won't change and on the client side the user will still interact with Cell. ServerCell is an extension of Cell which exists only on the server side, in both read and write paths. The only thing is that CPs and Filters will then get a new type: ServerCell instead of Cell.

          stack stack added a comment -

          ramkrishna.s.vasudevan

          On this "But I would argue not do that because then the user would have two types of Cells - one on the write path and the other cell on the read path.", expecting clients to use BB in a wonky way is not on (smile). I think Anoop Sam John cleaned up what is meant.

          anoop.hbase Anoop Sam John added a comment -

          You are talking about the case hbase.block.data.cachecompressed = true, right? Yes, this recently added feature allows keeping block data in compressed form in the BC. When such a block is read from the BC, this happens:
          Step 1: We create a new onheap buffer and copy the compressed data from the buckets into it. Make an HFileBlock backed by this compressed data.
          Step 2: Unpack this block. We will create a new byte[] with size equal to the uncompressed data size for this block. The compression algo will uncompress the block data into this new buffer.

          With our changes we avoid the new buffer and the copy needed in Step 1. We create a block backed by the MBB. We have a new InputStream over the MBB and pass that for uncompress. Yes, Step 2 will still be there.
          So there is an advantage here also. Does that make sense?
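
          The InputStream piece of that is small; something like the sketch below (over a single ByteBuffer for brevity, where the real one would wrap the multi-buffer block). This is illustrative only, not the committed class.

{code:java}
import java.io.InputStream;
import java.nio.ByteBuffer;

// Sketch: an InputStream view over a (possibly offheap) ByteBuffer, so a decompressor
// can consume cached block bytes without first copying them into an onheap byte[].
// Reads go through a duplicate so the cached buffer's position/limit are untouched.
final class ByteBufferInputStream extends InputStream {
  private final ByteBuffer buf;

  ByteBufferInputStream(ByteBuffer buf) {
    this.buf = buf.duplicate();
  }

  @Override
  public int read() {
    return buf.hasRemaining() ? (buf.get() & 0xff) : -1;
  }

  @Override
  public int read(byte[] b, int off, int len) {
    if (!buf.hasRemaining()) {
      return -1;
    }
    int n = Math.min(len, buf.remaining());
    buf.get(b, off, n);
    return n;
  }

  @Override
  public int available() {
    return buf.remaining();
  }
}
{code}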

          anoop.hbase Anoop Sam John added a comment -

          You mean that for all these cases the data is already cached and comes from the BC? Or is it a direct read from DFS?

          anoop.hbase Anoop Sam John added a comment -

          Attaching an E2E patch for reference. We are still doing some more cleanups, and there is still some code duplication in the patch that we are removing.

          ram_krish ramkrishna.s.vasudevan added a comment -

          Not able to add to RB. The RB tool hangs when we try to add a patch.

          ram_krish ramkrishna.s.vasudevan added a comment -

          Any comments/suggestions on the attached E2E patch? We are working on refining the patch to cover some more cases where we need to clearly distinguish the getXXXArray and getXXXBB methods. Suggestions on the above patch would help us greatly.
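
          For context, the dual-path pattern that the getXXXArray vs getXXXBB distinction forces on callers looks roughly like the sketch below. This is a hypothetical helper, reusing the ServerCell idea sketched in an earlier comment, not a utility from the patch:

          import java.nio.ByteBuffer;
          import org.apache.hadoop.hbase.Cell;

          // Hypothetical helper showing the array-vs-buffer dispatch discussed above.
          public final class CellValueReader {
            private CellValueReader() {}

            // Copy the cell's value out, whether it is backed by a byte[] or a ByteBuffer.
            public static byte[] cloneValue(Cell cell) {
              byte[] out = new byte[cell.getValueLength()];
              if (cell instanceof ServerCell) {               // hypothetical BB-backed type
                ServerCell sc = (ServerCell) cell;
                ByteBuffer vb = sc.getValueByteBuffer().duplicate();
                vb.position(sc.getValuePosition());
                vb.get(out, 0, out.length);                   // read via the buffer API
              } else {
                System.arraycopy(cell.getValueArray(), cell.getValueOffset(),
                    out, 0, out.length);                      // classic array path
              }
              return out;
            }
          }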

          ram_krish ramkrishna.s.vasudevan added a comment -

          Updated patch handling the ByteBuffer and byte[] cases throughout the read path, with some refactoring also done. It may still not be the final patch, but it is more suitable for review.

          ram_krish ramkrishna.s.vasudevan added a comment -

          RB link https://reviews.apache.org/r/32687/

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12708404/HBASE-11425.patch
          against master branch at commit f1f4b6618334767d0da0f47965309b21676e7e9f.
          ATTACHMENT ID: 12708404

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 246 new or modified tests.

          +1 hadoop versions. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)

          -1 javac. The applied patch generated 63 javac compiler warnings (more than the master's current 46 warnings).

          +1 protoc. The applied patch does not increase the total number of protoc compiler warnings.

          -1 javadoc. The javadoc tool appears to have generated 14 warning messages.

          -1 checkstyle. The applied patch generated 2050 checkstyle errors (more than the master's current 1926 errors).

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 lineLengths. The patch introduces the following lines longer than 100:
          + ByteBufferUtils.copyFromBufferToByteArray(val, getValueBuffer(), getValueOffset(), 0, getValueLength());
          + ByteBufferUtils.copyFromBufferToByteArray(fam, getFamilyBuffer(), getFamilyOffset(), 0, getFamilyLength());
          + ByteBufferUtils.copyFromBufferToByteArray(qual, getQualifierBuffer(), getQualifierOffset(), 0, getQualifierLength());
          + ByteBufferUtils.copyFromBufferToByteArray(row, getRowBuffer(), getRowOffset(), 0, getRowLength());
          + ByteBufferUtils.copyFromBufferToByteArray(minimumMidpointArray, right, rightOffset, 0, diffIdx + 1);
          + ByteBufferUtils.copyFromBufferToByteArray(minimumMidpointArray, left, leftOffset, 0, diffIdx);
          + ByteBufferUtils.copyFromBufferToByteArray(minimumMidpointArray, right, rightOffset, 0, diffIdx + 1);
          + return matchingFamily(left, left.getFamilyOffset(), left.getFamilyLength(), buf, offset, length);
          + cell.getFamilyArray(), cell.getFamilyOffset(), cell.getFamilyLength(), cell.getQualifierArray(),
          + public static int findCommonPrefixInQualifierPart(Cell left, Cell right, int qualifierCommonPrefix) {

          +1 site. The mvn site goal succeeds with this patch.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13505//testReport/
          Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13505//artifact/patchprocess/newFindbugsWarnings.html
          Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13505//artifact/patchprocess/checkstyle-aggregate.html

          Javadoc warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13505//artifact/patchprocess/patchJavadocWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/13505//console

          This message is automatically generated.

          stack stack added a comment -

          The patch is too big. I'm 7 pages in on a 13 page review. Could we piecemeal it?

          anoop.hbase Anoop Sam John added a comment -

          Basically we have to split it into sub-patches. We posted this so that there could be a look at, and an understanding of, the general approach; I think you got to that. Anyway, we got some good feedback and we are working on that. I am doing the ServerCell stuff; in a day or so I can put up something so that we can see what the changes are. Please stay tuned.

          Anoop

          ram_krish ramkrishna.s.vasudevan added a comment -

          Yes Stack. We have plans to split up the patch, but the most challenging part is how to split it up and in what order.
          A few tasks like the Cell API changes, ref counting, and the MultiByteBuffer class can all be added individually, but the rest are all intertwined. We can think of a strategy for doing it.

          Thanks for having a look. We are working on the review comments in parallel too.

          ram_krish ramkrishna.s.vasudevan added a comment -

          Also, this patch will need certain public-facing APIs to be deprecated and new ones to be added. That can be done as a separate task too.

          stack stack added a comment -

          So, what is going on with this patch now? You want the CellComparator patch, HBASE-10800, in first? Let me look at the ServerCell patch too.

          anoop.hbase Anoop Sam John added a comment -

          Yes, I would say we deal with HBASE-10800 first and then the ServerCell and then come to this Jira.

          stack stack added a comment -

          Took another read of the doc (and above comments on it).

          1. (Continuing from the comments above), suggest adding to the doc an estimate of how many extra objects will be made going this route, and Vladimir's grid to show what you are focused on.
          2. Did you fellas have a look at how others do offheaping or if there were libs you could have made use of? Would have been good to include notes on your findings in here.
          3. The section on hasArray (if hasArray is false, it seems to imply hasByteBuffer is true) and the discussion of the added APIs (when they come into effect and when they throw unsupported exceptions) will need a rewrite in light of the feedback above and a review of the recent patches (the API method names I think we've cleaned up too).
          4. Sounds like you fellas looked at netty ByteBuf too. Add in your findings I'd say.
          5. Would have liked to have more detail around the RPC findings. You think it could be different now we have buffer reuse? We could save making a cellblock?
          6. Looking at diagrams for perf, I can't tell if more is better or not. Suggest you write up a summary of what the diagrams are showing.
          7. This feature when on, will be for whole server, right? Can't do by table or region, right?

          Thanks lads.

          ram_krish ramkrishna.s.vasudevan added a comment -

          We will update all the other things as you have said in the doc.

          You think it could be different now we have buffer reuse? We could save making a cellblock?

          We can try once again with buffer reuse. But writing multiple cells individually to the socket was still the time-consuming factor, and that was reduced when we created a cell block.
          Will come back to the other comments shortly. Thanks Stack.

          stack stack added a comment -

          Anoop Sam John The write up is excellent – especially the bit where you dumb it down listing out conclusion at end of each test and do up the table with all results. Suggest you do an edit, add a conclusion, and post the jmh files on this issue (I could not open them from the doc) rather than in the doc, and then post a note to dev list pointing at these findings since we are going to build on top of them going forward (we should put it on hbase blog?). Very nice.

          Here are some comments on the doc that might help w/ the edit.

          Suggest you add sentence after first on why we want to go offheap (this will make the doc 'standalone')

          Change: "When we support E2E off heap support and in turn support Cell also backed by off heap memory, we have to make sure to select the best performing data structure/framework for this off heap storage."

          to

          "When we implement E2E off heap support, we have to make sure to select the best performing data structure/framework."

          s/below test is/below tests are/

          s/pros like, our PRC layer/pros such as our RPC layer already/

          s/HDFS/an HDFS/

          Change "But the NIO Buffer APIs have complaints over its performance and methods not inlineable"

          to

          But NIO ByteBuffers can be slow (boundary checks and/or some methods may not inline).

          Change: "This make us to think for netty also as it seems better performing. (Really? We have test results below)"

          to

          "This makes us look to Netty ByteBuf as a possibly better performing alternative."

          On first test, add sentence comparing difference between onheap and offheap runs (onheap looks 30% slower compared to onheap?) Do same comparing jdk8 to jdk7? (Hmm... yeah, would like to see the code... smile)

          s/Similar way/Similarly/

          s/The next test cmopare/The next test compares/

          s/come almost/come out almost/

          s/test what I/test that I/
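
          For readers without the doc, the benchmarks referred to above are jmh microbenchmarks comparing onheap and offheap ByteBuffer reads (and Netty ByteBuf). A minimal jmh harness of that shape is sketched below; the class name and block size are assumptions for illustration, not the actual benchmark code attached to this issue:

          import java.nio.ByteBuffer;
          import java.util.concurrent.TimeUnit;

          import org.openjdk.jmh.annotations.Benchmark;
          import org.openjdk.jmh.annotations.BenchmarkMode;
          import org.openjdk.jmh.annotations.Mode;
          import org.openjdk.jmh.annotations.OutputTimeUnit;
          import org.openjdk.jmh.annotations.Scope;
          import org.openjdk.jmh.annotations.Setup;
          import org.openjdk.jmh.annotations.State;

          @BenchmarkMode(Mode.AverageTime)
          @OutputTimeUnit(TimeUnit.NANOSECONDS)
          @State(Scope.Thread)
          public class ByteBufferReadBench {
            private static final int SIZE = 64 * 1024;   // roughly one HFile block

            private ByteBuffer onheap;
            private ByteBuffer offheap;

            @Setup
            public void setup() {
              onheap = ByteBuffer.allocate(SIZE);
              offheap = ByteBuffer.allocateDirect(SIZE);
            }

            @Benchmark
            public long readOnheap() {
              long sum = 0;
              for (int i = 0; i + 8 <= SIZE; i += 8) {
                sum += onheap.getLong(i);   // absolute gets, no position mutation
              }
              return sum;
            }

            @Benchmark
            public long readOffheap() {
              long sum = 0;
              for (int i = 0; i + 8 <= SIZE; i += 8) {
                sum += offheap.getLong(i);
              }
              return sum;
            }
          }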

          stack stack added a comment -

          Anoop Sam John and ramkrishna.s.vasudevan as per chat this morning, post the original word doc and I'll stick it up in a shared google doc so we can all comment/bang on it... or if you are able, you post it as a google doc. Thanks lads.

          anoop.hbase Anoop Sam John added a comment -

          https://docs.google.com/document/d/1WHLYmccHw28itox4qdeXRgH5SZkt_zruSzngcmmBiUs/edit?usp=sharing

          hudson Hudson added a comment -

          FAILURE: Integrated in HBase-TRUNK #6701 (See https://builds.apache.org/job/HBase-TRUNK/6701/)
          HBASE-14188 - Read path optimizations after HBASE-11425 profiling (Ram) (ramkrishna: rev 7a9e10dc11877420c53245c403897d746bebc077)

          • hbase-common/src/main/java/org/apache/hadoop/hbase/OffheapKeyValue.java
          • hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderImpl.java
          • hbase-common/src/test/java/org/apache/hadoop/hbase/TestOffheapKeyValue.java
          • hbase-server/src/test/java/org/apache/hadoop/hbase/filter/FilterAllFilter.java
          • hbase-server/src/main/java/org/apache/hadoop/hbase/SizeCachedNoTagsKeyValue.java
          • hbase-server/src/main/java/org/apache/hadoop/hbase/SizeCachedKeyValue.java
          • hbase-common/src/main/java/org/apache/hadoop/hbase/nio/MultiByteBuff.java
          hudson Hudson added a comment -

          FAILURE: Integrated in HBase-TRUNK #6717 (See https://builds.apache.org/job/HBase-TRUNK/6717/)
          HBASE-14188- Read path optimizations after HBASE-11425 profiling- (ramkrishna: rev aa3538f80278f5c0ba1bc8ca903066fa02ac79ec)

          • hbase-server/src/test/java/org/apache/hadoop/hbase/filter/FilterAllFilter.java
          stack stack added a comment -

          Some coarse graphs from running YCSB workload c (totally random read) for an hour with 100 clients against a dataset that is fully cached, hosted on one server. The first run is against a RS that is using the default, onheap cache. The second is using bucketcache.

          I see that the work here makes it so that using the bucketcache gives the same latency and throughput (perhaps a little less throughput) as serving everything from onheap (recall that in earlier tests, bucketcache was best if there were cache misses; if you could serve everything from heap, onheap had a much nicer profile). To me, this makes it possible to run with the bucketcache all the time, whether serving everything from cache or when there are cache misses (recall, bucketcache did better when there were cache misses; I have not looked to see if this work improves on what we saw previously).

          More testing to follow (a redo of our block cache comparisons post might be in order).

          The graphs are the basic gc profile (this is CMS), gets per second, the median (the 75th and 95th percentiles weren't showing up for some reason... need to dig in... hopefully it's because their incidence was low...), and overall loading and seeks.

          Offheap puts a little more load on the system, has a better GC profile, and gives slightly less throughput.

          stack stack added a comment -

          Have you fellas run this for a while? I seem to OOME easily enough. I tried with a small heap, 4G, but it OOME'd, and then I had to keep going back up to 16G again even though I had a big offheap. I tried to capture the increasing use of heap but this diagram is the best I got... I've not done heap analysis.. but in the diagram you can see heap use start to rise and then plummet... now I am crawling, doing Full GCs every couple of seconds.

          The last meaningful log line was this, in the regionserver log:

          2015-10-14 16:51:10,058 DEBUG [main-BucketCacheWriter-2] bucket.BucketCache: This block 3f0157e7daee45fdb25202c496c95c46_1649898813 is still referred by 1 readers. Can not be freed now

          It started complaining 50 minutes ago...

          ram_krish ramkrishna.s.vasudevan added a comment -

          Have you fellas run this for a while? I seem to OOME easily enough. I tried with a small heap, 4G, but it OOME'd, and then I had to keep going back up to 16G again even though I had a big offheap. I tried to capture the increasing use of heap but this diagram is the best I got... I've not done heap analysis.. but in the diagram you can see heap use start to rise and then plummet... now I am crawling, doing Full GCs every couple of seconds.

          Argh!! We have run some read-related workloads for an hour or so but did not observe any OOMEs, though our heap size was set at 32G. That was with YCSB, and we did not observe any full GCs.
          In the case of the PE tool we used only a 9G heap and a 16G offheap size, but did not observe any fluctuations in the heap usage.

          2015-10-14 16:51:10,058 DEBUG [main-BucketCacheWriter-2] bucket.BucketCache: This block 3f0157e7daee45fdb25202c496c95c46_1649898813 is still referred by 1 readers. Can not be freed now

          Maybe because of the OOME some error handling is not done. Let us install a cluster internally to test this.

          ram_krish ramkrishna.s.vasudevan added a comment -

          BTW, are you seeing this behaviour while running a pure read workload, i.e. YCSB workload c?

          stack stack added a comment -

          Workload c. Try 4g. I don't see why 4g should not be enough when 7//regions and 8g of offheap and all cache hits. I could be wrong. I have not dug in...

          ram_krish ramkrishna.s.vasudevan added a comment -

          Tried out the experiments in a new, single-node cluster.
          Loaded around 15G of data with 10 regions.
          Initially configured 4G of heap space, 5G of offheap space and a bucket cache size of 4G.
          Ran a pure read workload c for 30 mins with 50 threads. I am not running into the OOME, and the block eviction part is also fine. The bigger GCs are around 450ms to 500ms. Repeated the same experiment with 10G of heap space also. With this configuration we are sure that evictions are happening from the bucket cache. Attaching a GC snapshot of 5 mins captured during the workload test.
          Stack,
          Also, in your experiment I think your data does not fit into the bucket cache and hence it is trying to evict. Or, if it is fitting into the bucket cache, probably there was a file that was being compacted and a reader referencing it, and because of the OOME the ref count decrement did not happen and the forceful eviction was failing. Will keep checking this. Can you attach any logs from when this happened? Easier to debug (hopefully).

          ram_krish ramkrishna.s.vasudevan added a comment -

          Just checking the code

          Or, if it is fitting into the bucket cache, probably there was a file that was being compacted and a reader referencing it, and because of the OOME the ref count decrement did not happen and the forceful eviction was failing.

          This should not be the case. Anyway, logs will be better. Trying to reproduce this case if possible.

          carp84 Yu Li added a comment -

          From the "Offheap reads in HBase using BBs_V2.pdf" doc, we could see a kind of big perf gap between BucketCache and LRUCache, excerpt as below:

          Multiple Gets with 25 threads
          
                          AvgRT     95th   99th
          BucketCache     111.81    130    133
          LRUCache        23.49     34     39
          

          But from the latest comments here it seems this data is out of date, right? Mind updating the doc with the latest perf data? Thanks.

          ram_krish ramkrishna.s.vasudevan added a comment -

          Yes, we can. We will test once more with LRU and bucket cache and see how much difference we have. Thanks.

          ram_krish ramkrishna.s.vasudevan added a comment - - edited

          I could run with even 2G of heap without OOME.

          anoop.hbase Anoop Sam John added a comment -

          Yu Li
          I have done a quick multi-get test using the PE tool default settings (i.e. 1 GB of data).
          Multi-get with 100 rows and 25 client threads, each doing the op 100000 times.
          Avg completion time for each thread:
          On heap LRU Cache (L1): 9492ms
          Off heap Bucket Cache (L2): 9596ms
          So it is almost the same.

          carp84 Yu Li added a comment -

          Thanks for the update Anoop Sam John, the numbers look great, nice work!

          It's a pity that we have to wait until 2.0 is released to take advantage of this, maybe months away...

          stack stack added a comment -

          Seems easy for me to reproduce. Just happened again. Here is the log.

          stack stack added a comment -

          The log is not that interesting. It does not even have the eviction issue. Just shows us struggling w/ GC. Finally we OOME here:

          2015-10-16T13:16:42.032-0700: [Full GC (Allocation Failure) 2015-10-16T13:16:42.032-0700: [CMS: 15276447K->15276422K(15276480K), 4.9700400 secs] 16273215K->16273119K(16273280K), [Metaspace: 48934K->48934K(1093632K)], 4.9774537 secs] [Times: user=4.97 sys=0.00, real=4.98 secs]
          2015-10-16T13:16:47.012-0700: [Full GC (Allocation Failure) 2015-10-16T13:16:47.012-0700: [CMS: 15276422K->15276422K(15276480K), 1.2150090 secs] 16273204K->16273186K(16273280K), [Metaspace: 48901K->48901K(1093632K)], 1.2151393 secs] [Times: user=1.21 sys=0.00, real=1.22 secs]
          2015-10-16T13:16:48.227-0700: [Full GC (Allocation Failure) 2015-10-16T13:16:48.227-0700: [CMS: 15276422K->15276422K(15276480K), 1.2185531 secs] 16273186K->16273186K(16273280K), [Metaspace: 48901K->48901K(1093632K)], 1.2186671 secs] [Times: user=1.22 sys=0.00, real=1.21 secs]
          # java.lang.OutOfMemoryError: Java heap space
          # -XX:OnOutOfMemoryError="kill -9 %p"
          #   Executing /bin/sh -c "kill -9 16941"...

          Configs are this:

          # The maximum amount of heap to use, in MB. Default is 1000.
          export HBASE_HEAPSIZE=16000

          export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=16g"

          <property>
          <name>hbase.bucketcache.ioengine</name>
          <value>offheap</value>
          </property>
          <property>
          <name>hbase.bucketcache.size</name>
          <value>8196</value>
          </property>
          <property>
          <name>hfile.block.cache.size</name>
          <value>0.1</value>
          </property>
          </configuration>

          Looking at UI, hardly any meta blocks in L1... a couple of hundred.

          stack stack added a comment -

          My master branch is at this stage:

          commit 1458798eb593358fe5415596b2958f2f7e451ea5
          Author: stack <stack@apache.org>
          Date: Tue Oct 13 15:16:57 2015 -0700

          HBASE-14600 Make #testWalRollOnLowReplication looser still

          stack stack added a comment -

          Does this help? Looks like 7.5G retained by HFileScannerImpl in arraylist.

          anoop.hbase Anoop Sam John added a comment -

          Is the log from the time of workload C (pure reads alone)? I can see a lot of flush and compaction logs in it. Can you give the exact steps to reproduce, boss? That will help us reproduce and fix it.

          What we do is:
          Start the cluster and pump in the whole dataset. Then stop the cluster so that all the data is flushed and on disk.
          Start the cluster again. Do a full table scan so that the entire dataset is read once and gets cached in the BC.
          Now run YCSB workload C.

          stack stack added a comment -

          This is a different loading. This is me trying to run a suite of ycsb load and workload runs, but I'd forgotten to disable the bucketcache.

          It should not OOME. If it does, it's broke?

          See the attached image for what is holding objects.

          ram_krish ramkrishna.s.vasudevan added a comment - - edited

          In the latest tests we did, we ensured that after the loading was done we simply ran the YCSB read workloads, loading the block cache while at the same time getting lots of evictions, since the configured bucket cache is smaller in size than the available data.

          anoop.hbase Anoop Sam John added a comment -

          One doubt after seeing the logs... does the OOME come at a time when some compaction is happening?
          There was one open item we discussed around the shipped() call during compaction. This API gets called in the scan flow after we ship a set of rows back to the client, so all the blocks we came across during the scan, other than the current block, can get released (ref count decrements).
          But during compaction this call is not happening at all; only one close happens at the end. We can correct this. I have a quick patch for that. Would you be interested in seeing it and testing with it once, boss?
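
          For illustration, the ref counting protocol being described (retain a block when a scanner picks it up, release it on shipped()/close()) is sketched below. The class and method names are assumptions for this sketch, not the actual code in the patch:

          import java.util.ArrayList;
          import java.util.List;
          import java.util.concurrent.atomic.AtomicInteger;

          // Hypothetical sketch of the block ref counting discussed above.
          class RefCountedBlock {
            private final AtomicInteger refCount = new AtomicInteger(0);

            void retain() { refCount.incrementAndGet(); }         // a scanner starts using the block
            void release() { refCount.decrementAndGet(); }        // the scanner is done with it
            boolean evictable() { return refCount.get() == 0; }   // the cache may free/evict only then
          }

          // Scanner-side bookkeeping: blocks touched since the last shipped()/close().
          class ScannerBlockTracker {
            private final List<RefCountedBlock> checkedOut = new ArrayList<>();

            void onBlockUsed(RefCountedBlock block) {
              block.retain();
              checkedOut.add(block);
            }

            // Called after a batch of rows is shipped to the client in the scan flow;
            // the compaction flow described above was only doing this at its final close().
            void shipped() {
              for (RefCountedBlock b : checkedOut) {
                b.release();
              }
              checkedOut.clear();
            }
          }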

          anoop.hbase Anoop Sam John added a comment -

          It should not OOME. If it does, it's broke?

          Fully agree. We need to fix this and it is a critical bug. I was just trying to understand the flow so that we can reproduce it easily here.

          ram_krish ramkrishna.s.vasudevan added a comment -

          Moved the work for the Dictionary (HBASE-14841) and Prefix Tree (HBASE-14842) to work with ByteBuffers out to HBASE-15179, as it is mostly concerned with writes. Hence resolving this parent JIRA as fixed.

          stack stack added a comment -

          Hurray!


            People

            • Assignee:
              anoopsamjohn Anoop Sam John
              Reporter:
              anoop.hbase Anoop Sam John