Details

    • Type: Umbrella
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Even with the improved G1 GC in Java 7, Java processes that want to address large regions of memory while also providing low high-percentile latencies continue to be challenged. Fundamentally, a Java server process that has high data throughput and also tight latency SLAs will be stymied by the fact that the JVM does not provide a fully concurrent collector. There is simply not enough throughput to copy data during GC under safepoint (all application threads suspended) within available time bounds. This is increasingly an issue for HBase users operating under dual pressures: 1. tight response SLAs, 2. the increasing amount of RAM available in "commodity" server configurations, because GC load is roughly proportional to heap size.

      We can address this using parallel strategies. We should talk with the Java platform developer community about the possibility of a fully concurrent collector appearing in OpenJDK somehow. Setting aside the question of whether this is too little too late, if one becomes available the benefit will be immediate, though subject to qualification for production, and transparent in terms of code changes. However, in the meantime we need an answer for Java versions already in production. This requires we move the large arena allocations off heap, those being the blockcache and memstore. On other JIRAs recently there has been related discussion about combining the blockcache and memstore (HBASE-9399) and about flushing memstore into blockcache (HBASE-5311), which is related work.

      We should build off-heap allocation for memstore and blockcache, perhaps a unified pool for both, and plumb through zero-copy direct access to these allocations (via direct buffers) through the read and write I/O paths. This may require the construction of classes that provide object views over data contained within direct buffers. This is something else we could talk with the Java platform developer community about: it could be possible to provide language-level object views over off-heap memory, where on-heap objects could hold references to objects backed by off-heap memory but not vice versa, maybe facilitated by new intrinsics in Unsafe. Again, we need an answer for today also. We should investigate what existing libraries may be available in this regard.

      Key will be avoiding marshalling/unmarshalling costs. At most we should be copying primitives out of the direct buffers to register or stack locations until finally copying data to construct protobuf Messages. A related issue there is HBASE-9794, which proposes scatter-gather access to KeyValues when constructing RPC messages. We should see how far we can get with that, and also with zero-copy construction of protobuf Messages backed by direct buffer allocations. Some amount of native code may be required.
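      The "object views over data contained within direct buffers" idea can be sketched as a flyweight: a reusable on-heap view object that reads primitives in place from a direct ByteBuffer, copying bytes out only when an on-heap array is unavoidable. The class name and record layout below are hypothetical, for illustration only:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: a flyweight "object view" over a record stored in a
// direct buffer. Field accessors read primitives in place; no per-record
// on-heap object holds the data itself.
final class CellView {
    // Assumed layout: [keyLen:int][valueLen:int][key bytes][value bytes]
    private ByteBuffer buf;   // shared direct buffer
    private int offset;       // start of this cell within the buffer

    CellView wrap(ByteBuffer buf, int offset) {
        this.buf = buf;
        this.offset = offset;
        return this;          // one view can be re-pointed at many cells
    }

    int keyLength()   { return buf.getInt(offset); }      // absolute reads,
    int valueLength() { return buf.getInt(offset + 4); }  // no copying

    // Copy only the key bytes out when an on-heap array is unavoidable.
    byte[] copyKey() {
        byte[] key = new byte[keyLength()];
        ByteBuffer dup = buf.duplicate();  // independent position/limit
        dup.position(offset + 8);
        dup.get(key);
        return key;
    }
}
```

One view instance can be rewound over an entire cellblock, so the number of on-heap objects stays constant regardless of how many cells the buffer holds.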

        Issue Links

          Activity

          Nick Dimiduk added a comment -

          Apparently you're reading my mind

          Nicely articulated Andrew Purtell. I'd like to see a body of evidence that points to specific components which make meaningful sense for moving off-heap. Memstore and BlockCache are commonly cited as the offending components, but I've not seen anyone present conclusive profiling results making this clear. Nor is there clear advice regarding at what point a heap becomes too large. I've started work to track down some real data here on both of these points before pressing forward with recommendations.

          See also Nicolas Liochon's recent profiling work reducing the GC burden imposed by the protobuf RPC implementation. This is an example where a major offender isn't on the above short-list. I am excited to work toward and experiment with an entirely off-heap data flow, at least for the read path (HDFS -> BlockCache -> RPC send buffer)!

          Andrew Purtell added a comment - - edited

          Memstore and BlockCache are commonly cited as the offending components, but I've not seen anyone present conclusive profiling results making this clear

          It's abundantly clear once using heaps larger than ~8 GB that collection pauses under safepoint blow out latency SLAs at the high percentiles. Why would we need heaps larger than this? To take direct advantage of large server RAM. Memstore and blockcache are then the largest allocators of heap memory. If we move them off heap, they can "soak up" most of the available RAM, leaving remaining heap demand relatively small - this is the idea.

          Edit: Phrasing

          Vladimir Rodionov added a comment -

          This will require redesigning the whole data flow in HBase. Currently, the minimum (and maximum) data exchange element in HBase's internal pipeline is KeyValue, which is a heavy, on-heap (byte-array-backed) data structure. Moving data allocations off heap is half the problem; the other half is avoiding copy-on-read and copy-on-write (from/to off heap). Serialization is quite expensive.

          Vladimir Rodionov added a comment -

          It's abundantly clear once using heaps larger than ~8 GB that collection pauses under safepoint blow out latency SLAs at the high percentiles.

          What HBase version are you using? No bucket cache yet?

          Andrew Purtell added a comment -

          What HBase version are you using? No bucket cache yet?

          Trunk, what is now 0.98.

          As you point out above, serialization/deserialization costs limit the bucket cache, which is why I propose the goal of direct operation on allocations backed by off-heap memory. This has to be approached in stages.

          The bucket cache encourages looking at this approach. Although you'll see reduced throughput, it will smooth out the latency tail and allow the blockcache to address RAM without increasing heap size, which also helps smooth out the latency tail with respect to collection pause distribution. However, with large heaps, e.g. 128+ GB, mixed-generation collections exceeding the ZooKeeper heartbeat timeout are inevitable under mixed read+write load; nothing I have found mitigates that sufficiently.

          Vladimir Rodionov added a comment -

          The bucket cache encourages looking at this approach. Although you'll see reduced throughput, it will smooth out the latency tail and allow the blockcache to address RAM without increasing heap size, which also helps smooth out the latency tail with respect to collection pause distribution. However, with large heaps, e.g. 128+ GB, mixed-generation collections exceeding the ZooKeeper heartbeat timeout are inevitable under mixed read+write load; nothing I have found mitigates that sufficiently.

          It looks like you have done some bucket cache research and tests. Are there any numbers available? We are considering upgrading to the 0.96 release, and the bucket cache is the major attraction for us. According to you, it's not that usable, or it does not give any performance advantage? I really doubt that an 80 GB on-heap block cache is a viable alternative to an off-heap cache in a mixed read/write load scenario, even in Java 7 with G1.

          One thing to note: having a serialization barrier has one huge advantage over direct off-heap access. You can compress blocks off heap. For our application the compression ratio is close to 4.

          Andrew Purtell added a comment -

          It is on my to do list to produce a technical report, but my time is quite constrained and that item is not close to the top of the list. As always, you should evaluate HBase using your application and environment. You may be quite happy with 0.96, with or without the bucket cache.

          having a serialization barrier has one huge advantage over direct off-heap access. You can compress blocks off heap

          That's a great point. I would actually like to operate on an encoded block representation from disk to socket. This is a trick in-memory databases have been using for years, and it will let us push through the memory wall, but that is several steps down a long road. The scope of this JIRA is described in the 'Description' field above.

          stack added a comment -

          Andrew Purtell There are a couple of off-heap experiments ongoing. This JIRA covers memstore and blockcache allocations. Seems like we need a larger umbrella issue than this allows? If you agree I'll open one, because it would be useful to be able to tie all efforts together. Good on you Andrew Purtell

          Andrew Purtell added a comment -

          If you want to reparent this somewhere that's fine with me stack. We're going to start with memstore and blockcache (likely a unified pool) and go from there based on results. If there are other things going on, it would be good to put them all together so we can try to coordinate.

          Matt Corgan added a comment -

          Something to keep in mind is that GC pauses can be influenced as much or more by the number of live objects as by the raw size of the heap. 32 GB of block cache could be made up of only ~1 million 32 KB blocks. That particular 32 GB of memory may not stop the world for very long. It's all the small remaining objects that keep the garbage collector busy, and I bet the biggest culprit here is the individual KeyValues in the memstores.

          MemstoreLAB combines the backing arrays into big chunks to reduce heap fragmentation, but there is still one object per KeyValue, and each object needs to be considered by the collector. A big heap has big memstores, which have lots of KeyValues - possibly far more than the ~1 million blocks in the block cache. A big advantage of flattening the memstores into blocks of key values is that you might be reducing ~500 KeyValues to a single block object. This 500x reduction in objects strikes me as a significant GC pause improvement that is independent from off-heap techniques.
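          The flattening argument can be illustrated with a toy sketch (hypothetical names, not HBase code): N cells packed into one backing buffer present the collector with a single live object instead of N, even though the same bytes are stored:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: flatten many cells into one backing block so the
// collector traces one object instead of one object per cell.
final class FlatCellBlock {
    private final ByteBuffer block;   // single allocation for all cells
    private final int cellCount;

    FlatCellBlock(List<byte[]> cells) {
        int total = 0;
        for (byte[] c : cells) total += 4 + c.length;   // [len:int][bytes]
        ByteBuffer b = ByteBuffer.allocate(total);      // one object, N cells
        for (byte[] c : cells) b.putInt(c.length).put(c);
        b.flip();
        this.block = b;
        this.cellCount = cells.size();
    }

    int cellCount() { return cellCount; }

    // Iterate cells in place; copies happen only when handing bytes out.
    List<byte[]> scan() {
        List<byte[]> out = new ArrayList<>(cellCount);
        ByteBuffer b = block.duplicate();
        while (b.hasRemaining()) {
            byte[] cell = new byte[b.getInt()];
            b.get(cell);
            out.add(cell);
        }
        return out;
    }
}
```

Here 500 cells cost the collector one traced object plus whatever the scan copies out, versus 500 traced objects for a KeyValue-per-object layout.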


          Moving blocks off-heap and operating on them directly will be very cool. DataBlockEncoders should be able to read off-heap blocks similarly to how they do now, namely, copying only the modified bytes from the previous cell into an array buffer. Vladimir makes a good point that it would be tough to match the scan performance of unencoded data, so that would need some thinking.

          stack added a comment -

          If we supplied DFSClient our own DBB, then maybe we could read from dfs and put into an offheap blockcache w/o going over the heap (see HDFS-2834 "ByteBuffer-based read API for DFSInputStream")

          Andrew Purtell added a comment - - edited

          I'm looking at Netty 4's netty-buffer module (http://netty.io/4.0/api/io/netty/buffer/package-summary.html), which has some nice properties, including composite buffers, arena allocation, dynamic buffer resizing, and reference counting, not to mention development and testing by another community. I also like it because you can plug in your own allocators and specialize the abstract ByteBuf base type. More on this later.

          When I get closer to seeing what exactly needs to be done I will post a design doc. Current thinking follows. Below the term 'buffer' currently means Netty ByteBufs or derived classes backed by off-heap allocated direct buffers.

          Write

          When coming in from RPC, cells are laid out by codecs into cellblocks in buffers, and the cellblocks/buffers are handed to the memstore. Netty's allocation arenas replace the MemstoreLAB. The memstore data structure evolves into an index over cellblocks.

          Per Matt Corgan's comment above, we should think about how the memstore index can be built with fewer object allocations than the number of cells in the memstore, yet be in the ballpark with efficiency of concurrent access. A tall order. CSLM wouldn't be the right choice as it allocates at least one list entry per key, but we could punt and use it initially and make a replacement datastructure as a follow on task.
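          A minimal sketch of that punt: a CSLM index mapping on-heap keys (one entry per cell, the acknowledged cost) to coordinates into off-heap cellblocks. The Coord holder is hypothetical; the unsigned key ordering uses the JDK's Arrays.compareUnsigned:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of the "punt" design: a CSLM index over off-heap cellblocks.
// Keys live on heap; values are just (block, offset) coordinates into
// direct buffers. Names are illustrative.
final class CellblockIndex {
    static final class Coord {
        final ByteBuffer block;   // which cellblock (a direct buffer)
        final int offset;         // where the cell starts within it
        Coord(ByteBuffer block, int offset) {
            this.block = block;
            this.offset = offset;
        }
    }

    // Unsigned lexicographic byte comparison, as HBase key ordering requires.
    private final ConcurrentSkipListMap<byte[], Coord> index =
        new ConcurrentSkipListMap<>(Arrays::compareUnsigned);

    void put(byte[] key, ByteBuffer block, int offset) {
        index.put(key, new Coord(block, offset));
    }

    Coord floor(byte[] key) {   // seek-to-or-before, for scanner positioning
        java.util.Map.Entry<byte[], Coord> e = index.floorEntry(key);
        return e == null ? null : e.getValue();
    }

    int size() { return index.size(); }
}
```

The follow-on task would be replacing this with a structure that allocates fewer than one entry per cell while keeping comparable concurrent-access efficiency.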

          Cellblocks in memstore should be amenable to flushing to disk as a gathering write. This may mean cellblocks have the same internal structure as HFile blocks and we reuse all of the block encoder machinery (and simplify them in the process).

          Read

          We feed down buffers to HDFS to fill with file block data. We pick which pool to get a buffer from for a read depending on family caching strategy. Pools could be backed by arenas that match up with LRU policy strata, with a common pool/arena for noncaching reads. (Or for noncaching reads, can we optionally use a new API for getting buffers up from HDFS, perhaps backed by the pinned shared RAM cache, since we know we will be referring to the contents only briefly?) It will be important to get reference counting right as we will be servicing scans while attempting to evict. Related, eviction of a block may not immediately return a buffer to a pool, if there is more than one block in a buffer.
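          The reference-counting requirement can be sketched as follows. The retain/release shape mirrors Netty-style counting, but the class and pool here are illustrative, not a real HBase or Netty API: an evicted block's buffer must not return to the pool while a scanner still holds a reference.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a reference-counted direct buffer that returns itself
// to a pool only when the last holder (cache or scanner) releases it.
final class RefCountedBuffer {
    final ByteBuffer buf;
    private final AtomicInteger refCnt = new AtomicInteger(1);
    private final BlockingQueue<ByteBuffer> pool;

    RefCountedBuffer(ByteBuffer buf, BlockingQueue<ByteBuffer> pool) {
        this.buf = buf;
        this.pool = pool;
    }

    RefCountedBuffer retain() {   // a scanner takes a reference
        if (refCnt.getAndIncrement() <= 0) {
            throw new IllegalStateException("already released");
        }
        return this;
    }

    // Returns true when this release actually recycled the buffer.
    boolean release() {
        if (refCnt.decrementAndGet() == 0) {
            pool.offer(buf);      // only now is the memory reusable
            return true;
        }
        return false;
    }

    int refCnt() { return refCnt.get(); }
}
```

Eviction then just calls release(); if a scan is in flight the buffer survives until the scanner's own release, which is exactly the ordering the eviction-while-scanning case needs.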

          We maintain new metrics on numbers of buffers allocated, stats on arenas, stats on wastage and internal fragmentation of the buffers, etc, and use these to guide optimizations and refinements.

          This should require fewer changes than the write side since we are already set up for dealing with cellblocks. Design points to optimize: minimizing the number and size of data copies, minimizing the number of on-heap object allocations, and an on-disk encoding suitable as an efficient in-memory representation.

          Lars Hofhansl added a comment - - edited

          My office neighbor used to work on a proprietary Java database, and he says they used 128GB or even 192GB Java heaps and larger all the time without any significant GC impact.

          (non moving) Collection times are not a function of the heap size but rather of heap complexity, i.e. the number of objects to track (HBase also produces a lot of garbage, but that is short lived and can be quickly collected by a moving collector for the young gen).
          With MemstoreLAB and the block cache, HBase already does a good job on this. Even as things stand currently, if we fill an entire 128 GB heap with 64k blocks from the blockcache, that would only be about 2m objects.
          Now, if we want latencies in the < 100 ms area, we need to rethink things; that will generally be very difficult in current Java.
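          The 2m-objects estimate checks out arithmetically:

```java
// Quick check of the estimate: a 128 GB heap filled entirely with
// 64 KB blocks holds about two million blocks (traced objects).
class BlockCountEstimate {
    public static void main(String[] args) {
        long heapBytes = 128L * 1024 * 1024 * 1024;   // 128 GiB
        long blockBytes = 64L * 1024;                 // 64 KiB per block
        long blocks = heapBytes / blockBytes;
        System.out.println(blocks);                   // 2097152, i.e. ~2m
    }
}
```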

          Rather than moving everything out of the Java heap all-or-nothing, we should also investigate whether we can make the GC's life easier.

          Edit: Edited for clarity.

          Andrew Purtell added a comment -

          I intend to prototype something so we don't have to argue supposition.

          Yes, enabling sub-100 ms collections at the 95th or 99th percentile is an important consideration. We also want to consider addressing up to 1 TB of usable memory without loading up cores with redundant work / multiple processes.

          Some GC overheads are a linear function of the heap size, at least for G1.

          Lars Hofhansl added a comment -

          Yeah, was talking about CMS and definitely less than 1TB.

          Please do not read my comment as criticism, this is very important work.
          No doubt you can drive max latency down significantly by going off heap; at the same time there are probably a lot of further improvements we can make to current HBase in the heap allocation area.

          Vladimir Rodionov added a comment -

          We also want to consider addressing up to 1 TB of usable memory without loading up cores with redundant work / multiple processes.

          6TB of RAM.
          http://www.supermicro.nl/newsroom/pressreleases/2014/press140218_4U_4-Way.cfm

          Collection times are not a function of the heap size but rather of heap complexity, i.e. the number of objects to track

          Heap compaction is a function of heap size (at least in CMS).

          Lars Hofhansl added a comment -

          Heap compaction is a function of a heap size (at least in CMS).

          Not to start a long, tangential argument here... Last I looked CMS was non-compacting, and thus the only relevant metric is the number of objects to trace, not their size. A 100G heap with 10000 objects is far easier to manage than a 100G heap with 100 million objects.

          Vladimir Rodionov added a comment -

          Right, CMS is not compacting, but nevertheless compaction happens from time to time (full GC), and that is a function of heap size.

          Lars Hofhansl added a comment -

          (Not necessarily: if all objects are of roughly the same size then you will never need a full GC.)

          In any case, nobody is arguing (at least I am not) that 1T or more (6T? Wow) should be managed off-heap with contemporary Hotspot JVMs. I'm looking forward to what Andrew and folks will produce here.

          Matt Corgan added a comment -

          I hate to continue the tangent, but I'd add that even the occasional compaction that CMS triggers depends on how many objects need to be compacted. "Random" access memory isn't as random anymore: there are enormous speed boosts when copying long swaths of sequential memory, so compacting 100 1 GB slabs should be far faster than compacting 1 billion 100 B KeyValues that are scattered around the heap. I also wonder if there's a slab size big enough that hotspot won't bother moving it during a compaction (but I have no idea).

          Separately, one of the reasons Nick and I thought ByteRange should be an interface was that we could back it with varying implementations including arrays, HeapByteBuffers, DirectByteBuffers, netty ByteBufs, etc. A utility similar to IOUtils.copy could help optimizing the copies between the different implementations. Another advantage of using it as the primary interface is that its internal compareTo method uses hbase-friendly unsigned byte comparison, making it easy to put ByteRanges into traditional sorted collections like TreeSet/CSLM without passing an external comparator.

          I could see using an allocator based on huge on or off-heap slabs where smaller pages/blocks are referenced by reusable ByteRanges. The allocator could recycle memory by continuously picking the least utilized slab and copying (moving) its occupied ByteRanges to the slab at the head of the queue. This would provide constant compaction via fast sequential copying.
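          The ByteRange-as-interface idea might look roughly like this; the names and methods below are illustrative, not HBase's actual ByteRange API. One abstraction covers arrays and (direct) ByteBuffers, with HBase-friendly unsigned comparison built in so instances drop straight into TreeSet/CSLM without an external comparator:

```java
import java.nio.ByteBuffer;

// Hedged sketch of ByteRange as an interface with pluggable backings.
interface ByteRange extends Comparable<ByteRange> {
    int length();
    byte get(int index);

    // Unsigned lexicographic comparison, the ordering HBase keys need.
    default int compareTo(ByteRange other) {
        int n = Math.min(length(), other.length());
        for (int i = 0; i < n; i++) {
            int d = (get(i) & 0xFF) - (other.get(i) & 0xFF);
            if (d != 0) return d;
        }
        return length() - other.length();
    }
}

// Backed by a plain byte array (possibly a slice of a larger slab).
final class ArrayByteRange implements ByteRange {
    private final byte[] bytes;
    private final int offset, length;
    ArrayByteRange(byte[] bytes, int offset, int length) {
        this.bytes = bytes; this.offset = offset; this.length = length;
    }
    public int length() { return length; }
    public byte get(int i) { return bytes[offset + i]; }
}

// Backed by a ByteBuffer, which may be a DirectByteBuffer off heap.
final class BufferByteRange implements ByteRange {
    private final ByteBuffer buf;
    BufferByteRange(ByteBuffer buf) { this.buf = buf; }
    public int length() { return buf.remaining(); }
    public byte get(int i) { return buf.get(buf.position() + i); }
}
```

A copy utility in the spirit of IOUtils.copy could then specialize on the concrete backing pair (array-to-array, buffer-to-buffer, mixed) to pick the fastest bulk transfer.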

          stack added a comment -

          (Good discussion going on here)

          How then to have KeyValues/Cells w/o calling them out as individual objects? Keep cellblocks of KeyValues/Cells w/ a CellScanner to read over 64k blocks of them? For MemStore, once we hit some upper bound – say 64k, 1M? – 'flush' it to an inmemory, sorted, cellblock? Reading, we'd consult the (small) CSLM memstore and some tiering of cellblocks?

          Lars Hofhansl added a comment -

          HBASE-5311 and HBASE-9440 have related discussion. If we're smart we can build all these things such that they work both on- and off-heap.

          Andrew Purtell added a comment -

          (Matt Corgan) I could see using an allocator based on huge on or off-heap slabs where smaller pages/blocks are referenced by reusable ByteRanges. The allocator could recycle memory by continuously picking the least utilized slab and copying (moving) its occupied ByteRanges to the slab at the head of the queue. This would provide constant compaction via fast sequential copying.

          We could make the investment of writing our own slab allocator. Experiments with Netty 4 ByteBufs are in part about seeing if we can re-use open source in production already rather than redo the work. On the other hand, it could be a crucial component so maybe it's necessary to have complete control. Perhaps we can move additional comments on this sub-topic over to HBASE-10573?

          Matt Corgan added a comment -

          How then to have KeyValues/Cells w/o calling them out as individual objects? .... For MemStore, once we hit some upper bound – say 64k, 1M? – 'flush' it to an inmemory, sorted, cellblock? Reading, we'd consult the (small) CSLM memstore and some tiering of cellblocks?

          I think there's been talk of this before, and it makes sense to me. It's basically creating small in-memory HFiles that can be compacted several times in memory without going to disk, and holding on to the WAL entries until they do go to disk. We'd get huge space savings from the reduction in objects, references, and repetition via block encoding. The problem is that if you have hundreds of 1MB in-memory HFiles, then it becomes too expensive to merge them all (via KVHeap) when scanning. A possible solution is to subdivide the memstore into "stripes" (probably smaller than the stripe compaction stripes) and periodically compact the in-memory stripes. It sounds complicated compared to the current memstore, but it's probably simpler than other parts of HBase because you don't have to deal with IOExceptions, retries, etc.
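          As a rough sketch of the "small in-memory HFiles" idea (hypothetical names; String keys and values stand in for KeyValues, and a plain sorted map stands in for an encoded cellblock): a small active CSLM is periodically snapshotted into immutable sorted segments, reads merge the CSLM with the segments newest-first, and an in-memory compaction bounds the merge fan-in:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative sketch only, not HBase code; names are hypothetical.
public class CellBlockMemstoreSketch {

    private ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
    private final List<NavigableMap<String, String>> segments = new ArrayList<>();
    private final int flushThreshold; // entry count standing in for a byte size

    public CellBlockMemstoreSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public void put(String key, String value) {
        active.put(key, value);
        if (active.size() >= flushThreshold) flushInMemory();
    }

    // "Flush to memory": snapshot the small CSLM into an immutable sorted
    // segment; a real implementation would encode it into an off-heap block.
    private void flushInMemory() {
        segments.add(Collections.unmodifiableNavigableMap(new TreeMap<>(active)));
        active = new ConcurrentSkipListMap<>();
    }

    // Read path: consult the small CSLM first, then segments newest-to-oldest
    // (the per-read merge whose fan-in grows with the number of segments).
    public String get(String key) {
        String v = active.get(key);
        if (v != null) return v;
        for (int i = segments.size() - 1; i >= 0; i--) {
            v = segments.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }

    // In-memory compaction: merge all segments into one (oldest first, so
    // newer values win), bounding the read-time merge cost.
    public void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (NavigableMap<String, String> seg : segments) merged.putAll(seg);
        segments.clear();
        segments.add(Collections.unmodifiableNavigableMap(merged));
    }

    public int segmentCount() { return segments.size(); }
}
```

          The striping idea then amounts to running one of these per key range so each compaction touches only its own stripe's segments.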

          Andrew Purtell added a comment - - edited

          The problem is that if you have hundreds of 1MB in-memory HFiles, then it becomes too expensive to merge them all (via KVHeap) when scanning. A possible solution is to subdivide the memstore into "stripes" (probably smaller than the stripe compaction stripes) and periodically compact the in-memory stripes

          Anoop, Ram, and I were throwing around ideas of making in-memory HFiles out of memstore snapshots, and then doing in-memory compaction over them. If we have off-heap backing for memstore we could potentially carry larger snapshots (in memory HFiles resulting from a few merged memstore snapshots) leading to less frequent flushes and significantly less write amplification overall.

          stack added a comment -

          Matt Corgan

          It's basically creating small in-memory HFiles that can be compacted several times in memory without going to disk, and holding on to the WAL entries until they do go to disk.

          Pardon the dumb questions: "creating small in-memory HFiles..." – from a small CSLM that does the sort for us? Or, I remember talking to Martin Thompson once, trying to ask how he'd go about the MemStore 'problem', and I'm sure he didn't follow what I was on about (I was doing a crappy job explaining, I'm sure), but other than his usual adage of try everything and measure, he suggested just trying a sort on the fly... Are you thinking the same, Matt? So we'd keep around Cells and then, once we had a batch or after some nanos had elapsed, we'd do a merge sort w/ the current set of in-memory edits, put in place the new sorted 'in-memory-hfile', and up the mvcc read point so it was readable? Once they got to a certain size we'd do like we do now with snapshot and start up a new foreground set of edits to merge into?

          ...and holding on to the WAL entries until they do go to disk

          What are you thinking here? It would be good if the WAL system was not related to the MemStore system (though chatting w/ Liyin Tang recently, he had an idea that would make the WAL sync more 'live' if WAL sync updated mvcc (mvcc and seqid being tied)).

          Anoop, Ram, and I were throwing around ideas of making in-memory HFiles out of memstore snapshots....

          Would be sweet if the value at least was not on heap.... Sounds like nice experiment Andrew.

          Matt Corgan added a comment -

          "creating small in-memory HFiles..." – from a small CSLM that does the sort for us?

          Yes, that is all I meant. The CSLM would remain small because it gets flushed more often. I don't doubt there are better ways to do it than the CSLM (like the deferred sorting you mention), but even just shrinking the size of the CSLM would be an improvement without having to re-think the memstore's concurrency mechanisms.

          Let's say you have a 500MB memstore limit, and that encodes (not compresses) to 100MB. You could:

          • split it into 10 stripes, each with a ~50MB limit, and flush each of the 10 stripes (to memory) individually
            • you probably have a performance boost already because 10 50MB CSLMs are better than 1 500MB CSLM
          • for a given stripe, flush the CSLM each time it reaches 25MB, which will spit out a 5MB encoded "memory hfile" to the off-heap storage
          • optionally compact a stripe's "memory hfiles" in the background to increase read performance
          • when a stripe has a 25MB CSLM + 5 encoded snapshots, flush/compact the whole thing to disk
          • "release" the WAL entries for the stripe
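          For what it's worth, the arithmetic of the scheme above checks out — with the 5:1 encoding ratio given, a stripe sits at roughly 50MB (a 25MB CSLM plus five 5MB encoded snapshots) when it flushes to disk. A quick check of those example numbers (all sizes in MB; names illustrative):

```java
// Arithmetic check of the striping example above; all names hypothetical.
public class StripeMathSketch {
    public static void main(String[] args) {
        int memstoreLimit = 500;                            // total memstore limit, MB
        int stripes = 10;
        int stripeLimit = memstoreLimit / stripes;          // ~50MB per stripe
        int cslmFlushAt = 25;                               // flush a stripe's CSLM at 25MB
        double encodeRatio = 100.0 / 500.0;                 // 500MB encodes to 100MB
        double encodedSnapshot = cslmFlushAt * encodeRatio; // ~5MB "memory hfile"
        int snapshotsBeforeDiskFlush = 5;
        double stripeAtDiskFlush = cslmFlushAt + snapshotsBeforeDiskFlush * encodedSnapshot;

        System.out.println("stripe limit: " + stripeLimit + "MB");                    // 50MB
        System.out.println("encoded snapshot: " + encodedSnapshot + "MB");            // 5.0MB
        System.out.println("stripe footprint at disk flush: " + stripeAtDiskFlush + "MB"); // 50.0MB
    }
}
```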

          On the WAL entries, I was just pointing out that you can no longer release the WAL entries when you flush the CSLM. You have to hold on to the WAL entries until you flush the "memory hfiles" to disk.

          ramkrishna.s.vasudevan added a comment - - edited

          Would be sweet if the value at least was not on heap

          Yes, this could be a nice one. I think the Cell usage should be in place before doing this, though.

          {Got added by mistake.}
          Yu Li added a comment -

          Hi Matt Corgan and stack,

          I see you had a discussion long ago in HBASE-3484 (here), but it seems there has been no further progress since then. And [~mcorban], I see you have a more detailed design in mind now, according to your comment above, so I'm wondering whether you have done any real work to implement this design? Or have any plan to?

          Actually I think the design you proposed is somewhat different from the topic of this JIRA or of HBASE-3484, since it's more like an in-memory flush to reduce memory fragmentation rather than a "move off heap". I'm wondering whether it would be better to open another JIRA to make that discussion more explicit, while leaving the "off heap" discussion here?

          I've been watching this thread, or rather this topic, for a while, and now we've decided to make a similar improvement to our online HBase service, so I'd really like to work together with the community to complete the design and implementation of the "in-memory flush" work.

          I'm a totally new face in this discussion, so please kindly forgive me if I've said anything naive.

          Anoop Sam John added a comment -

          Yu Li, I am working on this CellBlocks work (yes, in-memory flushes). Coding-wise it is mostly done, and I will do perf tests as well. Some time ago I worked on HBASE-3484 but later dropped it. Here, along with off-heap, the discussion of CellBlocks also came up. This can greatly reduce the issue we face today with CSLM (when there are too many KVs in it). We are working on the off-heap stuff in parallel. My code is in a combined form now; let me separate it out. Also see HBASE-10648, which will allow us to have different MemStore impls.

          Matt Corgan added a comment -

          Yu Li you're right, flushing the memstore to memory is a separate issue from off-heap storage, but it's important to mention here so off-heap storage can be designed to support it. My comments about splitting the memstore into stripes could also be a separate issue, since it's just an improvement that saves you some in-memory compaction work on non-uniform data distributions.

          Yu Li added a comment -

          Hi Anoop Sam John,

          Thanks for the info, it's really good to know the progress; I had almost started to do the impl by myself. It's also great to see the patch making MemStore impls pluggable almost ready.

          My code is like in a combined form now. Let me seperate it out.

          I guess the code changes for CellBlocks would be based on HBASE-10648? I searched but found no separate JIRA for this CellBlocks impl; would you create one after separating the code out? I really cannot wait to take a look at it.

          Yu Li added a comment -

          Hi Matt Corgan,

          Got it, thanks for the explanation.

          Anoop Sam John added a comment -

          would you create one

          HBASE-10713. Will come up with a patch soon. Your suggestions are welcome. Please keep all such discussions under this new JIRA issue.

          Andrew Purtell added a comment -

          Just for documentary purposes at this point, since the implementation is early and has a long way to go: Red Hat recently announced ongoing work on a new GC called Shenandoah, with the stated goal to "Reduce GC pause times on extremely large heaps by doing evacuation work concurrently with Java threads and making pause times independent of heap size."

          JEP: http://openjdk.java.net/jeps/189
          Project: http://icedtea.classpath.org/shenandoah/
          Source: http://icedtea.classpath.org/hg/shenandoah

            People

            • Assignee: Unassigned
            • Reporter: Andrew Purtell
            • Votes: 0
            • Watchers: 38