HBASE-68

[hbase] HStoreFiles needlessly store the column family name in every entry

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Not a Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: regionserver
    • Labels: None

      Description

      Today, HStoreFiles keep the entire serialized HStoreKey around for every cell in the HStore. Since HStores are 1-1 with column families, this is really unnecessary - you can always infer the column family from the HStore the file belongs to (this information would ostensibly come from the file name or a header section). This means we could drop the column family portion of the HStoreKeys we write into the HStoreFile, reducing the size of the stored data. This would save space by removing redundant data, and could also improve speed, since there would be less data to scan over in memory and less data to transfer over the network.
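
      To make the redundancy concrete, here is a minimal sketch; the class, field names, and key values are illustrative only, not the actual HStoreKey serialization format:

      // Illustrative only - every cell in an HStore's files carries a full key,
      // so the family prefix ("info" below) is repeated once per cell even though
      // the whole file belongs to that single family.
      class CellKeySketch {
          byte[] row;        // e.g. "row-0001"
          byte[] column;     // e.g. "info:name", "info:email", "info:age"
          long   timestamp;  // cell version
      }
      // Keys as stored today:          After dropping the family:
      //   row-0001/info:name/ts          row-0001/name/ts
      //   row-0001/info:email/ts         row-0001/email/ts
      //   row-0001/info:age/ts           row-0001/age/ts
      // The "info:" prefix could be recovered from the HStore itself
      // (file name or a header section), as the description suggests.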


          Activity

          Jim Kellerman added a comment -

          Remember that an HStoreKey contains both the family name and the member name. You could have entries for 'contents:' (just the family name), 'contents:member1', 'contents:member2', etc., and they all get stored in the same HStoreFile.

          So unless you want to create a new object type to be the key, and then add the necessary logic to transform to/from HStoreKeys, I'd say that trading off a little space for time is a benefit, not a fault.

          -1 on this proposal.

          Bryan Duxbury added a comment -

          Jim, I am aware that multiple qualified cells per row show up in the same HStoreFile. I'm just suggesting that the part that comes before the qualified name is unnecessary.

          I understand that changing this would necessitate adding a new key type and transform logic, but I'm not convinced that the translation would actually cost that much more time. You have to recognize that even though the data is precomputed, it is probably coming off of a disk on another computer through the network in 64MB blocks. I have to think that the added transmission time of all the redundant data in aggregate is at least as much as the added time it would take to do translation, and possibly more.
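
          As a back-of-envelope illustration of that aggregate cost (the cell count and family name below are made-up numbers, not measurements):

          // Hypothetical numbers, purely to frame the trade-off.
          public class FamilyOverheadEstimate {
              public static void main(String[] args) {
                  long cells = 100_000_000L;               // cells in a store, hypothetical
                  int familyBytes = "contents:".length();  // 9 bytes repeated per cell
                  long overhead = cells * familyBytes;
                  System.out.printf("redundant family bytes: ~%d MB%n",
                      overhead / (1024 * 1024));           // ~858 MB of repeated prefix
              }
          }
          // The question is whether shipping and scanning that redundant data
          // costs more than re-attaching the family during key translation.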

          Jim Kellerman added a comment -

          Point taken. However since we have done little to no performance analysis to date, I would say that this would be a premature optimization. Let's see where the hot spots are first and address them.

          stack added a comment -

          Jim: I've been profiling while you've been on holidays. Looks like most of the low-hanging fruit has been picked: e.g. RPC serializations and gratuitous object creations in hbase. Apart from updates to memcache – SortedMaps are 'expensive' – the bulk of our time/resources are now in appending and nexting over MapFiles/SequenceFiles whether updating, reading, compacting or flushing (The latter two take up the bulk of CPU during writes at least). Anything we can do to improve our i/o story here will make for a win.

          As to dropping the family name when we go to the fs, I like the idea, especially as it's making keys (slightly) smaller... but yeah, let's measure first to see if these seemingly small savings even show up on the size/speed register.

          Jim Kellerman added a comment -

          Fixed with implementation of HFile

          stack added a comment -

          Should we close this? We still store family with every entry.

          Jim Kellerman added a comment -

          Oops! I thought we had changed that. Sorry.

          Jonathan Gray added a comment -

          Though not fixed/solved, I think we should close this issue as invalid.

          KVs must always contain their families because they are self-contained. Moving forward if we ever do locality groups, we'll definitely need them.

          By making them self-contained, we never have to rewrite/reallocate the data; i.e., our zero-copy reads pass along KV references to the actual HFile block we read in. Our Result is nothing but a List<KV> and we do not care whether they are all the same family or multiple families or whatever. If our KV no longer stores the family, we will have to undo the new optimizations (of not building a big Map as we build the Result) and start to track everything per family as we build the Result.

          All other issues outlined above like gratuitous object creations are also gone. This optimization would only undo them.

          +1 here for closing issue
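
          A minimal sketch of the "view" idea Jonathan describes above; the class is hypothetical, not the real KeyValue API:

          // A KV "view" is just an offset/length into the block buffer read from
          // the HFile, so no per-cell copy or reallocation is needed.
          class KeyValueView {
              final byte[] block;  // the HFile block as read in
              final int offset;    // where this cell's bytes start
              final int length;    // how many bytes belong to this cell

              KeyValueView(byte[] block, int offset, int length) {
                  this.block = block;
                  this.offset = offset;
                  this.length = length;
              }
              // Because the family bytes already sit inside `block`, the view is
              // self-contained; stripping the family from the on-disk format would
              // force a rewrite (a copy) to put it back before returning results.
          }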

          Bryan Duxbury added a comment -

          The idea of locality groups seems speculative, and clearly if we did that then this issue would be invalid from the get-go. However, I don't see why KVs couldn't be reconstituted partly from the store file and partly from the store file metadata when they are created, rather than writing that data to HDFS. Those values are actually constants, too, so each KV could just keep a reference to the constant object to use when writing in response to client requests.

          I think it would at least be interesting to measure the potential impact of this change. For people with lots of cells, lots of versions, or both, I could see this saving a substantial amount of disk and memory space.
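
          A rough sketch of the reconstitution idea above, with hypothetical class names: the family is read once per store file and shared by reference rather than stored per cell.

          class StoreFileMeta {
              final byte[] family;  // read once, e.g. from the file name or header
              StoreFileMeta(byte[] family) { this.family = family; }
          }

          class ReconstitutedCell {
              final byte[] rowQualifierTimestamp;  // the per-cell bytes actually on disk
              final StoreFileMeta meta;            // shared constant, one per store file

              ReconstitutedCell(byte[] cellBytes, StoreFileMeta meta) {
                  this.rowQualifierTimestamp = cellBytes;
                  this.meta = meta;                // reference only, no per-cell copy
              }

              byte[] family() { return meta.family; }  // re-attached on demand
          }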

          Jonathan Gray added a comment -

          Locality groups are something we should do. But I agree we can treat them separately from this issue, as we are not doing that now.

          KVs cannot be reconstituted without making every read copy all the data twice. We have to read blocks in from HDFS. At that point, we can just pass the entire buffer along and make KV "views" against the big block. Or, we could rewrite the block again, reconstituting larger KVs. That could be done when reading in, or when building results. In either case, we are no longer doing zero-copy reads.

          I don't see any way to do this without going backwards towards how things used to work... all the massive improvements we see are because of this consistent, explicit, and immutable KV.

          Jonathan Gray added a comment -

          Debates ensued on IRC. Agreed to punt for now.

          One new idea discussed that we might explore is using codes instead of storing the entire string. Client could rebuild by looking at HTD (which would contain the mapping from code -> family name), or we could send along a little header at the beginning of a Result.
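
          A minimal sketch of what such a code -> family mapping might look like (hypothetical class, not an existing API); as ryan notes in the next comment, the codes would have to stay stable for the life of the table:

          import java.util.HashMap;
          import java.util.Map;

          class FamilyCodec {
              private final Map<Byte, String> codeToFamily = new HashMap<>();
              private final Map<String, Byte> familyToCode = new HashMap<>();

              void register(byte code, String family) {
                  codeToFamily.put(code, family);
                  familyToCode.put(family, code);
              }

              byte encode(String family) { return familyToCode.get(family); }  // stored per cell
              String decode(byte code)   { return codeToFamily.get(code); }    // rebuilt via the HTD mapping
          }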

          ryan rawson added a comment -

          I give this issue a big -1 (or -2 or whatever).

          Right now we are 'needlessly' storing the column family... but in 0.21 I hope to be able to introduce locality groups, which will require us to keep the column family.

          Another thing is that we don't have to expand/patch up the reply during regionserver processing of scan/get, which helps quite a bit. Even a code -> string translation would cost us, and a code-based solution would make things more brittle, as we couldn't change or reorder the codes without invalidating an entire table.

          With block compression and LZO, we get amazing compression... I have seen 2-4x with production data. This helps mitigate the on-disk storage cost of duplicating the column family.

          Jim Kellerman added a comment -

          +1 on closing this as "won't fix"

          stack added a comment -

          Moved from 0.21 to 0.22 just after merge of old 0.20 branch into TRUNK.

          stack added a comment -

          Moving out of 0.92. Move it back in if you think differently.

          Andrew Purtell added a comment -

          Use block encoding
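
          For reference, data block encodings avoid writing repeated key components (row, family, qualifier prefixes) verbatim within each block. A minimal sketch of enabling one on a column family via the Java client; the table and family names here are placeholders:

          import org.apache.hadoop.hbase.HColumnDescriptor;
          import org.apache.hadoop.hbase.HTableDescriptor;
          import org.apache.hadoop.hbase.TableName;
          import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

          public class EnableBlockEncoding {
              public static HTableDescriptor describe() {
                  HTableDescriptor table = new HTableDescriptor(TableName.valueOf("my_table"));
                  HColumnDescriptor family = new HColumnDescriptor("info");
                  family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);  // elides repeated key parts
                  table.addFamily(family);
                  return table;
              }
          }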


            People

            • Assignee: Unassigned
            • Reporter: Bryan Duxbury
            • Votes: 0
            • Watchers: 2
