
[hbase] Create an HBase-specific MapFile implementation

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.0
    • Component/s: io
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change

      Description

      Today, HBase uses the Hadoop MapFile class to store data persistently to disk. This is convenient, as it's already done (and maintained by other people). However, it's beginning to look like there might be performance benefits to be had from an HBase-specific implementation of MapFile that incorporated HBase-specific features.

      This issue should serve as a place to track discussion about what features might be included in such an implementation.

      Attachments

      1. tfile.patch
        356 kB
        stack
      2. tfile3.patch
        400 kB
        stack
      3. cpucalltreetfile.html
        658 kB
        stack
      4. longestkey.patch
        2 kB
        stack
      5. hfile.patch
        385 kB
        stack
      6. hfile2.patch
        387 kB
        stack
      7. hfile3.patch
        375 kB
        stack
      8. HBASE-83.patch
        662 kB
        ryan rawson

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          Bryan Duxbury added a comment -

           Here are some of the ideas we're tossing around as a starter:

           • Exclude column family name from the file: Currently we store HStoreKeys, which are serialized to contain row, qualified cell name, and timestamp. However, seeing as how a given MapFile only ever belongs to one column family, it's very wasteful to store the same column family name over and over again. In a custom implementation, we wouldn't have to save that data. (A sketch of such a key follows this list.)
          • Separate indices for rows from qualified name and timestamp: Currently, the index in MapFiles is over all records, so the same row can appear in the index more than one time (differentiated by column name/timestamp). If the index just contained row keys, then we could store each row key exactly once, which would point to a record group of qualified names and timestamps (and values of course). Within the record group, there could be another separate small index on qualified name. This would again reduce the size of data stored, size of indices, and make it easier to do things like split regions lexically instead of skewed by cell count.
          • Use random rather than streaming reads: There is some indication that the existing MapFile implementation is optimized for streaming access; HBase supports random reads, which are therefore not efficient under MapFile. It would make sense for us to design our new implementation in such a way that it would be very cheap to do random access.
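           For illustration of the first idea, here is a minimal sketch of a family-less key, assuming a hypothetical FamilylessKey Writable holding only row, qualifier, and timestamp; the class and field names are illustrative, not from any patch on this issue:

             // Illustrative only: a key that omits the column family, since a store
             // file only ever belongs to a single family.
             import java.io.DataInput;
             import java.io.DataOutput;
             import java.io.IOException;
             import org.apache.hadoop.io.Text;
             import org.apache.hadoop.io.Writable;

             public class FamilylessKey implements Writable {
               private Text row = new Text();
               private Text qualifier = new Text();  // column name without the family prefix
               private long timestamp;

               public void write(DataOutput out) throws IOException {
                 row.write(out);
                 qualifier.write(out);
                 out.writeLong(timestamp);
               }

               public void readFields(DataInput in) throws IOException {
                 row.readFields(in);
                 qualifier.readFields(in);
                 timestamp = in.readLong();
               }
             }

           Since the family is identical for every entry in a store file, dropping it saves its length once per stored cell.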
          Tom White added a comment -

          The current design of MapFile.Reader makes it difficult to write an in-memory implementation. For example, to implement next() it's no good having a copy of the keys and values in memory as you can't copy their values into the Writables passed into the next method. Perhaps Writable should have a readFields(Writable) method? Or maybe the API should change.

          To write an in-memory implementation with the current design, I think you would need to do it at a lower level and hold the data file bytes in memory. Keys and values would be reconstructed each time next() or get() was called, so this would be less efficient than an implementation that cached keys and values.
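           For reference, the workaround described above amounts to a serialization round trip. A rough sketch using Hadoop's DataOutputBuffer and DataInputBuffer; the copyFields helper is hypothetical, and is exactly the copy a readFields(Writable) method would make unnecessary:

             import java.io.IOException;
             import org.apache.hadoop.io.DataInputBuffer;
             import org.apache.hadoop.io.DataOutputBuffer;
             import org.apache.hadoop.io.Writable;

             public class WritableCopy {
               // Copy the fields of 'src' into 'dst' by serializing and deserializing.
               public static void copyFields(Writable src, Writable dst) throws IOException {
                 DataOutputBuffer out = new DataOutputBuffer();
                 src.write(out);                          // serialize the source
                 DataInputBuffer in = new DataInputBuffer();
                 in.reset(out.getData(), out.getLength());
                 dst.readFields(in);                      // deserialize into the destination
               }
             }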

          stack added a comment -

           We need a fast containsKey (especially so if HADOOP-2513 goes in). It would be sweet if the backing implementation were able to satisfy the query out of a full index – i.e. an index that had an entry for every key in the MapFile (expensive) – or that tested membership against a bloom filter.

          Jim Kellerman added a comment -

           A fast containsKey could be based on bloom filters; at least it would quickly tell you !containsKey.

          Bryan Duxbury added a comment -

           I think it would make sense for us to maintain both a bloom filter and an index on row keys. That way, you can check the filter first to decide whether you should check the index. Even if there have been deletions in the region that damage the filter, the index will still answer your question pretty quickly. We can maintain (re-create) the filter during compactions.

          I think we would see huge gains from having an always-on bloom filter, especially for sparser row spaces.
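           A minimal sketch of that check-the-filter-first flow; BloomFilter and RowIndex here are hypothetical placeholder types, not classes from HBase:

             import java.io.IOException;

             public class ContainsRowSketch {
               interface BloomFilter { boolean mightContain(byte[] key); }
               interface RowIndex { boolean contains(byte[] row) throws IOException; }

               static boolean containsRow(BloomFilter filter, RowIndex index, byte[] row)
                   throws IOException {
                 if (filter != null && !filter.mightContain(row)) {
                   return false;             // definite miss: no index lookup needed
                 }
                 return index.contains(row); // filter said "maybe" (or is damaged), so confirm
               }
             }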

          Bryan Duxbury added a comment -

           It would also be nice to get a row count quickly from a MapFile, or at least check whether the MapFile is empty. We can't do that today.

          Tom White added a comment -

          If MapFile.Reader were an interface (or an abstract class with a no args constructor) then BloomFilterMapFile.Reader, HalfMapFileReader and caching Readers could be implemented as wrappers instead of in a static hierarchy. This would make it easier to mix and match readers (e.g. with or without caching) without passing all possible parameters in the constructor.
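           A rough sketch of the wrapper arrangement; the StoreFileReader interface and BloomFilteredReader class are hypothetical names, and get() simply mirrors the MapFile.Reader signature:

             import java.io.IOException;
             import org.apache.hadoop.io.Writable;
             import org.apache.hadoop.io.WritableComparable;

             // If reading were behind an interface, features like bloom filters or
             // caching could be layered as decorators around a base reader.
             interface StoreFileReader {
               Writable get(WritableComparable key, Writable value) throws IOException;
               void close() throws IOException;
             }

             class BloomFilteredReader implements StoreFileReader {
               private final StoreFileReader delegate;
               BloomFilteredReader(StoreFileReader delegate) { this.delegate = delegate; }

               public Writable get(WritableComparable key, Writable value) throws IOException {
                 // consult a bloom filter here and short-circuit on a definite miss ...
                 return delegate.get(key, value);
               }
               public void close() throws IOException { delegate.close(); }
             }

           A caching reader could wrap a base reader the same way, so features compose without a constructor that has to take every possible parameter.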

          Doug Cutting added a comment -

          > Exclude column family name from the file [ ... ]

          The column family name could be stored in the SequenceFile's metadata, no? MapFile's constructors don't currently permit one to specify metadata, but that'd be easy to add.
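           A sketch of that suggestion against the Hadoop API of the time: store the family once in the SequenceFile metadata rather than in every key. The metadata key name and the key/value classes are placeholders, and the exact createWriter overload may differ between Hadoop versions:

             import org.apache.hadoop.conf.Configuration;
             import org.apache.hadoop.fs.FileSystem;
             import org.apache.hadoop.fs.Path;
             import org.apache.hadoop.io.SequenceFile;
             import org.apache.hadoop.io.Text;

             public class FamilyInMetadata {
               public static SequenceFile.Writer open(FileSystem fs, Configuration conf,
                   Path path, String family) throws java.io.IOException {
                 SequenceFile.Metadata meta = new SequenceFile.Metadata();
                 meta.set(new Text("hbase.column.family"), new Text(family));
                 // Text key/value classes are placeholders for the real store key/value types.
                 return SequenceFile.createWriter(fs, conf, path, Text.class, Text.class,
                     SequenceFile.CompressionType.NONE, null, null, meta);
               }

               public static String readFamily(FileSystem fs, Configuration conf, Path path)
                   throws java.io.IOException {
                 SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
                 try {
                   return reader.getMetadata().get(new Text("hbase.column.family")).toString();
                 } finally {
                   reader.close();
                 }
               }
             }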

          > There is some indication that the existing MapFile implementation is optimized for streaming access [ ... ]

           It shouldn't be. The problem is that MapReduce, which is what's primarily used to benchmark and debug Hadoop, doesn't do any random access. So it's easy for random-access-related performance problems to sneak into MapFile and HDFS. Both Nutch and HBase depend on efficient random access from Hadoop, primarily through MapFile. We need a good random-access benchmark that someone regularly executes. Perhaps one could be added to the sort benchmark suite, since that is regularly run by Yahoo!? Or someone else could start running regular HBase benchmarks on a grid somewhere?

          Bryan Duxbury added a comment -

          Sometimes, it'd be nice to iterate on the keys of a MapFile without actually reading the data. For instance, check out HStore#getClosestRowBefore - this seeks around checking keys and doesn't even use the value once it's been found. Every read to the underlying SequenceFile for that value is wasted.

          Doug Cutting added a comment -

          > it'd be nice to iterate on the keys of a MapFile without actually reading the data

          SequenceFile supports that, so it shouldn't be too hard to add a next(WritableComparable) method to the MapFile API, right?
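           Indeed, SequenceFile.Reader.next(key) reads only the key. A sketch of key-only iteration over a MapFile's data file; KeyOnlyScan is illustrative, not a proposed MapFile method:

             import java.io.IOException;
             import org.apache.hadoop.conf.Configuration;
             import org.apache.hadoop.fs.FileSystem;
             import org.apache.hadoop.fs.Path;
             import org.apache.hadoop.io.SequenceFile;
             import org.apache.hadoop.io.WritableComparable;

             public class KeyOnlyScan {
               public static void scanKeys(FileSystem fs, Configuration conf, Path mapFileDir,
                   WritableComparable key) throws IOException {
                 // A MapFile is a directory holding a "data" SequenceFile plus an "index".
                 SequenceFile.Reader data =
                     new SequenceFile.Reader(fs, new Path(mapFileDir, "data"), conf);
                 try {
                   while (data.next(key)) {
                     // inspect 'key' here; the value is never deserialized
                   }
                 } finally {
                   data.close();
                 }
               }
             }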

          stack added a comment -

          Bloom Filters:

           + It turns out that, particularly since the change where we now have a Memcache per Store rather than one for a whole Region, we know the number of elements we're about to flush out to a Store file. That means we can pick an optimal bloom filter size (see the sizing sketch at the end of this comment). Therefore, bloom filters could be enabled by default.
           + We currently provide a choice: General, Counting, and Dynamic. I do not see where we would ever use anything but a General bloom filter (Counting adds deletions, Dynamic allows sizing). Therefore, I'd suggest we remove the choice of implementations.
           + Bloom filters are not as effective as they could be, given that the most popular lookup is for the 'latest' version of a cell: i.e. the lookup is not for an explicit cell – row/column/ts – but for the most recent version of the cell. So, bloom filters should be populated by row/column and probably not ts. We will have to actually fetch the cell to learn its actual ts.

          Mapfile Indices:

           + If the index had an entry for every row/column/ts entry in a Store file/MapFile, we wouldn't need a bloom filter (but it would consume volumes more memory!)
           + Chatting w/ Bryan, MapFile indices could be kept in an LRU. We'd add a means of asking a MapFile for its index. We'd shove it into an LRU or into a Reference Map (for the latter, when memory was low, the index would be dropped and refetched on the next access).
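           Since the element count n is known at flush time, the filter can be sized with the standard formulas for a target false-positive rate p: bits m = -n * ln(p) / (ln 2)^2 and hash count k = (m/n) * ln 2. A small sketch (the class name is illustrative):

             public class BloomSizing {
               // Number of bits for n elements at false-positive rate p.
               public static int optimalBits(long n, double p) {
                 return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
               }
               // Number of hash functions for a filter of 'bits' bits holding n elements.
               public static int optimalHashCount(long n, int bits) {
                 return Math.max(1, (int) Math.round((double) bits / n * Math.log(2)));
               }
             }

           For example, 100,000 cells at a 1% false-positive rate works out to roughly 958,506 bits (about 117 KB) and 7 hash functions.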

          Jim Kellerman added a comment -

          I ran some performance tests today and the results were not pretty.

          Doing 1,048,576 sequential writes through HBase onto the local file system achieved 3,620 writes per second.

          Writing 1,048,576 records sequentially into a MapFile onto the local file system was slightly better at 5,674 writes per second.

          stack added a comment -

           TFile (HADOOP-3315) looks promising.

          stack added a comment -

          Started a wikipage for new file format discussion: http://wiki.apache.org/hadoop/Hbase/NewFileFormat

          Andrew Purtell added a comment -

          I see this on the agenda for the Hackathon on Jan 30. Maybe some of the goals of HBASE-1024 regarding file level scaling and I/O efficiency can be incorporated here?

          Jonathan Gray added a comment -

          Andrew, I've added it to the schedule. There's 30 mins on TFile followed by 30 mins of general scalability, so HBASE-1024 should be a good issue to transition the discussion.

          stack added a comment -

           I like the idea of starting out with HBASE-1024 and its big-picture goals. Interesting, though: reading recent Bigtable slides, they said 100 regions of 100-200MB per server (probably so they can be kept in memory)? Maybe the 100 is a misprint? Should it be 1000?

          stack added a comment -

           This is the TFile patch adapted to HBase; tests that are very long-running or that depend on the removed LZO support have been disabled or reworked.

          stack added a comment -

           This patch includes HTFile, the wrapper around TFile, and also an HTFilePerformanceEvaluation that does the same as MapFilePerformanceEvaluation. We're writing about 3x faster with TFile, but there is something wrong w/ our random accesses. I'm looking into it.

          stack added a comment -

           Random-reading, here is a picture from the profiler showing where we spend all our time: we're nexting through the TFile block that has the wanted key.

          stack added a comment -

           Patch to TFile that writes the longest key seen into the file metadata. This longest key is then used when constructing the buffer used by TFile scanners. The resultant scanner buffers should be a good deal smaller than the maximum possible key size, which is the buffer size currently used every time a scanner is opened (even for the case where we are random-reading to get one value only).

          stack added a comment -

           In testing, TFile is a good bit slower than MapFile if cells are ~100 bytes or less and you are doing random access. It's slower even if you subsequently read 30 rows at the offset – even if we use a TFile block size of 8k. If cell values are 1k, TFile is faster than MapFile.

           So, after profiling and discussion on IRC, the thought is that we need something like a stripped-down TFile or even a new format altogether. The attached patch is the start of my stripping the chunking and the key and value streams out of TFile. It is not finished yet. The intent is to keep most of the TFile API and the underlying block mechanism, with its attendant block-finding mechanism, as well as all the metadata facility and the index-on-the-end, but in the guts of TFile there'd be the DFSClient FSInput/OutputStream and blocks of byte arrays only. The stripped-down TFile is now called HFile.

          stack added a comment -

           More stripping. This patch has HFile sort of working again (it's a hack-up with ugly byte array copies that we need to remove). I was able to do some basic performance comparisons. If the buffer size is 4k, then I can random-access 10-byte cells as fast as MapFile. If cells are bigger, HFile outperforms MapFile; e.g. if a cell is 100 bytes, HFile is 2x MapFile (these are extremely coarse tests going against the local filesystem).

           Need to do more stripping. In particular, implement Ryan Rawson's idea of carrying the HFile block in an NIO ByteBuffer and giving out new ByteBuffer 'views' when a key or value is asked for, rather than copying byte arrays.
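           The view idea amounts to slicing the block buffer rather than copying out of it. A minimal sketch; BlockViews and the offset/length arguments are illustrative:

             import java.nio.ByteBuffer;

             public class BlockViews {
               // Return a zero-copy window onto [offset, offset+length) of the block buffer.
               public static ByteBuffer view(ByteBuffer block, int offset, int length) {
                 ByteBuffer dup = block.duplicate();  // shares the same backing bytes
                 dup.position(offset);
                 dup.limit(offset + length);
                 return dup.slice();
               }
             }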

          stack added a comment -

           Latest version of the hfile patch. Scanners work properly now. Stripped down the API. We actually need the SimpleBufferedInputStream between TFile and DFSInputStream – just with a smaller buffer size – for the sake of increased concurrency. We also need to change how we read so that we read the whole block in, rather than piecemeal it as TFile currently does. TFile is block based, but reads on the backing stream do not pull in whole blocks; it just reads what's needed. This means that there is no whole block to cache if we only read a part, and we're decompressing just what we need – so it can be faster in certain circumstances – but this behavior frustrates being able to cache on a block basis or, more importantly, to cache decompressed blocks.

           I'd work on this next, but I have been chatting with Ryan Rawson over the last few days and he just sent me his rfile patch. Going to help out on that effort for a while.

          stack added a comment -

          Ryan checked in his rfile over here on github: http://github.com/ryanobjc/hbase-rfile/tree/master

           It's up on GitHub so more than one person can bang on it. The notion is first to test rfile vs tfile vs mapfile (I checked in the latest hfile to GitHub for contrast) and then, whichever wins, make a patch out of the GitHub code for this issue.

           I added to GitHub an evaluation of RFile using PE. RFile is ahead of MapFile, it looks like, using an 8k buffer and 10-byte cells. Tomorrow I will do more work ensuring all the files are returning what they are supposed to, and will try a comparison on DFS.

           Talked to AJ today as well. He suggested playing with pread – DFSDataInputStream has one – so the file can be more 'live'. He also suggested removing buffering on DFSDataInputStream since we're reading in blocks, and suggested we look at the receive socket buffer size as well – maybe add our own socket factory and, if block size < socket receive buffer size, use the smaller.
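           A sketch of the pread suggestion using the positioned-read API on FSDataInputStream (the Pread class is illustrative): readFully(position, ...) fetches bytes at an absolute offset without moving the stream's current position, so concurrent readers can share one open file.

             import java.io.IOException;
             import org.apache.hadoop.fs.FSDataInputStream;

             public class Pread {
               public static byte[] readBlock(FSDataInputStream in, long offset, int length)
                   throws IOException {
                 byte[] block = new byte[length];
                 in.readFully(offset, block, 0, length);  // positioned read; no seek
                 return block;
               }
             }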

          stack added a comment -

          Here are some numbers comparing file formats: http://wiki.apache.org/hadoop/Hbase/NewFileFormat/Performance

           I tried making DFSDataInputStream work without buffering – that'd help rfile – but it seems the stream needs to be markable, so it wouldn't work w/o a bunch of reworking.

           I also tried pread. That made a difference, improving rfile numbers by about 10%.

          stack added a comment -

          I added concurrent read numbers to the end of this page: http://wiki.apache.org/hadoop/Hbase/NewFileFormat/Performance.

          stack added a comment -

           HFile is all done (compression, metadata, comparators, etc.) except for bloom filters. Might wait on BF for the moment. Work is ongoing over in http://github.com/ryanobjc/hbase-rfile/ (Ryan and Stack). Next is making HBase use HFile.

          stack added a comment -

          Assigning to Ryan (He did hfile).

          ryan rawson added a comment -

           This is a complete implementation of HFile plugged into HBase, diffed against the current trunk. The full source is visible at http://github.com/ryanobjc

          Thanks to Stack who did a majority of the integration and test work.

          stack added a comment -

           I'm +1 on a commit (all tests pass for me). There is still integration work to do – in particular mapping the HColumnDescriptor configurations to match the new hfile for bloom filters, compression, and block sizing – but I'd suggest we do these as separate issues; the patch is big enough already.

           A primitive performance eval shows random reads up by about 60% and writes up by about 25%, but scans are down. Will do some profiling over the next few days.

          Other notes on the patch:

           + The change to hbase-site.xml is not yet hooked up.
           + This patch breaks binary keys because it undoes the ugly stuff we did to make them work. Will fix again when we address HBASE-859 – that's next. In other words, this patch has already started the reworking of HStoreKey, removing all the crap where every key had an HRegionInfo reference. One thing in particular that it adds is a raw comparator for comparing store keys; that is, no object instantiation, pure byte compare (see the sketch after this list).
           + The patch is basically a rewrite from HStore down. A few files were renamed because they changed so much – HStore becomes Store, HStoreFile becomes StoreFile, etc.
           + Some pieces of this patch are taken from TFile, HADOOP-3315 – in particular the hfile tests and much of the compression facility: e.g. BoundedRangeFileInputStream and the Compression types.
           + A few files are missing the Apache license – we can add one when we commit (simple block cache).
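           A minimal sketch of the raw-comparator shape mentioned above: compare serialized keys directly as bytes, with no object instantiation. A real HStoreKey comparator would decompose row/column/timestamp; this only shows the byte-level compare, and RawKeyComparator is an illustrative name:

             import org.apache.hadoop.io.RawComparator;
             import org.apache.hadoop.io.WritableComparator;

             public class RawKeyComparator implements RawComparator<byte[]> {
               public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
                 // lexicographic compare over the serialized bytes, no deserialization
                 return WritableComparator.compareBytes(b1, s1, l1, b2, s2, l2);
               }
               public int compare(byte[] left, byte[] right) {
                 return compare(left, 0, left.length, right, 0, right.length);
               }
             }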

          Jean-Daniel Cryans added a comment -

          Some tests I did:

          Unit tests on my Ubuntu desktop:

          BUILD SUCCESSFUL
          Total time: 25 minutes 39 seconds
          

           11-node cluster (2.0GHz CPU, 1GB RAM, 2*80GB HDD JBOD PATA)
           PE run from the Master node:

          HFile

          Finished sequentialWrite in 484020ms at offset 0 for 1048576 rows  2166 rows/sec
          HBase is restarted and I waited for all compactions to occur
          Finished scan in 166626ms at offset 0 for 1048576 rows  6293 rows/sec
          Finished randomRead in 2711788ms at offset 0 for 1048576 rows  387 rows/sec
          

          MapFile

          Finished sequentialWrite in 496937ms at offset 0 for 1048576 rows  2110 rows/sec
          HBase is restarted and I waited for all compactions to occur
          Finished scan in 153011ms at offset 0 for 1048576 rows  6853 rows/sec
          Finished randomRead in 4270211ms at offset 0 for 1048576 rows 246 rows/sec
          

           So, on this setup, reads are way up, writes are a bit up, and scans are a tiny bit down. IMO this is good for a commit if the issues stated by Stack are addressed in other jiras.

          +1

          stack added a comment -

           Committed. Thanks for the patch, Ryan (and thanks J-D for testing – I've made separate issues for hooking up hfile configuration).


             People

             • Assignee: ryan rawson
             • Reporter: Bryan Duxbury
             • Votes: 0
             • Watchers: 7
