Thanks for looking over it Mikhail. I've been looking at it from a few different angles:
DataBlockEncoding is essentially a "codec" between KeyValues and byte's. There is no threading, locking, IO, timeouts, retries, etc. There shouldn't even be exceptions. It would be nice to isolate the codec logic to minimize it's dependencies and to "harden" it as much as possible through pure unit tests, rather than heavyweight integration tests. We should be able to prove the correctness of the codecs and their edge cases without usage of a minicluster, etc. We will still need higher-level tests using the minicluster, but the low level tests should be extremely focused on the codec details. For example, I think some of the DeltaEncoders have some pretty complex and fragile logic that we could test a little more thoroughly if the testing environment were simpler.
It would be good to eventually separate the codecs like this into a separate module than hbase-server. It will isolate the code and tests so developers know that danger lurks inside those modules, and committers can keep a sharp eye out for any changes that affect that module. I'm aiming to make a module called hbase-prefix-tree that can hold the pluggable implementation classes for the implementation, and it may grow to support trie-encoded block indexes and memstores. That some of the other current HFileBlock codecs go into an hbase-codec module even though they don't formally exist yet.
I haven't brought this up because I don't want to be too starry-eyed, but I think it is actually appropriate for the client to make use of the DataBlockEncoders to encode KeyValues on-the-fly over the wire. It brings the same cost/benefit trade-off as encoding for disk space or block cache space. An even more advanced feature would be to pass entire data blocks over the wire for certain use cases (primarily unfiltered scans), and let the client decode them, saving a ton of server cpu. Others have mentioned re-writing the client from scratch for various reasons, and I would love to see these encodings built in from the start.
Stack, Jesse, and I did some brainstorming on the path to modularization, and I suggested the separation of the codecs from the hbase-server module. There's a diagram on
HBASE-5977. We would try to extract and encapsulate the logic for decoding what's inside each type of HFileBlock, creating a class hierarchy of HFileBlock implementations that implement interfaces like BloomChunkBlock. From the perspective of hbase-server, everything behind those interfaces is a perfectly tested, high-performance black box.
I'm actually proposing that hbase-server cannot see into hbase-codec and that hbase-codec cannot see into hbase-server. They both would see into hbase-common where we store the DataBlockEncoding enum and the DataBlockEncoder and EncodedSeeker interfaces. In this case the DataBlockEncoding enum goes into the hbase-common module and contains strings pointing to the implementation classes, and the implementations are instantiated via reflection. This also sets up a simple framework for additional codecs (possibly highly customized to a particular use case) to be developed with minimal effect on hbase-server.
Anyway, hopefully that makes sense. All in all, I see hbase-common as the "kernel" of the project, not simply a code-gateway between client and server. It should contain the core interfaces and classes that are fundamental to the concept of hbase with the assumption that they are simple enough to be thoroughly tested via unit tests.