Lars and I chatted back on Thursday, and agreed that the solution for truly large objects (larger than the default HDFS block) would require a new streaming API. We also talked about the importance of keeping configuration and operations simple.
Lars Hofhansl, can you describe the scale and the load for the hybrid MOB storage system you have or are working on? It is new to me, and I'd be very curious about how things like backups and bulk loads are handled in that system.
The fundamental problem this MOB solution addresses is the balance between the hdfs small file problem and the write amplification (and the performance variability it causes) in hbase. Objects with values larger than 64MB are out of scope; we are in agreement about needing a streaming API there, and an hdfs+hbase solution seems more reasonable for that case. The goal of the MOB mechanism is to show demonstrable improvements in predictability and scalability for objects that are too small for hdfs to make sense, but for which plain hbase is non-optimal due to splits and compactions.
In some workloads I've been seeing (timeseries large-sensor dumps, mini indexes, or binary documents or images as blob cells), this feature would potentially be very helpful.
We still cannot stream the mobs. They have to be materialized at both the server and the client (going by the documentation here).
This is true; however, this is not the design point we are trying to address.
As I stated above, this can be achieved with an HBase/HDFS client alone, and better: store mobs up to a certain size by value in HBase (say 5 or 10mb or so); everything larger goes straight into HDFS with a reference only in HBase. This addresses both the many-small-files issue in HDFS (only files larger than 5-10mb would end up in HDFS) and the streaming problem for large files in HBase. Also, as I outlined in June, we can still make this "transactional" in the HBase sense with a three-step protocol: (1) write the reference row, (2) stream the blob to HDFS, (3) record the HDFS location in HBase (that's the commit). This solution is also missing from the "Existing Solutions" section of the initial PDF.
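The three-step protocol above can be sketched as a minimal simulation. This is only an illustration of the commit ordering, not real client code: the dicts standing in for an HBase table and an HDFS namespace, the `write_blob` function, and the PENDING/COMMITTED states are all hypothetical.

```python
import uuid

hbase_table = {}   # rowkey -> {"state": ..., "hdfs_path": ...}; stand-in for HBase
hdfs_fs = {}       # path -> bytes; stand-in for HDFS

def write_blob(rowkey, blob):
    # Step 1: write the reference row first; readers treat PENDING as absent.
    hbase_table[rowkey] = {"state": "PENDING", "hdfs_path": None}
    # Step 2: stream the blob into HDFS under a unique path.
    path = "/mobs/%s" % uuid.uuid4().hex
    hdfs_fs[path] = blob
    # Step 3: record the HDFS location in HBase -- this is the commit.
    # A crash before this point leaves only an orphaned file plus a
    # PENDING row, both of which can be cleaned up lazily.
    hbase_table[rowkey] = {"state": "COMMITTED", "hdfs_path": path}
    return path

path = write_blob("doc-42", b"\x00" * 1024)
```

The key property is that the reference row exists before the blob and the location is recorded last, so a reader never sees a committed row pointing at a missing file.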
Back in June, JingCheng's response to your comments never got feedback on how you'd manage the small files problem.
Also, two HDFS blob + HBase metadata solutions are explicitly mentioned in section 4.1.2 (v4 design doc), with pros and cons. The solution you propose is actually the first hdfs+hbase approach described – though its pros and cons don't go into the particulars of the commit protocol (the two-phase prepare-then-commit makes sense). The largest concern was in the doc as well: the HDFS small files problem.
Having a separate hdfs file per 100KB-10MB value is not a scalable or long-term solution. Let's do an example: say we wrote 200M 500KB blobs. This ends up being 100TB of data.
- Using hbase as is, we end up with objects packed into potentially 10,000 10GB files.
- Along the way, we'd end up splitting and compacting every 20,000 objects, rewriting large chunks of the 100TB over and over.
- Using hdfs+hbase, we'd end up with 200M files – many more files than the optimal 10,000 files that the vanilla hbase approach could eventually compact down to.
- 200M files would consume on the order of 200GB of RAM for block records in the NN (200M files * 3 block replicas per file * ~300 bytes per hdfs inode + blockinfo -> ~180GB), which is definitely uncharted territory for NNs – a place where there would likely be GC problems and other negative effects.
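The arithmetic behind the example above can be checked directly. The ~300 bytes per inode/blockinfo entry is the assumed figure from the text, not a measured NameNode constant:

```python
num_files = 200 * 10**6     # 200M blobs, one hdfs file each
blob_size = 500 * 10**3     # 500KB per blob
replicas = 3                # 3 block replicas per file
nn_record = 300             # assumed ~300 bytes per hdfs inode + blockinfo

total_data_tb = num_files * blob_size / 10**12
nn_ram_gb = num_files * replicas * nn_record / 10**9

print(total_data_tb)   # 100.0 -> the 100TB of data in the example
print(nn_ram_gb)       # 180.0 -> on the order of the ~200GB NN heap estimate
```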
"Replication" here can still be driven by the client; after all, each file successfully stored in HDFS has a reference in HBase.
The design doc approach actually minimizes the operational changes required to store MOBs. From an operational point of view, a user could just enable the optional feature and take advantage of its potential benefits.
This proposed hdfs+hbase approach actually pushes more complexity into the replication mechanism. For replication to work, the source cluster would now need new mechanisms to read the mobs from hdfs and ship them off to the other cluster. The MOB approach is simpler operationally and in code because it can use the normal replication mechanism.
The proposed hdfs+hbase approach would need updates and a new bulk load mechanism. The MOB approach is simpler operationally and in code because it would use normal bulk loads, and compactions would eventually push the mobs out (same IO cost).
The proposed hdfs+hbase approach would need updates to properly handle table snapshots, restoring table snapshots, and backups. Naively, it seems we'd have to do a lot of NN operations to back up the mobs (one per mob). We'd also need to build new tools to manage export and copy-table operations. The MOB approach is simpler because the copy/export table mechanisms remain the same, and we can use the same archiver mechanism to manage mob file snapshots (mobs are essentially stored in a special region).
We should use the tools for what they were intended: HBase for key-value storage, HDFS for streaming large blobs.
We agree here.
The case for HDFS is weak: 1MB-5MB blobs are not large enough for HDFS – in fact for HDFS this is nearly pathological.
The case for HBase is ok: we are writing key-values that tend to be larger than normal. However, constant continuous ingest of 1MB-5MB MOBs will likely trigger splits more frequently, which will in turn trigger unavoidable major compactions. This would occur even if bulk load mechanisms were used.
The case for HBase + MOB is stronger: we are writing key-values that tend to be large. The bulk of the 1MB-5MB MOB data is written off to the MOB path. The metadata for each mob (let's say ~100 bytes) is relatively small, so compactions and splits will be much rarer (~10,000x less frequent) than with the hbase-only approach. If bulk loads are done, an initial compaction would separate out the mob data and keep the region relatively small.
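The "~10,000x less frequent" figure above follows from the ratio of value size to reference size: the region only accumulates the ~100-byte reference per cell (an assumed figure) instead of the full MOB value, so it reaches its split threshold that much more slowly.

```python
mob_value = 1 * 10**6   # 1MB MOB value (low end of the 1MB-5MB range)
ref_size = 100          # assumed ~100 bytes of per-cell metadata in the region

ratio = mob_value // ref_size
print(ratio)   # 10000 -> the region fills ~10,000x slower, so splits
               #          (and the compactions they trigger) are ~10,000x rarer
```

At the 5MB end of the range the ratio would be ~50,000x, so 10,000x is the conservative end of the estimate.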
Using one client API for client convenience alone is not a reason to put all of this into HBase. A client can easily speak both HBase and HDFS protocols.
The tradeoff being made by the hdfs+hbase approach is opting for more operational complexity in exchange for implementation simplicity. With the hdfs+hbase approach, we'd also introduce new security issues: users in hbase would now have the ability to modify the file system directly, and we'd have to keep users and credentials in sync on both hdfs and hbase. With the MOB approach we just rely on HBase's security mechanism, and should be able to have per-cell ACLs, visibility tags, and all the rest.
Given the choice of 1) making hbase simpler to use by adding some internal complexity or 2) making hbase more operationally difficult by adding new external processes and requiring external integrations to manage parts of its data, we should opt for 1. Making HBase easier to use by removing knobs or making knobs as simple as possible should be the priority.
(Subjectively) I do not like the complexity of this as seen by the various discussions here. That part is just my $0.02 of course.
We agree about not liking complexity. However, the discussion process was public, and we described and knocked down several strawmen. I actually initially took the side against adding this feature, but have been convinced that, when complete, it will have light operator impact and actually be less complex than a full solution using the hybrid hdfs+hbase approach.
 https://issues.apache.org/jira/secure/attachment/12651408/Block-Manager-as-a-Service.pdf (see RAM Explosion section)