@Andy That's not essential to storing labels in a metacolumn, though it may be advisable for performance reasons.
Understood. I am not saying this is needed or that metacolumns do not work without it. In fact, I think they are very useful in the context you discussed with Matt, for example TTLs. I personally think there is a need for two optional features: a) metacolumns - which cover broader rules for many columns or rows, and b) KV tags - which are carried as low as they can get to retain per-cell information.
So for TTL I would think that the tags are too low-level, yet for security I do think that metacolumns are too "weak" a guarantee.
@Andy and @Matt: So we may have this as a way to store tags inline with data, with dedup/optimize away if not needed; and we may have Lars' somehow tag structure addition to KV (Lars: what would that look like?). Worth doing a bake-off?
I think this is not either/or. If we add Trie compression - and Matt, please correct me if I am mistaken - then we can leverage that implementation to handle it. If we decide not to merge the two, then we can use my suggestion of adding the tags to the KV optionally, and we can handle the compression implications later.
@Andy: We could agree on criteria such as: Tag storage optimized out if no tags present
Indeed, since we use a new type, no extra storage is needed if no tag is attached.
@Andy: Compartmentalized changes
Agreed, we add a new type and handle that case separately. Though the majority of the code is shared, the new type would trigger the extraction of the tags if called for (which I assume would be done lazily).
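To make the "no storage unless used" and "lazy extraction" points concrete, here is a minimal sketch. It is not HBase's KeyValue: the class name, the type codes and the byte layout are all assumptions made for illustration, and row, family, qualifier and timestamp are left out.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not HBase's KeyValue. Only the value and the
// optional tag section are modelled.
final class TaggedCellSketch {
    // Made-up type codes, for illustration only.
    static final byte TYPE_PUT = 4;
    static final byte TYPE_PUT_WITH_TAGS = 5;

    // Layout: [type:1][valueLen:4][value] and, only for the tagged type,
    // [tagCount:1] followed by tagCount x [name:1][len:2][tagValue].
    private final byte[] buf;
    private Map<Byte, byte[]> tags; // stays null until a tag is first requested

    private TaggedCellSketch(byte[] buf) {
        this.buf = buf;
    }

    // Assumes at most 255 tags and tag values shorter than 64 KB.
    static TaggedCellSketch create(byte[] value, Map<Byte, byte[]> tags) {
        boolean hasTags = tags != null && !tags.isEmpty();
        int size = 1 + 4 + value.length;
        if (hasTags) {
            size += 1; // tag count
            for (byte[] v : tags.values()) {
                size += 1 + 2 + v.length; // name, length, value
            }
        }
        ByteBuffer b = ByteBuffer.allocate(size);
        b.put(hasTags ? TYPE_PUT_WITH_TAGS : TYPE_PUT);
        b.putInt(value.length);
        b.put(value);
        if (hasTags) {
            b.put((byte) tags.size());
            for (Map.Entry<Byte, byte[]> e : tags.entrySet()) {
                b.put(e.getKey().byteValue());
                b.putShort((short) e.getValue().length);
                b.put(e.getValue());
            }
        }
        return new TaggedCellSketch(b.array());
    }

    int serializedLength() { return buf.length; }
    boolean hasTags() { return buf[0] == TYPE_PUT_WITH_TAGS; }
    boolean tagsDecoded() { return tags != null; }

    byte[] getTag(byte name) {
        if (!hasTags()) {
            return null; // plain cells never reach the tag parser
        }
        if (tags == null) {
            tags = decodeTags(); // lazy: parsed on first access only
        }
        return tags.get(name);
    }

    private Map<Byte, byte[]> decodeTags() {
        ByteBuffer b = ByteBuffer.wrap(buf);
        b.get();                             // skip type
        int valueLen = b.getInt();
        b.position(b.position() + valueLen); // skip value
        int count = b.get() & 0xff;
        Map<Byte, byte[]> out = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            byte name = b.get();
            byte[] v = new byte[b.getShort() & 0xffff];
            b.get(v);
            out.put(name, v);
        }
        return out;
    }
}
```

In this sketch a cell created without tags serializes to exactly type, length and value, so the untagged path pays nothing, and the tag section of a tagged cell is only parsed when getTag is first called.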
@Andy: Generic mechanism for adding, reading, removing, and modifying tags, usable by coprocessors.
These are the KeyValue.addTag(byte name, byte value) and KeyValue.getTag(byte name) helpers I was referring to. Coprocessors have full access that way, since the tags are carried with each KV.
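A rough sketch of how such helpers could look from a coprocessor's point of view follows. Everything here is hypothetical: SketchKeyValue, WriteHook and DefaultVisibilityHook are illustration-only names and not HBase classes, removeTag is added to cover the "removing" criterion, and the tag value is a byte[] rather than a single byte so that labels of arbitrary length fit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a KeyValue that carries tags.
final class SketchKeyValue {
    private final byte[] value;
    private final Map<Byte, byte[]> tags = new HashMap<>();

    SketchKeyValue(byte[] value) {
        this.value = value.clone();
    }

    byte[] getValue() {
        return value.clone();
    }

    // Adds a tag, or overwrites it if the name is already present.
    void addTag(byte name, byte[] tagValue) {
        tags.put(name, tagValue.clone());
    }

    // Returns a copy of the tag value, or null if the tag is absent.
    byte[] getTag(byte name) {
        byte[] v = tags.get(name);
        return v == null ? null : v.clone();
    }

    // Returns true if a tag was actually removed.
    boolean removeTag(byte name) {
        return tags.remove(name) != null;
    }
}

// Hypothetical coprocessor-style hook that runs before a cell is written.
interface WriteHook {
    void prePut(SketchKeyValue kv);
}

// Attaches a default visibility label to cells that arrive without one.
final class DefaultVisibilityHook implements WriteHook {
    static final byte VISIBILITY = 1; // made-up tag name

    private final byte[] defaultLabel;

    DefaultVisibilityHook(byte[] defaultLabel) {
        this.defaultLabel = defaultLabel.clone();
    }

    @Override
    public void prePut(SketchKeyValue kv) {
        if (kv.getTag(VISIBILITY) == null) {
            kv.addTag(VISIBILITY, defaultLabel);
        }
    }
}
```

Because the tags travel with each KV in this sketch, the hook needs no side lookup: it reads, adds or replaces the label on the cell it is handed.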
@Andy: No, we don't have to mimic the Accumulo API, though if the goal here is to be an alternative, it must be possible to build a direct API translation shim that provides the same labelling and visibility semantics.
Indeed. One of the arguments I hear when comparing HBase and Accumulo is that we have no cell-level security tagging. That is what this is all about. My proposal is - as far as I can tell - lean (it uses no extra storage if not used), can be combined with the non-cell-level security (you might not want cell-level security, to avoid the extra baggage), does not change the comparators, and overall is quite non-intrusive in existing code. It also seems useful for other cell-level features in the future.
As Jon says, Accumulo uses these tags and the always-on filter to achieve security (at a very high level), and so can we. For me this is then comparable. We do not need to comply with the entire API, only match it at the feature-set level.
@stack: A core of required's with optional tags that don't cost unless you use them would be grand.
That is exactly my point. As for "KV in KV", I do not see how this is "odd", as our KeyValue is the odd one for starters, given what most people understand a KV to be. Coming to terms with our complex key and various sorting rules is not trivial.
@stack: Good point. Maybe not even lost, mayhaps a bug would cause us skip the metacolumn?
@Matt: I guess I'm saying it's maybe ok to muck up the current KV even more given that data block encoding should be able to clean up the mess down the road. That being said, I don't personally need this feature so I hate to suggest mucking up anything!
Agreed, this is about timing as well. Your patch is highly intrusive - but for good reasons. So I would love to discuss this current issue with your changes already applied. But on the other hand we have to make a call on what we want and when.
@Laxman: The basic premise here is to be on par with Accumulo security-wise. That is the use case. As for scalability, I do not see why a few extra bytes and a coprocessor that checks them would be disastrous. Sure, this needs evaluation, but we know that other systems - like Accumulo - do it, so if someone wants to enable it, they should see the same impact, small or big. Or, asking the other way around: where do you see this affecting performance?
How about other approach of supporting access control through HBase views?
The issue is that these are typically only on the row level. With the cell level you can filter as fine-grained as possible. Views - and please object if I am wrong - are more coarse-grained. Think of blocking access to different columns on different rows, not just allowing the same set of CF/CQs for all rows.
The latter is the crucial difference in what is needed to be on par.
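The difference can be illustrated with a small sketch. It is a simplification under stated assumptions: each cell carries at most one required label (Accumulo's visibility expressions are richer boolean expressions), and LabeledCell and CellVisibilitySketch are hypothetical names, not HBase or Accumulo classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical cell with an optional visibility label.
final class LabeledCell {
    final String row;
    final String column;
    final String label; // null means "no restriction"

    LabeledCell(String row, String column, String label) {
        this.row = row;
        this.column = column;
        this.label = label;
    }
}

final class CellVisibilitySketch {
    // Returns "row/column" for every cell the reader may see. The decision
    // is made per cell, so the hidden columns can differ from row to row.
    static List<String> visible(List<LabeledCell> cells, Set<String> auths) {
        List<String> out = new ArrayList<>();
        for (LabeledCell c : cells) {
            if (c.label == null || auths.contains(c.label)) {
                out.add(c.row + "/" + c.column);
            }
        }
        return out;
    }
}
```

If cf:salary requires the label "hr" on row1 while cf:name requires it on row2, a reader without that label sees row1/cf:name and row2/cf:salary. A single rule of the form "these CF/CQs are allowed for all rows" cannot express that.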