OK, with this limited scope (no schema) it makes sense to have a fixed list of fields.
Right. In this implementation, a Struct is a simple concatenation of fields. No schema information is written into that concatenation because doing so would interfere with sort order. Struct is merely an API convenience. Now, the field encodings implemented in OrderedBytes include a header byte, which is currently used to identify the type of the encoded field that follows. The current implementation does not consume the full space of 256 bit patterns available in that header byte. I've been thinking about extending that header byte to include some version bits at the very beginning. That would enable evolution of the individual field encodings (say, if you later want to re-implement blob-mid). This doesn't address the user-level logical structure of a Struct data type, only evolution of the OrderedBytes codec.
My main concern is: I start using 96 with this struct encoding... it is fixed, so I can't add fields... so I work around it by adding a version number in front of the struct, and then I switch on v1, v2, v3 with all the fixed structs that I know...
Prepending a version number to the Struct's members will impact sort order. The Struct definition is fixed in that you can't prepend a new field or interpose one in the middle of an existing encoded value. You're free to append fields. Appending a field would look like the following:
1. application defines Struct v0 with members [A,B,C]
2. application writes lots of data
3. application changes, Struct v1 becomes [A,B,C,D,E]
4. application writes lots more data
At step 3, the application needs to become version-aware. Because the fields of v0 are a prefix of the fields of v1, the application can use the definition of Struct v1 with the following safeguards. (1) Any place where v0 was used, it now needs to check for end-of-buffer and skip over the two new elements. (2) Anywhere v1 is used, it must be mindful of truncated records and be prepared to receive only the v0 fields. Maybe the API defined around Struct can be improved to support these needs?
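A minimal sketch of those two safeguards, under loud assumptions: every field here is a plain 4-byte big-endian int written with `java.nio.ByteBuffer`, not a real OrderedBytes encoding, and the class and method names (`AppendOnlyStructDemo`, `encode`, `decode`) are hypothetical, not part of the Struct API.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: fields are fixed-width ints,
// not the OrderedBytes encodings discussed in this ticket.
class AppendOnlyStructDemo {
    static final int V0_FIELDS = 3; // A, B, C
    static final int V1_FIELDS = 5; // A, B, C, D, E

    /** Concatenates the field values, mimicking an append-only struct. */
    static byte[] encode(int... fields) {
        ByteBuffer buf = ByteBuffer.allocate(fields.length * Integer.BYTES);
        for (int f : fields) {
            buf.putInt(f);
        }
        return buf.array();
    }

    /** Decodes at most maxFields fields and stops early at end-of-buffer. */
    static List<Integer> decode(byte[] record, int maxFields) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        List<Integer> out = new ArrayList<>();
        while (out.size() < maxFields && buf.remaining() >= Integer.BYTES) {
            out.add(buf.getInt());
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] v0 = encode(1, 2, 3);
        byte[] v1 = encode(1, 2, 3, 4, 5);
        // v1 reader on a truncated (v0) record: only A, B, C come back.
        System.out.println(decode(v0, V1_FIELDS)); // [1, 2, 3]
        // v0 reader on a v1 record: D and E are left unread.
        System.out.println(decode(v1, V0_FIELDS)); // [1, 2, 3]
        System.out.println(decode(v1, V1_FIELDS)); // [1, 2, 3, 4, 5]
    }
}
```

The caller learns which version it received from the size of the returned list, which is the kind of check the application has to perform itself today.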
Records of v0 and v1 can be intermixed, i.e., as rowkeys in the same table. According to the documented sort semantics, they'll sort "left-to-right and depth-first". That is, they'll sort first according to the v0 fields and then, within each group of equal v0 values, by the additional v1 fields.
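To make that ordering concrete, here is a small sketch that sorts intermixed v0 and v1 rowkeys with an unsigned, byte-wise lexicographic comparison in which a strict prefix sorts before its extensions. It assumes single-byte fields rather than real OrderedBytes encodings, and `MixedVersionSortDemo` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: each field is one byte, so a v0 key has
// 3 bytes [A,B,C] and a v1 key has 5 bytes [A,B,C,D,E].
class MixedVersionSortDemo {
    /** Unsigned lexicographic order; a strict prefix sorts first. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Returns a sorted copy of the given rowkeys. */
    static List<byte[]> sorted(List<byte[]> rows) {
        List<byte[]> copy = new ArrayList<>(rows);
        copy.sort(MixedVersionSortDemo::compare);
        return copy;
    }

    public static void main(String[] args) {
        List<byte[]> rows = Arrays.asList(
            new byte[] {1, 2, 4},        // v0
            new byte[] {1, 2, 3, 4, 5},  // v1
            new byte[] {1, 2, 3},        // v0
            new byte[] {1, 2, 2, 9, 9},  // v1
            new byte[] {1, 2, 3, 0, 0}); // v1
        for (byte[] row : sorted(rows)) {
            System.out.println(Arrays.toString(row));
        }
        // Prints:
        // [1, 2, 2, 9, 9]
        // [1, 2, 3]
        // [1, 2, 3, 0, 0]
        // [1, 2, 3, 4, 5]
        // [1, 2, 4]
    }
}
```

The v0 key `[1, 2, 3]` lands immediately before the v1 keys that share its A, B, C values, and those v1 keys are then ordered by D and E.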
We leave all of this up to user applications today, so this change does nothing to ease that change management. Changing a compound rowkey today requires rewriting data (or duplicating it into a new table). A smarter struct encoding, one that preserves the sort semantics I've described but can also track more sophisticated schema changes, would be very useful indeed. I don't think one exists.
Prepending a version field to a Struct will change the sorting behavior: all v0 records will sort before all v1 records, etc. IMHO, this is a less flexible migration strategy than the append behavior described above. It's also perfectly valid, and the user of the Struct API is free to do so in their own application. In that case, the application is still version-aware. Instead of being cautious about consuming potentially truncated records, it executes a scan for each version.
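A rough sketch of the prepended-version alternative, again assuming single-byte fields and a single leading version byte. The names `VersionPrefixDemo`, `row`, and `scanVersion` are hypothetical; `scanVersion` stands in for one prefix scan per version and is not an HBase call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: key = [version, field bytes...].
class VersionPrefixDemo {
    /** Builds a key with a leading version byte followed by the fields. */
    static byte[] row(int version, int... fields) {
        byte[] out = new byte[fields.length + 1];
        out[0] = (byte) version;
        for (int i = 0; i < fields.length; i++) {
            out[i + 1] = (byte) fields[i];
        }
        return out;
    }

    /** Unsigned lexicographic order; a strict prefix sorts first. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Stand-in for a per-version scan: keeps keys with that version byte. */
    static List<byte[]> scanVersion(List<byte[]> rows, int version) {
        List<byte[]> out = new ArrayList<>();
        for (byte[] r : rows) {
            if (r.length > 0 && (r[0] & 0xff) == version) {
                out.add(r);
            }
        }
        out.sort(VersionPrefixDemo::compare);
        return out;
    }

    public static void main(String[] args) {
        List<byte[]> rows = new ArrayList<>(Arrays.asList(
            row(1, 1, 2, 3, 4, 5), // v1, fields sort before (1,2,4)
            row(0, 1, 2, 4),       // v0
            row(0, 1, 2, 3)));     // v0
        rows.sort(VersionPrefixDemo::compare);
        for (byte[] r : rows) {
            System.out.println(Arrays.toString(r));
        }
        // Prints:
        // [0, 1, 2, 3]
        // [0, 1, 2, 4]
        // [1, 1, 2, 3, 4, 5]
        System.out.println(scanVersion(rows, 0).size()); // 2
        System.out.println(scanVersion(rows, 1).size()); // 1
    }
}
```

Note that the v1 key sorts last even though its fields (1,2,3,...) would sort before (1,2,4) without the version byte, which is why a reader that wants everything in field order has to merge the per-version scans itself.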
As you said, data evolution is out of scope, so you consider this patch just a "smarter" alternative to the Bytes encoding.
HBASE-8201 is a smarter alternative to Bytes and this ticket adds some higher-level APIs for manipulating them. In short, yes, schema definition and evolution is out of scope.