Re-ordering my responses;
can LazyBinaryColumnarSerDe share some code with LazyBinarySerde?
Yes, it does. For instance, the serialization of LazyBinaryColumnarSerde forwards to LazyBinarySerde.serialize for each field (except for one special case of empty string described below). Similarly the deserialization happens in both cases eventually via LazyBinaryXXX.init. The objectinspector is the same, while the common parts of the object class (ColumnarStruct and BinaryColumnarStruct) has been refactored into a ColumnarStructBase class.
how warnedOnceNullMapKey is used?
To enable the above (reuse of LazyBinarySerde.serialize()), I have made it a static method of LazyBinarySerde. The only member variable which was used in that method was a boolean which was used to issue a LOG warning the first time a null map key is encountered. So, I made that into a parameter so that the existing behaviour for LazyBinarySerde is unchanged (that is a warning is issued once per instance).
why do you need to handle empty string specially? "serializeStream.write(INVALID_UTF__SINGLE_BYTE, 0, 1);" i thought for empty data, we just store data length 0 in rcfile.
I thought since there is no NULLSequence in the new serde, the null should be handled specially. i am missing sth, How do you handle null here?
As you know, column values' length are stored in the key part of rcfile (after run-length encoding, and an optional compressing). A 0 in this recorded length is used as the null indicator. This means that non-null values should occupy one or more bytes when serialized. That was ok with the original LazyBinarySerde.serialize, as primitive numeric types, strings (with their datalength) and complex types (with their datalength) do occupy non-zero bytes. But this is too much redundancy and overhead for the typical case (non-empty strings), so I added an extra parameter "skipLengthPrefix" which skips prefixing string/list/map/struct types with a length prefix. But with this, empty strings become a problem since they need to differentiated from nulls. So I used this special single-byte marker for denoting empty strings. (As a side note, for completeness' sake, I should point out that an instance of a struct which has no fields will be encoded with zero bytes. But this is not allowed by the language so I think we are fine here.)
is getLength() only for null check? if yes, can you call it 'isNull()'? And if the only difference in ColumnarStruct and BinaryColumnarStruct is null check, just curious, how difficult is it to avoid this new BinaryColumnarStruct class?
See above. In general, the length recorded in the key part of rcfile reflects the length of the bytesequence with which the lazyobject should be initialized. The only exception is in the case of empty strings, where the recorded length is 1 (the special marker), but the lazyobject needs be initialized with a 0-length byte sequence.
Recorded length being 0 indicates nulls for lazybinarycolumnar and data being the nullsequence indicates null for lazybinary. The difference between ColumnarStruct and BinaryColumnarStruct is this length/null handling, and the object creation itself, which are now the abstract methods of the common base class.
originally the map comparison is not supported, but this patch added a mapEqualComparer. can we put this in a separate patch? It seems the logic in CrossMapEqualComparer is not correct. (how do you make sure you will get the keys from a map in some kind of same order?)
I put this in this patch since I needed that for the tests that I had added. Do you think I should create a dependant jira and extract this part of the patch to that jira?
Hmm, the logic in crossmapequalcomparer looks ok to me (given the caveats mentioned in the javadoc about broken transitivity of greater-than/less-than.) I am not accessing the keys in tandem, but in a nested loop. Since the number of keys are the same, and the keys are unique, both keys and values matching (as declared by ObjectInspectorUtils.compare) is taken as a match for that pair of key-value pairs.
can you add a 'toString()' for new binary columnar serde, just the same as columnar serde
Done, and patch regenerated after rebasing.