Here is my initial proposal for the Lazy Binary SerDe. It should have the following properties:
1/ Lazy, which means the real fields are not deserialized until accessed, just like SimpleLazySerDe.
2/ Binary, which means the data is stored in a compact binary format. Unlike BinarySortable, however, the stored data does not preserve the sort order of the original data. The details of how each data type is stored are described below.
2.1/ Null fields in a row. We use a single bit per field to record whether that field is null: 0b means null and 1b means not null. Eight bits form a byte; if fewer than eight bits remain at the end, we still use one full byte (8 bits). These bytes are stored at the beginning of each row. Take a 10-column table as an example: each row begins with two bytes (16 bits). If the first column and the 10th column of a row are null, we store 01111111b and 10111111b, which are 127 and 191 in decimal. In this example the first column maps to the most significant bit, and the six unused bits of the second byte are set to 1.
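The row-level null bits in 2.1 could be packed roughly as follows. This is only a sketch: the class name NullBitmapSketch and its methods are made up for illustration, and the most-significant-bit-first order and the padding with 1 bits are inferred from the 127/191 example rather than fixed by the proposal.

```java
import java.util.Arrays;

// Hypothetical helper, not an existing Hive class.
final class NullBitmapSketch {
    // isNull[i] is true when field i is null. Returns ceil(n / 8) bytes.
    static byte[] encode(boolean[] isNull) {
        byte[] out = new byte[(isNull.length + 7) / 8];
        // Start with every bit set to 1 (not null); this also pads unused bits with 1.
        Arrays.fill(out, (byte) 0xFF);
        for (int i = 0; i < isNull.length; i++) {
            if (isNull[i]) {
                // Clear the bit for field i, counting from the most significant bit.
                out[i / 8] &= (byte) ~(0x80 >>> (i % 8));
            }
        }
        return out;
    }

    // Reads the flag for one field back out of the packed bytes.
    static boolean isNull(byte[] bitmap, int field) {
        return (bitmap[field / 8] & (0x80 >>> (field % 8))) == 0;
    }

    public static void main(String[] args) {
        boolean[] nulls = new boolean[10];
        nulls[0] = true;   // first column is null
        nulls[9] = true;   // 10th column is null
        byte[] b = encode(nulls);
        System.out.println((b[0] & 0xFF) + " " + (b[1] & 0xFF)); // prints "127 191"
    }
}
```

The same packing would apply at each nesting level described in 2.2.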
2.2/ Null fields in container types and complex types, such as list, map and struct. Similarly, we use a single bit to represent whether each element is null or not. For nested data, such as a list of lists, we store those bytes at each level: some bytes at the beginning of the outer list indicate whether each of its elements is null, and each element, which is itself a list, also begins with some bytes indicating whether its own elements are null.
2.3/ For elements that are null, we do not store them.
2.4/ For the int and long primitive types, we store them as variable-length integers, using the vint and vlong encodings provided by WritableUtils in Hadoop.
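To give a feel for the variable-length encoding in 2.4, here is a stand-alone sketch modeled on Hadoop's zero-compressed vint/vlong format: values from -112 to 127 take one byte, larger magnitudes take a marker byte followed by 1 to 8 payload bytes. The class VLongSketch is hypothetical, and byte-for-byte compatibility with Hadoop is an assumption that should be checked; the real implementation would simply call WritableUtils.writeVInt and WritableUtils.writeVLong.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical stand-in for the WritableUtils vint/vlong encoding.
final class VLongSketch {
    static byte[] encode(long v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (v >= -112 && v <= 127) {
            out.write((int) v);               // small values fit in a single byte
            return out.toByteArray();
        }
        int len = -112;                       // marker base for positive values
        if (v < 0) {
            v = ~v;                           // store the one's complement of negatives
            len = -120;                       // marker base for negative values
        }
        for (long tmp = v; tmp != 0; tmp >>= 8) {
            len--;                            // one step per payload byte needed
        }
        out.write(len);                       // marker byte encodes sign and length
        int payload = (len < -120) ? -(len + 120) : -(len + 112);
        for (int i = payload - 1; i >= 0; i--) {
            out.write((int) (v >> (i * 8)));  // big-endian payload, low 8 bits kept
        }
        return out.toByteArray();
    }

    static long decode(byte[] buf, int offset) {
        byte first = buf[offset];
        if (first >= -112) {
            return first;                     // single-byte value
        }
        boolean negative = first < -120;
        int payload = negative ? -(first + 120) : -(first + 112);
        long v = 0;
        for (int i = 1; i <= payload; i++) {
            v = (v << 8) | (buf[offset + i] & 0xFF);
        }
        return negative ? ~v : v;
    }

    public static void main(String[] args) {
        System.out.println(encode(127).length + " " + encode(128).length); // prints "1 2"
    }
}
```

Small values, which are common for counts and ids, therefore cost one byte instead of four or eight.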
2.5/ For the other primitive types, including double, float, Boolean, byte and short, we store them in binary format. For example, Boolean takes one byte and double takes eight bytes.
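The fixed sizes in 2.5 can be illustrated with a small sketch. The proposal does not specify a byte order, so the big-endian order used here (the ByteBuffer default) is an assumption, as are the widths of float (4 bytes) and short (2 bytes), which simply follow the Java types. FixedWidthSketch is a hypothetical name.

```java
import java.nio.ByteBuffer;

// Hypothetical helper showing the byte footprint of each fixed-width type.
final class FixedWidthSketch {
    static byte[] encode(boolean flag, byte b, short s, float f, double d) {
        // 1 (Boolean) + 1 (byte) + 2 (short) + 4 (float) + 8 (double) = 16 bytes
        ByteBuffer buf = ByteBuffer.allocate(1 + 1 + 2 + 4 + 8);
        buf.put((byte) (flag ? 1 : 0));  // Boolean stored as one byte
        buf.put(b);
        buf.putShort(s);                 // big-endian by default
        buf.putFloat(f);
        buf.putDouble(d);
        return buf.array();
    }

    public static void main(String[] args) {
        System.out.println(encode(true, (byte) 7, (short) 258, 1.5f, 2.5).length); // prints "16"
    }
}
```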
2.6/ For String, we first store its size as a vint, followed by all the string bytes. For an empty string, we just store its size.
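A possible shape for the String layout in 2.6 is sketched below. It assumes that "size" means the number of encoded bytes and that the bytes are UTF-8; neither is stated in the proposal. The length writer is a cut-down version of the vint encoding that handles non-negative values only, and StringFieldSketch is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: vint byte length followed by the string bytes.
final class StringFieldSketch {
    // Writes a non-negative length in the zero-compressed vint style.
    static void writeLength(ByteArrayOutputStream out, int n) {
        if (n <= 127) {
            out.write(n);                   // short lengths take one byte
            return;
        }
        int bytes = 0;
        for (int tmp = n; tmp != 0; tmp >>>= 8) {
            bytes++;                        // payload bytes needed for n
        }
        out.write(-112 - bytes);            // marker byte for a positive value
        for (int i = bytes - 1; i >= 0; i--) {
            out.write(n >>> (i * 8));       // big-endian payload
        }
    }

    static byte[] serialize(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);  // UTF-8 is an assumption
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLength(out, utf8.length);
        out.write(utf8, 0, utf8.length);    // nothing follows the size for ""
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(serialize("hive").length); // prints "5"
    }
}
```

An empty string serializes to the single byte 0, matching the rule that only the size is stored.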
2.7/ For List, we first store its size as a vint, followed by the bytes representing whether each element is null or not. The real elements are stored after that. For an empty list, we just store its size.
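Putting 2.3, 2.4 and 2.7 together, a list of ints might be laid out as in the sketch below: size as a vint, then the null bits, then only the non-null elements, each as a vint. The class ListFieldSketch is hypothetical, the vint writer is a re-implementation modeled on Hadoop's format, and the bit order and padding follow the assumptions read off the example in 2.1.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: serializes a list of nullable ints.
final class ListFieldSketch {
    // Zero-compressed variable-length integer, modeled on Hadoop's vlong.
    static void writeVLong(ByteArrayOutputStream out, long v) {
        if (v >= -112 && v <= 127) {
            out.write((int) v);
            return;
        }
        int len = -112;
        if (v < 0) {
            v = ~v;
            len = -120;
        }
        for (long tmp = v; tmp != 0; tmp >>= 8) {
            len--;
        }
        out.write(len);
        int payload = (len < -120) ? -(len + 120) : -(len + 112);
        for (int i = payload - 1; i >= 0; i--) {
            out.write((int) (v >> (i * 8)));
        }
    }

    static byte[] serialize(List<Integer> list) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVLong(out, list.size());              // 1. size as a vint
        if (list.isEmpty()) {
            return out.toByteArray();              // empty list: size only
        }
        byte[] nullBits = new byte[(list.size() + 7) / 8];
        Arrays.fill(nullBits, (byte) 0xFF);        // 1 = not null, also pads the tail
        for (int i = 0; i < list.size(); i++) {
            if (list.get(i) == null) {
                nullBits[i / 8] &= (byte) ~(0x80 >>> (i % 8));
            }
        }
        out.write(nullBits, 0, nullBits.length);   // 2. null bits
        for (Integer e : list) {
            if (e != null) {
                writeVLong(out, e);                // 3. non-null elements only
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // prints "[3, -65, 5, -114, 1, 44]"
        System.out.println(Arrays.toString(serialize(Arrays.asList(5, null, 300))));
    }
}
```

In the printed result, 3 is the size, -65 (10111111b) is the null byte, 5 is the first element, and the last three bytes encode 300; the null element takes no space.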
2.8/ For Map, we first store its size as a vint, followed by the bytes representing whether the keys and values are null or not. Each key-value pair requires two consecutive bits, so there are twice as many bits as there are entries in the map. The key-value pairs are stored afterwards. For an empty map, we just store its size.
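The two-bits-per-entry scheme in 2.8 could look like the sketch below. It assumes the key bit comes before the value bit within each pair, with the same most-significant-bit-first order and padding with 1 bits as in the row example; the proposal itself only says the two bits are consecutive. MapNullBitsSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds only the null-bit bytes for a map.
final class MapNullBitsSketch {
    static byte[] nullBits(Map<?, ?> map) {
        int bits = 2 * map.size();                 // one bit per key, one per value
        byte[] out = new byte[(bits + 7) / 8];
        Arrays.fill(out, (byte) 0xFF);             // 1 = not null, also pads the tail
        int bit = 0;
        for (Map.Entry<?, ?> e : map.entrySet()) {
            if (e.getKey() == null) {
                clear(out, bit);                   // key bit first (assumption)
            }
            if (e.getValue() == null) {
                clear(out, bit + 1);               // value bit second
            }
            bit += 2;
        }
        return out;
    }

    private static void clear(byte[] out, int bit) {
        out[bit / 8] &= (byte) ~(0x80 >>> (bit % 8));
    }

    public static void main(String[] args) {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("a", 1);
        m.put("b", null);
        m.put(null, 3);
        m.put("d", 4);
        m.put("e", null);
        byte[] b = nullBits(m);
        System.out.println((b[0] & 0xFF) + " " + (b[1] & 0xFF)); // prints "231 191"
    }
}
```

Five entries need 10 bits, so two bytes are used: 11100111b covers the first four pairs and 10111111b covers the fifth pair plus padding.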
2.9/ For Struct, we first store the bytes representing whether each field is null or not. Then we store the real data fields.
3/ We will use the standard writable object inspector.
4/ We will use the BytesWritable class as the serialization class.