Given that unaligned memory accesses have been getting faster on modern architectures, we should revisit our tuple memory layout, which currently adds padding to avoid unaligned accesses.
The code for computing our memory layout is in TupleDescriptor.java, and any change to the layout needs to be reflected in TupleDescriptor::GenerateLlvmStruct() in descriptors.cc.
I ran a simple experiment (diff attached) that switches to a packed layout, and the results look encouraging.
Perf results vs. cdh5-trunk are here:
I think we could further optimize the layout by ordering the slots by descending size and by putting the null bits last. We could also pack var-len slots into 12 bytes (4-byte len + ptr) instead of 16 (4-byte len, padded, + ptr).
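To make the size arithmetic concrete, here is a minimal sketch of the two layouts. It is a simplified model, not the logic in TupleDescriptor.java: the class and method names are hypothetical, and it assumes that the padded layout aligns each slot to min(slot size, 8) bytes with the null bytes placed first.

```java
import java.util.Arrays;

// Hypothetical, simplified model of tuple layouts. Not the code in TupleDescriptor.java.
class TupleLayoutSketch {
  // Rounds offset up to the next multiple of alignment (alignment must be > 0).
  static int roundUp(int offset, int alignment) {
    return (offset + alignment - 1) / alignment * alignment;
  }

  // Tuple size when null bytes come first and every slot is padded so that it
  // starts at a multiple of min(size, 8). Assumes all slot sizes are positive.
  static int alignedSize(int[] slotSizes, int numNullBytes) {
    int offset = numNullBytes;
    for (int size : slotSizes) {
      offset = roundUp(offset, Math.min(size, 8));
      offset += size;
    }
    return offset;
  }

  // Tuple size with no padding at all: null bytes plus the sum of the slot sizes.
  static int packedSize(int[] slotSizes, int numNullBytes) {
    int total = numNullBytes;
    for (int size : slotSizes) total += size;
    return total;
  }

  // Slot offsets in a packed layout with the largest slots first.
  // Entry i is the offset of the i-th largest slot; null bytes would follow the last slot.
  static int[] packedDescendingOffsets(int[] slotSizes) {
    int[] sorted = slotSizes.clone();
    Arrays.sort(sorted); // ascending; walked backwards below
    int[] offsets = new int[sorted.length];
    int offset = 0;
    for (int i = 0; i < sorted.length; i++) {
      offsets[i] = offset;
      offset += sorted[sorted.length - 1 - i];
    }
    return offsets;
  }

  public static void main(String[] args) {
    int[] slots = {1, 4, 8}; // e.g. a 1-byte, a 4-byte and an 8-byte slot
    System.out.println(alignedSize(slots, 1)); // 16
    System.out.println(packedSize(slots, 1));  // 14
    // Var-len slot modeled as a 4-byte len followed by an 8-byte ptr.
    System.out.println(alignedSize(new int[] {4, 8}, 0)); // 16
    System.out.println(packedSize(new int[] {4, 8}, 0));  // 12
    System.out.println(Arrays.toString(packedDescendingOffsets(slots))); // [0, 8, 12]
  }
}
```

In this model, ordering the slots by descending size places the 8-byte and 4-byte slots at offsets 0 and 8, so they remain naturally aligned relative to the start of the tuple even without padding. Whether that holds at runtime also depends on how tuples themselves are placed in memory, which this sketch does not model.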