IMPALA-4123: Fast bit unpacking
Adds utility functions for fast unpacking of batches of bit-packed
values. These support reading batches of any number of values provided
that the start of the batch is aligned to a byte boundary. Callers that
want to read smaller batches that don't align to byte boundaries will
need to implement their own buffering.
The unpacking code uses only portable C++ and no SIMD intrinsics, but is
fairly efficient because unpacking a full batch of 32 values compiles
down to 32-bit loads, shifts by constants, masks by constants, bitwise
ORs (when a value straddles two 32-bit words), and stores. Further
speedups should be possible using SIMD intrinsics.
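For illustration, the shift/mask/OR pattern described above can be
sketched as follows. This is a simplified scalar loop, not the committed
code (which specializes on the bit width at compile time so the shifts
and masks become constants); the function name and signature are
hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Illustrative sketch only: unpack 'num_values' values of 'bit_width'
// bits each (1 <= bit_width <= 32, packed LSB-first) from 'in' into
// 'out'. 'in' must be padded so whole 32-bit words can be read.
void UnpackValues(const uint8_t* in, int bit_width, int num_values,
                  uint32_t* out) {
  const uint32_t mask =
      bit_width == 32 ? 0xFFFFFFFFu : (1u << bit_width) - 1;
  uint64_t bit_offset = 0;
  for (int i = 0; i < num_values; ++i) {
    const uint64_t word_idx = bit_offset / 32;
    const int shift = bit_offset % 32;
    // Load the 32-bit word containing the value's low bits.
    uint32_t lo;
    memcpy(&lo, in + word_idx * 4, 4);
    uint64_t v = lo >> shift;
    // If the value straddles two 32-bit words, OR in the high bits.
    if (shift + bit_width > 32) {
      uint32_t hi;
      memcpy(&hi, in + (word_idx + 1) * 4, 4);
      v |= static_cast<uint64_t>(hi) << (32 - shift);
    }
    out[i] = static_cast<uint32_t>(v) & mask;
    bit_offset += bit_width;
  }
}
```

In the real unrolled version, 'bit_width' is a template parameter, so
'shift' and 'mask' fold into the constants mentioned above.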
Added unit tests for unpacking, exhaustively covering different bit
widths with additional test dimensions (memory alignment, various
input sizes, etc.).
Tested under ASAN to ensure the bit unpacking doesn't read past the end
of the input buffer.
Added a microbenchmark that shows an 8-9x average speedup over the
existing BitReader code.
Reviewed-by: Tim Armstrong <firstname.lastname@example.org>
Tested-by: Internal Jenkins