Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Fix Version: Private Beta
Description
A simple compression scheme that turns out to be both effective and fast for arrays of numeric data is the following:
1) "shuffle" the data so that all of the MSBs come first, then all of the second bit, then all the third, etc. (this is sort of like VBP from BitWeaving (http://pages.cs.wisc.edu/~jignesh/publ/BitWeaving.pdf))
2) compress the resulting data with an LZ-based compressor like LZ4
According to http://arxiv.org/pdf/1503.00638.pdf this achieves compression ratios similar to gzip on some datasets while being much faster (the shuffle costs very little when optimized with AVX2). It also works generically for ints of any width as well as floats.
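As a concrete illustration of steps 1 and 2, here's a minimal scalar sketch of the bit shuffle (no AVX2; the function name and plane layout are my own for illustration, not an existing implementation). The shuffled buffer would then be handed to an LZ4-style compressor:

{code:cpp}
#include <cstddef>
#include <cstdint>
#include <vector>

// Transpose n 32-bit values into 32 bit-planes of n bits each,
// MSB-plane first. Assumes n is a multiple of 8 so each plane is a
// whole number of bytes. (Hypothetical sketch, not project code.)
std::vector<uint8_t> BitShuffle32(const uint32_t* vals, size_t n) {
  const size_t plane_bytes = n / 8;
  std::vector<uint8_t> out(32 * plane_bytes, 0);
  for (int bit = 0; bit < 32; ++bit) {  // plane 0 holds the MSBs
    uint8_t* plane = &out[bit * plane_bytes];
    for (size_t i = 0; i < n; ++i) {
      const uint32_t b = (vals[i] >> (31 - bit)) & 1;
      plane[i / 8] |= static_cast<uint8_t>(b << (7 - (i % 8)));
    }
  }
  return out;  // feed this to LZ4_compress_default() or similar
}
{code}

On slowly-varying data the high-order planes turn into long runs of identical bytes, which is exactly the kind of input LZ4's match-finding handles well; that's where the gzip-like ratios at much higher speed come from.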
There's also a potential longer-term benefit: predicates can be evaluated very efficiently directly on the "shuffled" form (per the BitWeaving paper above); see the sketch below.
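To make that BitWeaving-style benefit concrete, here's a hedged sketch (again my own naming, not code from the paper or this project) of evaluating "value < c" for 64 values at a time directly on the bit-planes produced above, without un-shuffling:

{code:cpp}
#include <cstddef>
#include <cstdint>
#include <cstring>

// planes[bit] points at the bytes of bit-plane `bit` (MSB plane = 0),
// laid out as in BitShuffle32 above. Returns a mask whose lanes are set
// where value < c; lane ordering follows the plane byte layout.
uint64_t LessThan64(const uint8_t* const planes[32], size_t word_idx,
                    uint32_t c) {
  uint64_t lt = 0;
  uint64_t eq = ~0ULL;  // lanes whose prefix still equals c's prefix
  for (int bit = 0; bit < 32; ++bit) {
    uint64_t d;
    std::memcpy(&d, planes[bit] + word_idx * 8, 8);  // 64 data bits
    const uint64_t cb = ((c >> (31 - bit)) & 1) ? ~0ULL : 0;
    lt |= eq & ~d & cb;  // first differing bit: data has 0, c has 1
    eq &= ~(d ^ cb);     // keep only lanes still equal so far
    if (eq == 0) break;  // all lanes decided; often exits early
  }
  return lt;
}
{code}

The early exit is the key property: a 64-value group can often be fully decided after scanning only a few high-order planes, so a selective filter may touch far less data than a scan over the un-shuffled values would.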