Uploaded image for project: 'Kudu'
  1. Kudu
  2. KUDU-736

Implement BLOSC or bitshuffle-based compression scheme for numeric data

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • Private Beta
    • None
    • cfile
    • None

    Description

      A simple compression scheme that turns out to be both effective and fast for arrays of numeric data is the following:
      1) "shuffle" the data so that all of the MSBs come first, then all of the second bit, then all the third, etc. (this is sort of like VBP from BitWeaving (http://pages.cs.wisc.edu/~jignesh/publ/BitWeaving.pdf))
      2) compress the resulting data with an LZ-based compressor like LZ4

      According to http://arxiv.org/pdf/1503.00638.pdf this has compression ratios similar to gzip for some datasets, but way faster (the shuffle costs very little when optimized with AVX2). This also works generically for ints of any size as well as floats.

      There's also a potential longer term benefit that predicates can be evaluated on the "shuffled" form very efficiently (per the BitWeaving paper above)

      Attachments

        Activity

          People

            guangxiang.du Guangxiang Du
            tlipcon Todd Lipcon
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: