Uploaded image for project: 'ORC'
  1. ORC
  2. ORC-161

Create a new column type that run-length-encodes decimals

    XMLWordPrintableJSON

    Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: encoding
    • Labels:
      None

      Description

      I'm storing prices in ORC format, and have made the following observations about the current decimal implementation:

      • The encoding is inefficient: my prices are a walking-random set, plus or minus a few pennies per data point. This would encode beautifully with a patched base encoding. Instead I'm averaging 4 bytes per data point, after Zlib.
      • Everyone acknowledges that it's nice to be able to store huge numbers in decimal columns, but that you probably won't. Presto, for instance, has a fast-path which engages for precision of 18 or less, and decodes to 64-bit longs, and then a slow path which uses BigInt. I anticipate the majority of implementations fit the decimal(18,6) use case.
      • The whole concept of precision/scale, along with a dedicated scale per data point is messy. Sometimes it's checked on data ingest, other times its an error on reading, or else it's cast (and rounded?)

      I don't propose eliminating the current column type. It's nice to know there's a way to store really big numbers (or really accurate numbers) if I need that in the future.

      But I'd like to see a new column that uses the existing Run Length Encoding functionality, and is limited to 63+1 bit numbers, with a fixed precision and scale for ingest and query.

      I think one could call this FixedPoint. Every number is stored as a long, and scaled by a column constant. Ingest from decimal would scale and throw or round, configurably. Precision would be fixed at 18, or made configurable and verified at ingest. Stats would use longs (scaled with the column) rather than strings.

      Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits of precision. Or they can keep using decimal if they need 128 bits. Win/win?

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              ddrinka Douglas Drinka
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:

                Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 10m
                10m