Affects Version/s: None
Fix Version/s: None
I'm storing prices in ORC format, and have made the following observations about the current decimal implementation:
- The encoding is inefficient: my prices are a random walk, moving plus or minus a few pennies per data point. This would encode beautifully with a patched base encoding; instead I'm averaging 4 bytes per data point, even after Zlib.
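To make the point concrete, here is a minimal sketch (with made-up prices and a hypothetical scale of 2, i.e. cents) of why a random-walk price series is so compressible once values are held as scaled longs: consecutive deltas are only a few units wide, which is exactly the shape that delta/patched-base run-length encodings pack into a few bits each.

```java
import java.util.Arrays;

public class PriceDeltas {
    public static void main(String[] args) {
        // Hypothetical prices at scale 2 (cents): a random walk that moves
        // a few pennies per tick.
        long[] cents = {104231, 104229, 104233, 104232, 104236};
        long[] deltas = new long[cents.length - 1];
        for (int i = 1; i < cents.length; i++) {
            deltas[i - 1] = cents[i] - cents[i - 1];
        }
        // Each delta fits in a handful of bits rather than 4+ bytes.
        System.out.println(Arrays.toString(deltas)); // [-2, 4, -1, 4]
    }
}
```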
- Everyone acknowledges that it's nice to be able to store huge numbers in decimal columns, but that you probably won't. Presto, for instance, has a fast path that engages for precision of 18 or less and decodes to 64-bit longs, and a slow path that uses BigInt. I anticipate the majority of real-world data fits the decimal(18,6) case.
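The reason 18 is the magic precision: the largest unscaled value with 18 decimal digits is 10^18 - 1, which still fits comfortably in a signed 64-bit long (Long.MAX_VALUE is about 9.22 * 10^18), while 19 digits can overflow. A one-line check:

```java
public class PrecisionFits {
    public static void main(String[] args) {
        // 10^18 - 1: the largest unscaled value at precision 18.
        long maxUnscaled = 999_999_999_999_999_999L;
        // Fits below Long.MAX_VALUE (~9.22e18), so any decimal(18, s)
        // can be carried in a single 64-bit long.
        System.out.println(maxUnscaled < Long.MAX_VALUE); // true
    }
}
```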
- The whole concept of precision/scale, along with a dedicated scale per data point, is messy. Sometimes it's checked on data ingest; other times it's an error on reading, or else the value is cast (and rounded?).
I don't propose eliminating the current column type. It's nice to know there's a way to store really big numbers (or really accurate numbers) if I need that in the future.
But I'd like to see a new column type that uses the existing Run Length Encoding functionality and is limited to 63+1 bit numbers, with a fixed precision and scale for ingest and query.
I think one could call this FixedPoint. Every number is stored as a long, scaled by a column constant. Ingest from decimal would scale and then either throw or round, configurably. Precision would be fixed at 18, or made configurable and verified at ingest. Stats would use longs (scaled with the column) rather than strings.
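A rough sketch of what the configurable ingest path could look like, using the standard library's BigDecimal (the method name `toScaledLong` and the scale of 6 are my own illustration, not a proposed API):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class FixedPointIngest {
    // Hypothetical ingest for a fixed-scale column: either round the input
    // to the column scale, or reject it when extra fractional digits exist.
    static long toScaledLong(BigDecimal value, int scale, boolean round) {
        RoundingMode mode = round ? RoundingMode.HALF_UP : RoundingMode.UNNECESSARY;
        // setScale throws ArithmeticException under UNNECESSARY if rounding
        // would lose digits; longValueExact throws if precision overflows 64 bits.
        return value.setScale(scale, mode).unscaledValue().longValueExact();
    }

    public static void main(String[] args) {
        // Rounding mode: 12.3456789 at scale 6 becomes the long 12345679.
        System.out.println(toScaledLong(new BigDecimal("12.3456789"), 6, true));
        // Strict mode: the same input is rejected instead of silently rounded.
        try {
            toScaledLong(new BigDecimal("12.3456789"), 6, false);
        } catch (ArithmeticException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because every stored value is just a long at a known scale, min/max stats fall out as plain long comparisons.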
Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits of precision. Or they can keep using decimal if they need 128 bits. Win/win?