Details
Type: Bug
Status: Open
Priority: Trivial
Resolution: Unresolved
Description
According to the ORC spec doc, "1000 nanoseconds would be serialized as 0x0b and 100000 would be serialized as 0x0d."
However, the actual encoding results are formatNano(1000) = 0x0a and formatNano(100000) = 0x0c.
How about changing the document as follows?
"Because the number of nanoseconds often has a large number of trailing zeros, the number has trailing decimal zero digits removed when there are at least two of them, and the last three bits record the number of zeros removed minus one. Thus 1000 nanoseconds would be serialized as 0x0a and 100000 would be serialized as 0x0c."
Below are my test and its result, which confirm the nanosecond encodings.
#include <stdint.h>
#include <stdio.h>

// This is ORC's serialization code from ColumnWriter.cc; ORC encodes nanoseconds with this function.
// https://github.com/apache/orc/blob/master/c%2B%2B/src/ColumnWriter.cc#L1669
static int64_t formatNano(int64_t nanos) {
  if (nanos == 0) {
    return 0;
  } else if (nanos % 100 != 0) {
    // Fewer than two trailing zeros: nothing is removed, low three bits stay 0.
    return (nanos) << 3;
  } else {
    // At least two trailing zeros: strip them and store (zeros removed - 1).
    nanos /= 100;
    int64_t trailingZeros = 1;
    while (nanos % 10 == 0 && trailingZeros < 7) {
      nanos /= 10;
      trailingZeros += 1;
    }
    return (nanos) << 3 | trailingZeros;
  }
}

int main() {
  for (int nano = 1; nano <= 1000000; nano *= 10) {
    // The cast matches the %02x specifier; the values printed here fit in an unsigned int.
    printf("formatNano(%d) = 0x%02x\n", nano, (unsigned)formatNano(nano));
  }
  return 0;
}
The result:
formatNano(1) = 0x08
formatNano(10) = 0x50
formatNano(100) = 0x09
formatNano(1000) = 0x0a
formatNano(10000) = 0x0b
formatNano(100000) = 0x0c
formatNano(1000000) = 0x0d
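The encodings above can be cross-checked by decoding them again. The following is a minimal sketch, not code taken from the ORC sources: decodeNano and checkRoundTrip are hypothetical names introduced here, and the decoder assumes that a non-zero value z in the low three bits means z + 1 decimal zeros were removed, which is what formatNano above implies.

```cpp
#include <cassert>
#include <cstdint>

// Encoder as quoted in this issue (from ColumnWriter.cc).
static int64_t formatNano(int64_t nanos) {
  if (nanos == 0) {
    return 0;
  } else if (nanos % 100 != 0) {
    return nanos << 3;
  } else {
    nanos /= 100;
    int64_t trailingZeros = 1;
    while (nanos % 10 == 0 && trailingZeros < 7) {
      nanos /= 10;
      trailingZeros += 1;
    }
    return (nanos << 3) | trailingZeros;
  }
}

// Hypothetical inverse of formatNano (an assumption for illustration).
// Low three bits z: 0 means nothing was removed, otherwise z + 1 zeros were removed.
static int64_t decodeNano(int64_t encoded) {
  int64_t zeros = encoded & 0x7;
  int64_t nanos = encoded >> 3;
  if (zeros != 0) {
    for (int64_t i = 0; i <= zeros; ++i) {
      nanos *= 10;
    }
  }
  return nanos;
}

// Checks the two values discussed in this issue and a few round trips.
static void checkRoundTrip() {
  assert(formatNano(1000) == 0x0a);
  assert(formatNano(100000) == 0x0c);
  for (int64_t n = 1; n <= 100000000; n *= 10) {
    assert(decodeNano(formatNano(n)) == n);
  }
  assert(decodeNano(formatNano(123456789)) == 123456789);
  assert(decodeNano(formatNano(0)) == 0);
}
```

Calling checkRoundTrip() from a test or from main exercises every power of ten up to 100000000. If the values in the current spec text (0x0b for 1000, 0x0d for 100000) were fed to this decoder, they would decode to 10000 and 1000000, which is consistent with the mismatch reported here.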