Description
I came across the following problem with min-max statistics while writing test cases for ORC with Spark (latest master). I created a table stored as ORC with a single decimal field, added a couple of negative numbers to it, and used the ORC tools to print the details of the resulting ORC file. I noticed that while the minimum value was correct, the maximum was 0 instead of the largest negative number added. To better understand the problem, here is a unit test that demonstrates it:
@Test
public void testDecimalMinMaxStatistics() throws Exception {
  // testFilePath, conf and fs come from the usual ORC test setup.
  TypeDescription schema = TypeDescription.createDecimal()
      .withScale(2).withPrecision(7);
  Writer writer = OrcFile.createWriter(testFilePath,
      OrcFile.writerOptions(conf).setSchema(schema).stripeSize(100000)
          .bufferSize(10000));
  VectorizedRowBatch batch = new VectorizedRowBatch(1, 1024);
  DecimalColumnVector decimalColumnVector = new DecimalColumnVector(7, 2);
  batch.cols[0] = decimalColumnVector;
  batch.reset();
  batch.size = 2;
  // Two negative values: the maximum should be -88888.88.
  decimalColumnVector.set(0, new HiveDecimalWritable("-99999.99"));
  decimalColumnVector.set(1, new HiveDecimalWritable("-88888.88"));
  writer.addRowBatch(batch);
  writer.close();

  Reader reader = OrcFile.createReader(testFilePath,
      OrcFile.readerOptions(conf).filesystem(fs));
  DecimalColumnStatistics statistics =
      (DecimalColumnStatistics) reader.getStatistics()[0];
  assertEquals("Incorrect minimum value", new BigDecimal("-99999.99"),
      statistics.getMinimum().bigDecimalValue());
  assertEquals("Incorrect maximum value", new BigDecimal("-88888.88"),
      statistics.getMaximum().bigDecimalValue());
}
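For reference, a minimal sketch of how the statistics can also be dumped programmatically (the file path and the plain Configuration are assumptions; the orc-tools "meta" command prints essentially the same information):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.orc.ColumnStatistics;
  import org.apache.orc.OrcFile;
  import org.apache.orc.Reader;

  public class DumpDecimalStats {
    public static void main(String[] args) throws Exception {
      // Roughly equivalent to: java -jar orc-tools-*-uber.jar meta decimal.orc
      Reader reader = OrcFile.createReader(new Path("decimal.orc"),
          OrcFile.readerOptions(new Configuration()));
      for (ColumnStatistics stats : reader.getStatistics()) {
        // toString() includes count, hasNull, min, max and sum per column
        System.out.println(stats);
      }
    }
  }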
Note that this test fails only on the 1.5 branch and master, and passes on the 1.4 branch. Am I doing something wrong here? If this is indeed a bug, I don't think it causes correctness problems, but it might be a source of performance regressions when min-max stats are used with predicate pushdown.
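To illustrate the pushdown concern, here is a hedged sketch (the field name "d" is hypothetical, as if the decimal were a named field of a struct schema): with the maximum incorrectly recorded as 0, a predicate such as d > -1000 overlaps the stored [min, max] range, so the stripe cannot be skipped even though it contains no matching rows:

  import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
  import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
  import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
  import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;
  import org.apache.orc.RecordReader;

  // The predicate d > -1000, expressed as NOT (d <= -1000).
  SearchArgument sarg = SearchArgumentFactory.newBuilder()
      .startNot()
      .lessThanEquals("d", PredicateLeaf.Type.DECIMAL,
          new HiveDecimalWritable("-1000"))
      .end()
      .build();
  // "reader" is the Reader from the test above. With the true maximum
  // (-88888.88) the stripe could be skipped outright; with the bogus
  // maximum (0) it is read and every row is filtered out afterwards.
  RecordReader rows = reader.rows(
      reader.options().searchArgument(sarg, new String[]{"d"}));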