Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
The current C++ ORC writer implementation has two issues with column statistics.
1. A new batch may overwrite the has_null flag that a previous batch recorded in colIndexStatistics. If the previous batch had at least one null value and the new batch has none, the flag is reset to false:
bool hasNull = false;
if (!structBatch->hasNulls) {
  colIndexStatistics->increase(numValues);
} else {
  const char* notNull = structBatch->notNull.data() + offset;
  for (uint64_t i = 0; i < numValues; ++i) {
    if (notNull[i]) {
      colIndexStatistics->increase(1);
    } else if (!hasNull) {
      hasNull = true;
    }
  }
}
colIndexStatistics->setHasNull(hasNull);
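One possible way to avoid the overwrite is to only ever raise the flag and never reset it from a later batch. The following is a minimal, self-contained sketch of that idea, not the actual patch: IndexStats, Batch, addBatch and twoBatchExample are hypothetical stand-ins for the writer's real classes, and only increase and setHasNull mirror names from the snippet above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the writer's per-column index statistics.
struct IndexStats {
  uint64_t numValues = 0;
  bool hasNull = false;
  void increase(uint64_t n) { numValues += n; }
  void setHasNull(bool v) { hasNull = v; }
};

// Hypothetical stand-in for a column batch.
// notNull[i] != 0 means row i holds a value.
struct Batch {
  bool hasNulls = false;
  std::vector<char> notNull;
};

// Adds one batch slice to the statistics. The flag is only ever raised,
// so a null-free batch cannot clear a null recorded by an earlier batch.
void addBatch(IndexStats& stats, const Batch& batch,
              uint64_t offset, uint64_t numValues) {
  if (!batch.hasNulls) {
    stats.increase(numValues);
    return;  // leave the existing hasNull flag untouched
  }
  const char* notNull = batch.notNull.data() + offset;
  bool sawNull = false;
  for (uint64_t i = 0; i < numValues; ++i) {
    if (notNull[i]) {
      stats.increase(1);
    } else {
      sawNull = true;
    }
  }
  if (sawNull) {
    stats.setHasNull(true);
  }
}

// Example: a batch containing one null, followed by a batch without nulls.
IndexStats twoBatchExample() {
  IndexStats stats;

  Batch first;
  first.hasNulls = true;
  first.notNull = {1, 0, 1};
  addBatch(stats, first, 0, 3);
  assert(stats.hasNull);

  Batch second;  // hasNulls stays false
  second.notNull = {1, 1};
  addBatch(stats, second, 0, 2);
  return stats;
}
```

In this sketch the statistics returned by twoBatchExample still report a null after the second, null-free batch has been added.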
2. If a ColumnStatistics has seen no non-null data, it has no sum/min/max information, so the protobuf serialization writes a generic ColumnStatistics instead of a type-specific one. As a result, the reader cannot reliably deserialize the ColumnStatistics into the correct type:
void toProtoBuf(proto::ColumnStatistics& pbStats) const override {
  pbStats.set_hasnull(_stats.hasNull());
  pbStats.set_numberofvalues(_stats.getNumberOfValues());
  if (_stats.hasMinimum()) {
    proto::DateStatistics* dateStatistics = pbStats.mutable_datestatistics();
    dateStatistics->set_maximum(_stats.getMaximum());
    dateStatistics->set_minimum(_stats.getMinimum());
  }
}
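A possible direction for the second problem is to always create the type-specific sub-message and fill in minimum/maximum only when they exist. The sketch below illustrates this with plain structs so that it compiles without protobuf. PbDateStatistics, PbColumnStatistics, DateStats and allNullExample are hypothetical stand-ins, and the boolean "has" fields are an assumption that models protobuf field presence.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the protobuf messages.
// A true "has" flag models a field that is present on the wire.
struct PbDateStatistics {
  bool hasMinimum = false;
  int32_t minimum = 0;
  bool hasMaximum = false;
  int32_t maximum = 0;
};

struct PbColumnStatistics {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasDateStatistics = false;
  PbDateStatistics dateStatistics;
};

// Hypothetical in-memory statistics for a date column.
struct DateStats {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasMinimum = false;  // false when no non-null value was seen
  int32_t minimum = 0;
  int32_t maximum = 0;
};

void toProtoBuf(const DateStats& stats, PbColumnStatistics& pb) {
  pb.hasNull = stats.hasNull;
  pb.numberOfValues = stats.numberOfValues;

  // Always mark the type-specific sub-message as present, so a reader
  // can tell that these are date statistics even without min/max.
  pb.hasDateStatistics = true;
  pb.dateStatistics = PbDateStatistics();

  if (stats.hasMinimum) {
    pb.dateStatistics.hasMinimum = true;
    pb.dateStatistics.minimum = stats.minimum;
    pb.dateStatistics.hasMaximum = true;
    pb.dateStatistics.maximum = stats.maximum;
  }
}

// Example: a column whose rows are all null.
PbColumnStatistics allNullExample() {
  DateStats stats;
  stats.hasNull = true;

  PbColumnStatistics pb;
  toProtoBuf(stats, pb);
  assert(pb.hasDateStatistics);
  return pb;
}
```

With this shape, an all-null column still serializes as date statistics with no minimum or maximum set, rather than as an untyped ColumnStatistics.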
The scope of this Jira is to fix these two problems.