ORC-415

[C++] Fix writing ColumnStatistics

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Component: C++

    Description

      The current C++ ORC writer implementation has two issues with column statistics.

      1. A new batch may overwrite the has_null flag that colIndexStatistics recorded for a previous batch. If the previous batch contained at least one null value and the new batch contains none, the flag is reset to false.

      bool hasNull = false;
      if (!structBatch->hasNulls) {
        colIndexStatistics->increase(numValues);
      } else {
        const char* notNull = structBatch->notNull.data() + offset;
        for (uint64_t i = 0; i < numValues; ++i) {
          if (notNull[i]) {
            colIndexStatistics->increase(1);
          } else if (!hasNull) {
            hasNull = true;
          }
        }
      }
      colIndexStatistics->setHasNull(hasNull);
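
One possible way to avoid the overwrite is to only ever raise the flag, never lower it. The following is a minimal, self-contained sketch of that idea, not the actual ORC patch. IndexStats, Batch, and addBatch are hypothetical stand-ins for the writer's statistics object, the column batch, and the writer's add method.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the writer's per-column index statistics.
struct IndexStats {
  uint64_t numberOfValues = 0;
  bool hasNull = false;

  void increase(uint64_t count) { numberOfValues += count; }
  void setHasNull(bool value) { hasNull = value; }
};

// Hypothetical stand-in for a column batch: notNull[i] == 0 marks a null.
struct Batch {
  bool hasNulls = false;
  std::vector<char> notNull;
};

// Adds one batch to the statistics. The null flag is only ever raised,
// so a later batch without nulls cannot clear what an earlier batch set.
void addBatch(IndexStats& stats, const Batch& batch,
              uint64_t offset, uint64_t numValues) {
  if (!batch.hasNulls) {
    // No nulls in this batch: count every value, leave the flag untouched.
    stats.increase(numValues);
    return;
  }
  assert(offset + numValues <= batch.notNull.size());
  const char* notNull = batch.notNull.data() + offset;
  uint64_t nonNullCount = 0;
  for (uint64_t i = 0; i < numValues; ++i) {
    if (notNull[i]) {
      ++nonNullCount;
    }
  }
  stats.increase(nonNullCount);
  if (nonNullCount < numValues) {
    // At least one null was seen in this batch.
    stats.setHasNull(true);
  }
}
```

With this shape, adding a batch that contains a null followed by a batch without nulls leaves hasNull set to true, which is the behavior the description asks for.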

      2. If a ColumnStatistics object has no non-null data, it has no sum/min/max information, so the protobuf serialization writes a generic ColumnStatistics message instead of a type-specific one. As a result, the reader has a hard time deserializing the ColumnStatistics correctly.

      void toProtoBuf(proto::ColumnStatistics& pbStats) const override {
        pbStats.set_hasnull(_stats.hasNull());
        pbStats.set_numberofvalues(_stats.getNumberOfValues());
        if (_stats.hasMinimum()) {
          proto::DateStatistics* dateStatistics = pbStats.mutable_datestatistics();
          dateStatistics->set_maximum(_stats.getMaximum());
          dateStatistics->set_minimum(_stats.getMinimum());
        }
      }
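
One way to keep the serialized statistics type-specific is to create the type-specific sub-message unconditionally and fill in min/max only when they exist. The sketch below illustrates this without depending on protobuf. DateStatisticsMsg, ColumnStatisticsMsg, DateStats, and toMessage are hypothetical stand-ins that mimic the generated messages and the toProtoBuf method; they are assumptions for illustration, not the ORC API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the generated DateStatistics message.
struct DateStatisticsMsg {
  bool hasMinimum = false;
  bool hasMaximum = false;
  int32_t minimum = 0;
  int32_t maximum = 0;
};

// Hypothetical stand-in for the generated ColumnStatistics message.
struct ColumnStatisticsMsg {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasDateStatistics = false;  // true once the sub-message exists
  DateStatisticsMsg dateStatistics;

  // Mimics a protobuf mutable_*() accessor: marks the sub-message present.
  DateStatisticsMsg* mutableDateStatistics() {
    hasDateStatistics = true;
    return &dateStatistics;
  }
};

// Hypothetical in-memory statistics for a date column.
struct DateStats {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasMinMax = false;  // false when the column held no non-null data
  int32_t minimum = 0;
  int32_t maximum = 0;
};

// Serializes the statistics. The date sub-message is always created, so a
// reader can tell the column type even when min/max are absent.
void toMessage(const DateStats& stats, ColumnStatisticsMsg& msg) {
  msg.hasNull = stats.hasNull;
  msg.numberOfValues = stats.numberOfValues;
  DateStatisticsMsg* dateMsg = msg.mutableDateStatistics();
  if (stats.hasMinMax) {
    assert(stats.minimum <= stats.maximum);
    dateMsg->hasMinimum = true;
    dateMsg->minimum = stats.minimum;
    dateMsg->hasMaximum = true;
    dateMsg->maximum = stats.maximum;
  } else {
    // No non-null data: leave min/max unset but keep the sub-message.
    dateMsg->hasMinimum = false;
    dateMsg->hasMaximum = false;
  }
}
```

For an all-null column, the resulting message still carries a date sub-message with no minimum or maximum, so the reader can pick the matching statistics type.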
      

       

      The scope of this Jira is to fix these two problems.

            People

              Assignee: wgtmac Gang Wu
              Reporter: wgtmac Gang Wu
