Uploaded image for project: 'ORC'
  1. ORC
  2. ORC-87

[C++] Handle missing timezone conversion for timestamp statistics

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.4, 1.4.0
    • Component/s: C++
    • Labels:
      None

      Description

      The recent release adds timezone to timestamp values but does not add the conversion for timestamp statistics.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user majetideepak opened a pull request:

          https://github.com/apache/orc/pull/110

          ORC-87: Handle missing timezone conversion for timestamp statistics

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/majetideepak/orc ORC-87

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/orc/pull/110.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #110


          commit 16c73efed602957c7561e741d78f2e39f19502c2
          Author: Deepak Majeti <deepak.majeti@hpe.com>
          Date: 2017-04-18T20:59:44Z

          Fix timestamp statistics

          commit d8ec95a0b277fae37f933480eaa53cb7fb33764b
          Author: Deepak Majeti <deepak.majeti@hpe.com>
          Date: 2017-04-18T21:21:38Z

          pretty print statistics

          commit cf70e91d4942df6a8e18bcaed289334aa98a668a
          Author: Deepak Majeti <deepak.majeti@hpe.com>
          Date: 2017-04-18T23:47:51Z

          Fix stats

          commit 0c03fe700432c3a22db63c1e209185344ff4ee85
          Author: Deepak Majeti <deepak.majeti@hpe.com>
          Date: 2017-04-19T02:16:18Z

          Add example

          commit 4ed4b4acb41ace3a158405ceb0f1ba1c26b0d7ae
          Author: Deepak Majeti <deepak.majeti@hpe.com>
          Date: 2017-04-19T02:23:25Z

          improve test


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user majetideepak opened a pull request: https://github.com/apache/orc/pull/110 ORC-87 : Handle missing timezone conversion for timestamp statistics You can merge this pull request into a Git repository by running: $ git pull https://github.com/majetideepak/orc ORC-87 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/110.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #110 commit 16c73efed602957c7561e741d78f2e39f19502c2 Author: Deepak Majeti <deepak.majeti@hpe.com> Date: 2017-04-18T20:59:44Z Fix timestamp statistics commit d8ec95a0b277fae37f933480eaa53cb7fb33764b Author: Deepak Majeti <deepak.majeti@hpe.com> Date: 2017-04-18T21:21:38Z pretty print statistics commit cf70e91d4942df6a8e18bcaed289334aa98a668a Author: Deepak Majeti <deepak.majeti@hpe.com> Date: 2017-04-18T23:47:51Z Fix stats commit 0c03fe700432c3a22db63c1e209185344ff4ee85 Author: Deepak Majeti <deepak.majeti@hpe.com> Date: 2017-04-19T02:16:18Z Add example commit 4ed4b4acb41ace3a158405ceb0f1ba1c26b0d7ae Author: Deepak Majeti <deepak.majeti@hpe.com> Date: 2017-04-19T02:23:25Z improve test
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/110#discussion_r113301215

          — Diff: c++/src/Reader.cc —
          @@ -526,16 +526,25 @@ namespace orc {
          if (metadata.get() == nullptr)

          { throw std::logic_error("No stripe statistics in file"); }

          + auto currentStripeInfo = footer->stripes(static_cast<int>(stripeIndex));
          + auto currentStripeFooter = getStripeFooter(currentStripeInfo, *contents.get());
          + const Timezone& writerTZ =
          + currentStripeFooter.has_writertimezone() ?
          + getTimezoneByName(currentStripeFooter.writertimezone()) :
          + getLocalTimezone();
          + StatContext statContext;
          + statContext.correctStats = hasCorrectStatistics();
          + statContext.writerTimezone = &writerTZ;
          return std::unique_ptr<Statistics>
          (new StatisticsImpl(metadata->stripestats

          • (static_cast<int>(stripeIndex)),
          • hasCorrectStatistics()));
            + (static_cast<int>(stripeIndex)), statContext));
            }

          std::unique_ptr<Statistics> ReaderImpl::getStatistics() const {
          + StatContext statContext;
          + statContext.correctStats = hasCorrectStatistics();
          — End diff –

          You need the timezone here too don't you?

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/110#discussion_r113301215 — Diff: c++/src/Reader.cc — @@ -526,16 +526,25 @@ namespace orc { if (metadata.get() == nullptr) { throw std::logic_error("No stripe statistics in file"); } + auto currentStripeInfo = footer->stripes(static_cast<int>(stripeIndex)); + auto currentStripeFooter = getStripeFooter(currentStripeInfo, *contents.get()); + const Timezone& writerTZ = + currentStripeFooter.has_writertimezone() ? + getTimezoneByName(currentStripeFooter.writertimezone()) : + getLocalTimezone(); + StatContext statContext; + statContext.correctStats = hasCorrectStatistics(); + statContext.writerTimezone = &writerTZ; return std::unique_ptr<Statistics> (new StatisticsImpl(metadata->stripestats (static_cast<int>(stripeIndex)), hasCorrectStatistics())); + (static_cast<int>(stripeIndex)), statContext)); } std::unique_ptr<Statistics> ReaderImpl::getStatistics() const { + StatContext statContext; + statContext.correctStats = hasCorrectStatistics(); — End diff – You need the timezone here too don't you?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/110#discussion_r113301529

          — Diff: c++/src/Statistics.hh —
          @@ -23,10 +23,17 @@
          #include "orc/OrcFile.hh"
          #include "orc/Reader.hh"

          +#include "Timezone.hh"
          #include "TypeImpl.hh"

          namespace orc {

          + struct StatContext {
          + bool correctStats;
          + const Timezone* writerTimezone;
          + StatContext() : correctStats(false), writerTimezone(NULL) {}
          — End diff –

          Shouldn't the constructor take the correctStats and writerTimezone and make them const?

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/110#discussion_r113301529 — Diff: c++/src/Statistics.hh — @@ -23,10 +23,17 @@ #include "orc/OrcFile.hh" #include "orc/Reader.hh" +#include "Timezone.hh" #include "TypeImpl.hh" namespace orc { + struct StatContext { + bool correctStats; + const Timezone* writerTimezone; + StatContext() : correctStats(false), writerTimezone(NULL) {} — End diff – Shouldn't the constructor take the correctStats and writerTimezone and make them const?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/110#discussion_r113300916

          — Diff: c++/src/Statistics.cc —
          @@ -279,20 +277,28 @@ namespace orc {
          }

          TimestampColumnStatisticsImpl::TimestampColumnStatisticsImpl

          • (const proto::ColumnStatistics& pb, bool correctStats) {
            + (const proto::ColumnStatistics& pb, const StatContext& statContext) {
            valueCount = pb.numberofvalues();
          • if (!pb.has_timestampstatistics() || !correctStats) {
            + if (!pb.has_timestampstatistics() || !statContext.correctStats) { _hasMinimum = false; _hasMaximum = false; minimum = 0; maximum = 0; }

            else{
            const proto::TimestampStatistics& stats = pb.timestampstatistics();

          • _hasMinimum = stats.has_minimum();
          • _hasMaximum = stats.has_maximum();
            -
          • minimum = stats.minimum();
          • maximum = stats.maximum();
            + _hasMinimum = stats.has_minimumutc() || (stats.has_minimum() && statContext.writerTimezone);
            + _hasMaximum = stats.has_maximumutc() || (stats.has_maximum() && statContext.writerTimezone);
            +
            + if (stats.has_minimumutc()) { + minimum = stats.minimumutc(); + }

            else

            { + minimum = stats.minimum() + statContext.writerTimezone->getEpoch(); + }

            + if (stats.has_minimumutc()) {

              • End diff –

          You mean has_maximumutc() here.

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/110#discussion_r113300916 — Diff: c++/src/Statistics.cc — @@ -279,20 +277,28 @@ namespace orc { } TimestampColumnStatisticsImpl::TimestampColumnStatisticsImpl (const proto::ColumnStatistics& pb, bool correctStats) { + (const proto::ColumnStatistics& pb, const StatContext& statContext) { valueCount = pb.numberofvalues(); if (!pb.has_timestampstatistics() || !correctStats) { + if (!pb.has_timestampstatistics() || !statContext.correctStats) { _hasMinimum = false; _hasMaximum = false; minimum = 0; maximum = 0; } else{ const proto::TimestampStatistics& stats = pb.timestampstatistics(); _hasMinimum = stats.has_minimum(); _hasMaximum = stats.has_maximum(); - minimum = stats.minimum(); maximum = stats.maximum(); + _hasMinimum = stats.has_minimumutc() || (stats.has_minimum() && statContext.writerTimezone); + _hasMaximum = stats.has_maximumutc() || (stats.has_maximum() && statContext.writerTimezone); + + if (stats.has_minimumutc()) { + minimum = stats.minimumutc(); + } else { + minimum = stats.minimum() + statContext.writerTimezone->getEpoch(); + } + if (stats.has_minimumutc()) { End diff – You mean has_maximumutc() here.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majetideepak commented on a diff in the pull request:

          https://github.com/apache/orc/pull/110#discussion_r113462343

          — Diff: c++/src/Reader.cc —
          @@ -526,16 +526,25 @@ namespace orc {
          if (metadata.get() == nullptr)

          { throw std::logic_error("No stripe statistics in file"); }

          + auto currentStripeInfo = footer->stripes(static_cast<int>(stripeIndex));
          + auto currentStripeFooter = getStripeFooter(currentStripeInfo, *contents.get());
          + const Timezone& writerTZ =
          + currentStripeFooter.has_writertimezone() ?
          + getTimezoneByName(currentStripeFooter.writertimezone()) :
          + getLocalTimezone();
          + StatContext statContext;
          + statContext.correctStats = hasCorrectStatistics();
          + statContext.writerTimezone = &writerTZ;
          return std::unique_ptr<Statistics>
          (new StatisticsImpl(metadata->stripestats

          • (static_cast<int>(stripeIndex)),
          • hasCorrectStatistics()));
            + (static_cast<int>(stripeIndex)), statContext));
            }

          std::unique_ptr<Statistics> ReaderImpl::getStatistics() const {
          + StatContext statContext;
          + statContext.correctStats = hasCorrectStatistics();
          — End diff –

          `getStatistics()` is returning the statistics in the file footer. As far as I understood, there is no timezone information available in the file footer. Looks like the timezone is stored for each stripe.
          I am also curious about this design choice. I would imagine the file footer to have the timezone information. Is it possible for different stripes to have different timezone values?

          Show
          githubbot ASF GitHub Bot added a comment - Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/110#discussion_r113462343 — Diff: c++/src/Reader.cc — @@ -526,16 +526,25 @@ namespace orc { if (metadata.get() == nullptr) { throw std::logic_error("No stripe statistics in file"); } + auto currentStripeInfo = footer->stripes(static_cast<int>(stripeIndex)); + auto currentStripeFooter = getStripeFooter(currentStripeInfo, *contents.get()); + const Timezone& writerTZ = + currentStripeFooter.has_writertimezone() ? + getTimezoneByName(currentStripeFooter.writertimezone()) : + getLocalTimezone(); + StatContext statContext; + statContext.correctStats = hasCorrectStatistics(); + statContext.writerTimezone = &writerTZ; return std::unique_ptr<Statistics> (new StatisticsImpl(metadata->stripestats (static_cast<int>(stripeIndex)), hasCorrectStatistics())); + (static_cast<int>(stripeIndex)), statContext)); } std::unique_ptr<Statistics> ReaderImpl::getStatistics() const { + StatContext statContext; + statContext.correctStats = hasCorrectStatistics(); — End diff – `getStatistics()` is returning the statistics in the file footer. As far as I understood, there is no timezone information available in the file footer. Looks like the timezone is stored for each stripe. I am also curious about this design choice. I would imagine the file footer to have the timezone information. Is it possible for different stripes to have different timezone values?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majetideepak commented on a diff in the pull request:

          https://github.com/apache/orc/pull/110#discussion_r113462614

          — Diff: c++/src/Statistics.cc —
          @@ -279,20 +277,28 @@ namespace orc {
          }

          TimestampColumnStatisticsImpl::TimestampColumnStatisticsImpl

          • (const proto::ColumnStatistics& pb, bool correctStats) {
            + (const proto::ColumnStatistics& pb, const StatContext& statContext) {
            valueCount = pb.numberofvalues();
          • if (!pb.has_timestampstatistics() || !correctStats) {
            + if (!pb.has_timestampstatistics() || !statContext.correctStats) { _hasMinimum = false; _hasMaximum = false; minimum = 0; maximum = 0; }

            else{
            const proto::TimestampStatistics& stats = pb.timestampstatistics();

          • _hasMinimum = stats.has_minimum();
          • _hasMaximum = stats.has_maximum();
            -
          • minimum = stats.minimum();
          • maximum = stats.maximum();
            + _hasMinimum = stats.has_minimumutc() || (stats.has_minimum() && statContext.writerTimezone);
            + _hasMaximum = stats.has_maximumutc() || (stats.has_maximum() && statContext.writerTimezone);
            +
            + if (stats.has_minimumutc()) { + minimum = stats.minimumutc(); + }

            else

            { + minimum = stats.minimum() + statContext.writerTimezone->getEpoch(); + }

            + if (stats.has_minimumutc()) {

              • End diff –

          You are right. Will fix this.

          Show
          githubbot ASF GitHub Bot added a comment - Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/110#discussion_r113462614 — Diff: c++/src/Statistics.cc — @@ -279,20 +277,28 @@ namespace orc { } TimestampColumnStatisticsImpl::TimestampColumnStatisticsImpl (const proto::ColumnStatistics& pb, bool correctStats) { + (const proto::ColumnStatistics& pb, const StatContext& statContext) { valueCount = pb.numberofvalues(); if (!pb.has_timestampstatistics() || !correctStats) { + if (!pb.has_timestampstatistics() || !statContext.correctStats) { _hasMinimum = false; _hasMaximum = false; minimum = 0; maximum = 0; } else{ const proto::TimestampStatistics& stats = pb.timestampstatistics(); _hasMinimum = stats.has_minimum(); _hasMaximum = stats.has_maximum(); - minimum = stats.minimum(); maximum = stats.maximum(); + _hasMinimum = stats.has_minimumutc() || (stats.has_minimum() && statContext.writerTimezone); + _hasMaximum = stats.has_maximumutc() || (stats.has_maximum() && statContext.writerTimezone); + + if (stats.has_minimumutc()) { + minimum = stats.minimumutc(); + } else { + minimum = stats.minimum() + statContext.writerTimezone->getEpoch(); + } + if (stats.has_minimumutc()) { End diff – You are right. Will fix this.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majetideepak commented on a diff in the pull request:

          https://github.com/apache/orc/pull/110#discussion_r113467717

          — Diff: c++/src/Statistics.hh —
          @@ -23,10 +23,17 @@
          #include "orc/OrcFile.hh"
          #include "orc/Reader.hh"

          +#include "Timezone.hh"
          #include "TypeImpl.hh"

          namespace orc {

          + struct StatContext {
          + bool correctStats;
          + const Timezone* writerTimezone;
          + StatContext() : correctStats(false), writerTimezone(NULL) {}
          — End diff –

          We can change `pointer to const` later too.
          But you are right that it is more efficient to use const for the values too. Will change this.

          Show
          githubbot ASF GitHub Bot added a comment - Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/110#discussion_r113467717 — Diff: c++/src/Statistics.hh — @@ -23,10 +23,17 @@ #include "orc/OrcFile.hh" #include "orc/Reader.hh" +#include "Timezone.hh" #include "TypeImpl.hh" namespace orc { + struct StatContext { + bool correctStats; + const Timezone* writerTimezone; + StatContext() : correctStats(false), writerTimezone(NULL) {} — End diff – We can change `pointer to const` later too. But you are right that it is more efficient to use const for the values too. Will change this.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on the issue:

          https://github.com/apache/orc/pull/110

          I think we should probably add new methods for timestamp statistics for getLowerBound and getUpperBound that will be the same for the new version of the statistics, but be a looser bound in the case of the older files.

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on the issue: https://github.com/apache/orc/pull/110 I think we should probably add new methods for timestamp statistics for getLowerBound and getUpperBound that will be the same for the new version of the statistics, but be a looser bound in the case of the older files.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majetideepak commented on the issue:

          https://github.com/apache/orc/pull/110

          @omalley I have added methods for timestamp statistics for getLowerBound an getUpperBound

          Show
          githubbot ASF GitHub Bot added a comment - Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/110 @omalley I have added methods for timestamp statistics for getLowerBound an getUpperBound
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/110#discussion_r114196797

          — Diff: c++/src/Statistics.cc —
          @@ -279,20 +277,55 @@ namespace orc {
          }

          TimestampColumnStatisticsImpl::TimestampColumnStatisticsImpl

          • (const proto::ColumnStatistics& pb, bool correctStats) {
            + (const proto::ColumnStatistics& pb, const StatContext& statContext) {
            valueCount = pb.numberofvalues();
          • if (!pb.has_timestampstatistics() || !correctStats) {
            + if (!pb.has_timestampstatistics() || !statContext.correctStats) { _hasMinimum = false; _hasMaximum = false; + _hasLowerBound = false; + _hasUpperBound = false; minimum = 0; maximum = 0; + lowerBound = 0; + upperBound = 0; }

            else{
            const proto::TimestampStatistics& stats = pb.timestampstatistics();

          • _hasMinimum = stats.has_minimum();
          • _hasMaximum = stats.has_maximum();
            -
          • minimum = stats.minimum();
          • maximum = stats.maximum();
            + _hasMinimum = stats.has_minimumutc() || (stats.has_minimum() && (statContext.writerTimezone != NULL));
            + _hasMaximum = stats.has_maximumutc() || (stats.has_maximum() && (statContext.writerTimezone != NULL));
            + _hasLowerBound = stats.has_minimumutc() || stats.has_minimum();
            + _hasUpperBound = stats.has_maximumutc() || stats.has_maximum();
            +
            + // Timestamp stats are stored in milliseconds
            + if (stats.has_minimumutc()) { + minimum = stats.minimumutc(); + lowerBound = minimum; + }

            else if (statContext.writerTimezone)

            { + int64_t writerTimeSec = stats.minimum() / 1000; + // multiply the offset by 1000 to convert to millisecond + minimum = stats.minimum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + lowerBound = minimum; + }

            else

            { + minimum = 0; + // subtract 1 day to handle unknown TZ + lowerBound = stats.minimum() - (24 * 60 * 60 * 1000); + }

            +
            + // Timestamp stats are stored in milliseconds
            + if (stats.has_maximumutc())

            { + maximum = stats.maximumutc(); + upperBound = maximum; + }

            else if (statContext.writerTimezone)

            { + int64_t writerTimeSec = stats.maximum() / 1000; + // multiply the offset by 1000 to convert to millisecond + maximum = stats.maximum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + upperBound = maximum; + }

            else

            { + maximum = 0; + // add 1 day to handle unknown TZ + upperBound = stats.maximum() + (SECONDS_PER_DAY * 1000); + }

            + // Add 1 millisecond to account for microsecond precision of values
            + maximum += 1;

              • End diff –

          I don't understand this and the unit tests pass without it.

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/110#discussion_r114196797 — Diff: c++/src/Statistics.cc — @@ -279,20 +277,55 @@ namespace orc { } TimestampColumnStatisticsImpl::TimestampColumnStatisticsImpl (const proto::ColumnStatistics& pb, bool correctStats) { + (const proto::ColumnStatistics& pb, const StatContext& statContext) { valueCount = pb.numberofvalues(); if (!pb.has_timestampstatistics() || !correctStats) { + if (!pb.has_timestampstatistics() || !statContext.correctStats) { _hasMinimum = false; _hasMaximum = false; + _hasLowerBound = false; + _hasUpperBound = false; minimum = 0; maximum = 0; + lowerBound = 0; + upperBound = 0; } else{ const proto::TimestampStatistics& stats = pb.timestampstatistics(); _hasMinimum = stats.has_minimum(); _hasMaximum = stats.has_maximum(); - minimum = stats.minimum(); maximum = stats.maximum(); + _hasMinimum = stats.has_minimumutc() || (stats.has_minimum() && (statContext.writerTimezone != NULL)); + _hasMaximum = stats.has_maximumutc() || (stats.has_maximum() && (statContext.writerTimezone != NULL)); + _hasLowerBound = stats.has_minimumutc() || stats.has_minimum(); + _hasUpperBound = stats.has_maximumutc() || stats.has_maximum(); + + // Timestamp stats are stored in milliseconds + if (stats.has_minimumutc()) { + minimum = stats.minimumutc(); + lowerBound = minimum; + } else if (statContext.writerTimezone) { + int64_t writerTimeSec = stats.minimum() / 1000; + // multiply the offset by 1000 to convert to millisecond + minimum = stats.minimum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + lowerBound = minimum; + } else { + minimum = 0; + // subtract 1 day to handle unknown TZ + lowerBound = stats.minimum() - (24 * 60 * 60 * 1000); + } + + // Timestamp stats are stored in milliseconds + if (stats.has_maximumutc()) { + maximum = stats.maximumutc(); + upperBound = maximum; + } else if (statContext.writerTimezone) { + int64_t writerTimeSec = stats.maximum() / 1000; + // multiply the offset by 1000 to convert to millisecond + maximum = stats.maximum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + upperBound = maximum; + } else { + maximum = 0; + // add 1 day to handle unknown TZ + upperBound = stats.maximum() + (SECONDS_PER_DAY * 1000); + } + // Add 1 millisecond to account for microsecond precision of values + maximum += 1; End diff – I don't understand this and the unit tests pass without it.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majetideepak commented on a diff in the pull request:

          https://github.com/apache/orc/pull/110#discussion_r114200187

          — Diff: c++/src/Statistics.cc —
          @@ -279,20 +277,55 @@ namespace orc {
          }

          TimestampColumnStatisticsImpl::TimestampColumnStatisticsImpl

          • (const proto::ColumnStatistics& pb, bool correctStats) {
            + (const proto::ColumnStatistics& pb, const StatContext& statContext) {
            valueCount = pb.numberofvalues();
          • if (!pb.has_timestampstatistics() || !correctStats) {
            + if (!pb.has_timestampstatistics() || !statContext.correctStats) { _hasMinimum = false; _hasMaximum = false; + _hasLowerBound = false; + _hasUpperBound = false; minimum = 0; maximum = 0; + lowerBound = 0; + upperBound = 0; }

            else{
            const proto::TimestampStatistics& stats = pb.timestampstatistics();

          • _hasMinimum = stats.has_minimum();
          • _hasMaximum = stats.has_maximum();
            -
          • minimum = stats.minimum();
          • maximum = stats.maximum();
            + _hasMinimum = stats.has_minimumutc() || (stats.has_minimum() && (statContext.writerTimezone != NULL));
            + _hasMaximum = stats.has_maximumutc() || (stats.has_maximum() && (statContext.writerTimezone != NULL));
            + _hasLowerBound = stats.has_minimumutc() || stats.has_minimum();
            + _hasUpperBound = stats.has_maximumutc() || stats.has_maximum();
            +
            + // Timestamp stats are stored in milliseconds
            + if (stats.has_minimumutc()) { + minimum = stats.minimumutc(); + lowerBound = minimum; + }

            else if (statContext.writerTimezone)

            { + int64_t writerTimeSec = stats.minimum() / 1000; + // multiply the offset by 1000 to convert to millisecond + minimum = stats.minimum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + lowerBound = minimum; + }

            else

            { + minimum = 0; + // subtract 1 day to handle unknown TZ + lowerBound = stats.minimum() - (24 * 60 * 60 * 1000); + }

            +
            + // Timestamp stats are stored in milliseconds
            + if (stats.has_maximumutc())

            { + maximum = stats.maximumutc(); + upperBound = maximum; + }

            else if (statContext.writerTimezone)

            { + int64_t writerTimeSec = stats.maximum() / 1000; + // multiply the offset by 1000 to convert to millisecond + maximum = stats.maximum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + upperBound = maximum; + }

            else

            { + maximum = 0; + // add 1 day to handle unknown TZ + upperBound = stats.maximum() + (SECONDS_PER_DAY * 1000); + }

            + // Add 1 millisecond to account for microsecond precision of values
            + maximum += 1;

              • End diff –

          The timestamp values have microsecond precision `eg: 2037-01-01 00:00:00.000999`.
          However, I see that the statistics computed have millisecond precision `2037-01-01 00:00:00.000` and they are not rounded off to the ceil. This is to account for the ceil. I will update the test.
          Thanks! for catching this.

          Show
          githubbot ASF GitHub Bot added a comment - Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/110#discussion_r114200187 — Diff: c++/src/Statistics.cc — @@ -279,20 +277,55 @@ namespace orc { } TimestampColumnStatisticsImpl::TimestampColumnStatisticsImpl (const proto::ColumnStatistics& pb, bool correctStats) { + (const proto::ColumnStatistics& pb, const StatContext& statContext) { valueCount = pb.numberofvalues(); if (!pb.has_timestampstatistics() || !correctStats) { + if (!pb.has_timestampstatistics() || !statContext.correctStats) { _hasMinimum = false; _hasMaximum = false; + _hasLowerBound = false; + _hasUpperBound = false; minimum = 0; maximum = 0; + lowerBound = 0; + upperBound = 0; } else{ const proto::TimestampStatistics& stats = pb.timestampstatistics(); _hasMinimum = stats.has_minimum(); _hasMaximum = stats.has_maximum(); - minimum = stats.minimum(); maximum = stats.maximum(); + _hasMinimum = stats.has_minimumutc() || (stats.has_minimum() && (statContext.writerTimezone != NULL)); + _hasMaximum = stats.has_maximumutc() || (stats.has_maximum() && (statContext.writerTimezone != NULL)); + _hasLowerBound = stats.has_minimumutc() || stats.has_minimum(); + _hasUpperBound = stats.has_maximumutc() || stats.has_maximum(); + + // Timestamp stats are stored in milliseconds + if (stats.has_minimumutc()) { + minimum = stats.minimumutc(); + lowerBound = minimum; + } else if (statContext.writerTimezone) { + int64_t writerTimeSec = stats.minimum() / 1000; + // multiply the offset by 1000 to convert to millisecond + minimum = stats.minimum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + lowerBound = minimum; + } else { + minimum = 0; + // subtract 1 day to handle unknown TZ + lowerBound = stats.minimum() - (24 * 60 * 60 * 1000); + } + + // Timestamp stats are stored in milliseconds + if (stats.has_maximumutc()) { + maximum = stats.maximumutc(); + upperBound = maximum; + } else if (statContext.writerTimezone) { + int64_t writerTimeSec = stats.maximum() / 1000; + // multiply the offset by 1000 to convert to millisecond + maximum = stats.maximum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + upperBound = maximum; + } else { + maximum = 0; + // add 1 day to handle unknown TZ + upperBound = stats.maximum() + (SECONDS_PER_DAY * 1000); + } + // Add 1 millisecond to account for microsecond precision of values + maximum += 1; End diff – The timestamp values have microsecond precision `eg: 2037-01-01 00:00:00.000999`. However, I see that the statistics computed have millisecond precision `2037-01-01 00:00:00.000` and they are not rounded off to the ceil. This is to account for the ceil. I will update the test. Thanks! for catching this.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/110#discussion_r114225921

          — Diff: c++/src/Statistics.cc —
          @@ -279,20 +277,55 @@ namespace orc {
          }

          TimestampColumnStatisticsImpl::TimestampColumnStatisticsImpl

          • (const proto::ColumnStatistics& pb, bool correctStats) {
            + (const proto::ColumnStatistics& pb, const StatContext& statContext) {
            valueCount = pb.numberofvalues();
          • if (!pb.has_timestampstatistics() || !correctStats) {
            + if (!pb.has_timestampstatistics() || !statContext.correctStats) { _hasMinimum = false; _hasMaximum = false; + _hasLowerBound = false; + _hasUpperBound = false; minimum = 0; maximum = 0; + lowerBound = 0; + upperBound = 0; }

            else{
            const proto::TimestampStatistics& stats = pb.timestampstatistics();

          • _hasMinimum = stats.has_minimum();
          • _hasMaximum = stats.has_maximum();
            -
          • minimum = stats.minimum();
          • maximum = stats.maximum();
            + _hasMinimum = stats.has_minimumutc() || (stats.has_minimum() && (statContext.writerTimezone != NULL));
            + _hasMaximum = stats.has_maximumutc() || (stats.has_maximum() && (statContext.writerTimezone != NULL));
            + _hasLowerBound = stats.has_minimumutc() || stats.has_minimum();
            + _hasUpperBound = stats.has_maximumutc() || stats.has_maximum();
            +
            + // Timestamp stats are stored in milliseconds
            + if (stats.has_minimumutc()) { + minimum = stats.minimumutc(); + lowerBound = minimum; + }

            else if (statContext.writerTimezone)

            { + int64_t writerTimeSec = stats.minimum() / 1000; + // multiply the offset by 1000 to convert to millisecond + minimum = stats.minimum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + lowerBound = minimum; + }

            else

            { + minimum = 0; + // subtract 1 day to handle unknown TZ + lowerBound = stats.minimum() - (24 * 60 * 60 * 1000); + }

            +
            + // Timestamp stats are stored in milliseconds
            + if (stats.has_maximumutc())

            { + maximum = stats.maximumutc(); + upperBound = maximum; + }

            else if (statContext.writerTimezone)

            { + int64_t writerTimeSec = stats.maximum() / 1000; + // multiply the offset by 1000 to convert to millisecond + maximum = stats.maximum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + upperBound = maximum; + }

            else

            { + maximum = 0; + // add 1 day to handle unknown TZ + upperBound = stats.maximum() + (SECONDS_PER_DAY * 1000); + }

            + // Add 1 millisecond to account for microsecond precision of values
            + maximum += 1;

              • End diff –

          Actually, it should push up the upperBound, but maximum is less confusing as it was.

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/110#discussion_r114225921 — Diff: c++/src/Statistics.cc — @@ -279,20 +277,55 @@ namespace orc { } TimestampColumnStatisticsImpl::TimestampColumnStatisticsImpl (const proto::ColumnStatistics& pb, bool correctStats) { + (const proto::ColumnStatistics& pb, const StatContext& statContext) { valueCount = pb.numberofvalues(); if (!pb.has_timestampstatistics() || !correctStats) { + if (!pb.has_timestampstatistics() || !statContext.correctStats) { _hasMinimum = false; _hasMaximum = false; + _hasLowerBound = false; + _hasUpperBound = false; minimum = 0; maximum = 0; + lowerBound = 0; + upperBound = 0; } else{ const proto::TimestampStatistics& stats = pb.timestampstatistics(); _hasMinimum = stats.has_minimum(); _hasMaximum = stats.has_maximum(); - minimum = stats.minimum(); maximum = stats.maximum(); + _hasMinimum = stats.has_minimumutc() || (stats.has_minimum() && (statContext.writerTimezone != NULL)); + _hasMaximum = stats.has_maximumutc() || (stats.has_maximum() && (statContext.writerTimezone != NULL)); + _hasLowerBound = stats.has_minimumutc() || stats.has_minimum(); + _hasUpperBound = stats.has_maximumutc() || stats.has_maximum(); + + // Timestamp stats are stored in milliseconds + if (stats.has_minimumutc()) { + minimum = stats.minimumutc(); + lowerBound = minimum; + } else if (statContext.writerTimezone) { + int64_t writerTimeSec = stats.minimum() / 1000; + // multiply the offset by 1000 to convert to millisecond + minimum = stats.minimum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + lowerBound = minimum; + } else { + minimum = 0; + // subtract 1 day to handle unknown TZ + lowerBound = stats.minimum() - (24 * 60 * 60 * 1000); + } + + // Timestamp stats are stored in milliseconds + if (stats.has_maximumutc()) { + maximum = stats.maximumutc(); + upperBound = maximum; + } else if (statContext.writerTimezone) { + int64_t writerTimeSec = stats.maximum() / 1000; + // multiply the offset by 1000 to convert to millisecond + maximum = stats.maximum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + upperBound = maximum; + } else { + maximum = 0; + // add 1 day to handle unknown TZ + upperBound = stats.maximum() + (SECONDS_PER_DAY * 1000); + } + // Add 1 millisecond to account for microsecond precision of values + maximum += 1; End diff – Actually, it should push up the upperBound, but maximum is less confusing as it was.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user wgtmac commented on a diff in the pull request:

          https://github.com/apache/orc/pull/110#discussion_r114248388

          — Diff: c++/src/Reader.cc —
          @@ -526,16 +526,22 @@ namespace orc {
          if (metadata.get() == nullptr)

          { throw std::logic_error("No stripe statistics in file"); }

          + auto currentStripeInfo = footer->stripes(static_cast<int>(stripeIndex));
          — End diff –

          Should we consider some older compliers which don't support c++11 very well? I don't see anywhere else is using keyword auto.

          Show
          githubbot ASF GitHub Bot added a comment - Github user wgtmac commented on a diff in the pull request: https://github.com/apache/orc/pull/110#discussion_r114248388 — Diff: c++/src/Reader.cc — @@ -526,16 +526,22 @@ namespace orc { if (metadata.get() == nullptr) { throw std::logic_error("No stripe statistics in file"); } + auto currentStripeInfo = footer->stripes(static_cast<int>(stripeIndex)); — End diff – Should we consider some older compliers which don't support c++11 very well? I don't see anywhere else is using keyword auto.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majetideepak commented on a diff in the pull request:

          https://github.com/apache/orc/pull/110#discussion_r114351170

          — Diff: c++/src/Statistics.cc —
          @@ -279,20 +277,55 @@ namespace orc {
          }

          TimestampColumnStatisticsImpl::TimestampColumnStatisticsImpl

          • (const proto::ColumnStatistics& pb, bool correctStats) {
            + (const proto::ColumnStatistics& pb, const StatContext& statContext) {
            valueCount = pb.numberofvalues();
          • if (!pb.has_timestampstatistics() || !correctStats) {
            + if (!pb.has_timestampstatistics() || !statContext.correctStats) { _hasMinimum = false; _hasMaximum = false; + _hasLowerBound = false; + _hasUpperBound = false; minimum = 0; maximum = 0; + lowerBound = 0; + upperBound = 0; }

            else{
            const proto::TimestampStatistics& stats = pb.timestampstatistics();

          • _hasMinimum = stats.has_minimum();
          • _hasMaximum = stats.has_maximum();
            -
          • minimum = stats.minimum();
          • maximum = stats.maximum();
            + _hasMinimum = stats.has_minimumutc() || (stats.has_minimum() && (statContext.writerTimezone != NULL));
            + _hasMaximum = stats.has_maximumutc() || (stats.has_maximum() && (statContext.writerTimezone != NULL));
            + _hasLowerBound = stats.has_minimumutc() || stats.has_minimum();
            + _hasUpperBound = stats.has_maximumutc() || stats.has_maximum();
            +
            + // Timestamp stats are stored in milliseconds
            + if (stats.has_minimumutc()) { + minimum = stats.minimumutc(); + lowerBound = minimum; + }

            else if (statContext.writerTimezone)

            { + int64_t writerTimeSec = stats.minimum() / 1000; + // multiply the offset by 1000 to convert to millisecond + minimum = stats.minimum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + lowerBound = minimum; + }

            else {
            + minimum = 0;
            + // subtract 1 day 1 hour (25 hours) in milliseconds to handle unknown TZ and daylight savings

              • End diff –

          @omalley Note the change here to 25 hours. The additional 1 hour is to account for daylight savings.

          Show
          githubbot ASF GitHub Bot added a comment - Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/110#discussion_r114351170 — Diff: c++/src/Statistics.cc — @@ -279,20 +277,55 @@ namespace orc { } TimestampColumnStatisticsImpl::TimestampColumnStatisticsImpl (const proto::ColumnStatistics& pb, bool correctStats) { + (const proto::ColumnStatistics& pb, const StatContext& statContext) { valueCount = pb.numberofvalues(); if (!pb.has_timestampstatistics() || !correctStats) { + if (!pb.has_timestampstatistics() || !statContext.correctStats) { _hasMinimum = false; _hasMaximum = false; + _hasLowerBound = false; + _hasUpperBound = false; minimum = 0; maximum = 0; + lowerBound = 0; + upperBound = 0; } else{ const proto::TimestampStatistics& stats = pb.timestampstatistics(); _hasMinimum = stats.has_minimum(); _hasMaximum = stats.has_maximum(); - minimum = stats.minimum(); maximum = stats.maximum(); + _hasMinimum = stats.has_minimumutc() || (stats.has_minimum() && (statContext.writerTimezone != NULL)); + _hasMaximum = stats.has_maximumutc() || (stats.has_maximum() && (statContext.writerTimezone != NULL)); + _hasLowerBound = stats.has_minimumutc() || stats.has_minimum(); + _hasUpperBound = stats.has_maximumutc() || stats.has_maximum(); + + // Timestamp stats are stored in milliseconds + if (stats.has_minimumutc()) { + minimum = stats.minimumutc(); + lowerBound = minimum; + } else if (statContext.writerTimezone) { + int64_t writerTimeSec = stats.minimum() / 1000; + // multiply the offset by 1000 to convert to millisecond + minimum = stats.minimum() + (statContext.writerTimezone->getVariant(writerTimeSec).gmtOffset) * 1000; + lowerBound = minimum; + } else { + minimum = 0; + // subtract 1 day 1 hour (25 hours) in milliseconds to handle unknown TZ and daylight savings End diff – @omalley Note the change here to 25 hours. The additional 1 hour is to account for daylight savings.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/orc/pull/110

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/orc/pull/110
          Hide
          owen.omalley Owen O'Malley added a comment -

          I just committed this. Thanks, Deepak!

          Show
          owen.omalley Owen O'Malley added a comment - I just committed this. Thanks, Deepak!
          Hide
          owen.omalley Owen O'Malley added a comment -

          Released as part of ORC 1.4.0

          Show
          owen.omalley Owen O'Malley added a comment - Released as part of ORC 1.4.0

            People

            • Assignee:
              mdeepak Deepak Majeti
              Reporter:
              mdeepak Deepak Majeti
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development