Uploaded image for project: 'ORC'
  1. ORC
  2. ORC-135

PPD for timestamp is wrong when reader and writer timezones are different

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0.0, 1.1.0, 1.2.0, 1.3.0
    • Fix Version/s: 1.3.2, 1.4.0
    • Component/s: None
    • Labels:
      None

      Description

      When reader and writer timezones are different, PPD evaluation does not offset the timezone when reading the min and max values. This can result is wrong PPD evaluation and hence incorrect results.

      Example:
      Table written in US/Eastern timezone. All values in this table are "2007-08-01 00:00:00.0".

      PPD disabled
      hive> set hive.optimize.index.filter=false;
      hive> select ORDER_DATE from ORDER_FACT_small where ORDER_DATE='2007-08-01 00:00:00.0' limit 1;
      2007-08-01 00:00:00.0
      OK
      
      PPD enabled
      set hive.optimize.index.filter=true;
      select ORDER_DATE from ORDER_FACT_small where ORDER_DATE='2007-08-01 00:00:00.0' limit 1;
      OK
      

      No rows are returned when PPD is enabled (reader timezone is UTC)

        Issue Links

          Activity

          Hide
          owen.omalley Owen O'Malley added a comment -

          I'd propose extending the file format like:

          orc_proto.proto
          message TimestampStatistics {
            // min,max values saved as milliseconds since epoch
            optional sint64 minimum = 1;
            optional sint64 maximum = 2;
            optional sint64 minimumUtc = 3;
            optional sint64 maximumUtc = 4;
          }
          

          where minimumUtc and maximumUtc are defined as <time in utc> - <epoch in utc> in milliseconds. We stop setting minimum and maximum and only set minimumUtc and maximumUtc. Old readers will not see the new min and max and new readers will ignore old values.

          Show
          owen.omalley Owen O'Malley added a comment - I'd propose extending the file format like: orc_proto.proto message TimestampStatistics { // min,max values saved as milliseconds since epoch optional sint64 minimum = 1; optional sint64 maximum = 2; optional sint64 minimumUtc = 3; optional sint64 maximumUtc = 4; } where minimumUtc and maximumUtc are defined as <time in utc> - <epoch in utc> in milliseconds. We stop setting minimum and maximum and only set minimumUtc and maximumUtc. Old readers will not see the new min and max and new readers will ignore old values.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user prasanthj opened a pull request:

          https://github.com/apache/orc/pull/87

          ORC-135: PPD for timestamp is wrong when reader and writer timezones are different

          When reader and writer timezones are different, PPD evaluation does not offset the timezone when reading the min and max values. This can result is wrong PPD evaluation and hence incorrect results.

          Example:
          Table written in US/Eastern timezone. All values in this table are "2007-08-01 00:00:00.0".
          *PPD disabled*
          ```
          hive> set hive.optimize.index.filter=false;
          hive> select ORDER_DATE from ORDER_FACT_small where ORDER_DATE='2007-08-01 00:00:00.0' limit 1;
          2007-08-01 00:00:00.0
          OK
          ```

          *PPD enabled*
          ```
          set hive.optimize.index.filter=true;
          select ORDER_DATE from ORDER_FACT_small where ORDER_DATE='2007-08-01 00:00:00.0' limit 1;
          OK
          ```
          No rows are returned when PPD is enabled (reader timezone is UTC)

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/prasanthj/orc ORC-135

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/orc/pull/87.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #87


          commit 5adda854d749afd773e1fb4e8002eafd6561d4f6
          Author: Prasanth Jayachandran <prasanthj@apache.org>
          Date: 2017-01-27T01:48:25Z

          ORC-135: PPD for timestamp is wrong when reader and writer timezones are different


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user prasanthj opened a pull request: https://github.com/apache/orc/pull/87 ORC-135 : PPD for timestamp is wrong when reader and writer timezones are different When reader and writer timezones are different, PPD evaluation does not offset the timezone when reading the min and max values. This can result is wrong PPD evaluation and hence incorrect results. Example: Table written in US/Eastern timezone. All values in this table are "2007-08-01 00:00:00.0". * PPD disabled * ``` hive> set hive.optimize.index.filter=false; hive> select ORDER_DATE from ORDER_FACT_small where ORDER_DATE='2007-08-01 00:00:00.0' limit 1; 2007-08-01 00:00:00.0 OK ``` * PPD enabled * ``` set hive.optimize.index.filter=true; select ORDER_DATE from ORDER_FACT_small where ORDER_DATE='2007-08-01 00:00:00.0' limit 1; OK ``` No rows are returned when PPD is enabled (reader timezone is UTC) You can merge this pull request into a Git repository by running: $ git pull https://github.com/prasanthj/orc ORC-135 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/87.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #87 commit 5adda854d749afd773e1fb4e8002eafd6561d4f6 Author: Prasanth Jayachandran <prasanthj@apache.org> Date: 2017-01-27T01:48:25Z ORC-135 : PPD for timestamp is wrong when reader and writer timezones are different
          Hide
          prasanth_j Prasanth Jayachandran added a comment -

          Following changes are made

          • TimestampColumnStatistics now stores min and max in minimumUtc and maximumUtc protobuf fields (This way old readers reading new files will return YES_NO_NULL as the statistics will be missing and new readers reading old files will have utc statistics missing and return YES_NO_NULL for PPD evaluation)
          • New encoding type is added to bloomfilter stream. Reader will skip PPD if writer version is old or if bloom filter is missing encoding kind.
          • Moved old bloom filter tests to test against different timezones

          I will create a follow up to disable bloomfilter PPD for timestamp column for the old readers.

          Show
          prasanth_j Prasanth Jayachandran added a comment - Following changes are made TimestampColumnStatistics now stores min and max in minimumUtc and maximumUtc protobuf fields (This way old readers reading new files will return YES_NO_NULL as the statistics will be missing and new readers reading old files will have utc statistics missing and return YES_NO_NULL for PPD evaluation) New encoding type is added to bloomfilter stream. Reader will skip PPD if writer version is old or if bloom filter is missing encoding kind. Moved old bloom filter tests to test against different timezones I will create a follow up to disable bloomfilter PPD for timestamp column for the old readers.
          Hide
          prasanth_j Prasanth Jayachandran added a comment -

          Owen O'Malley Can you please review the patch?

          Show
          prasanth_j Prasanth Jayachandran added a comment - Owen O'Malley Can you please review the patch?
          Hide
          prasanth_j Prasanth Jayachandran added a comment - - edited

          Created ORC-137 to disable bloomfilter PPD for old files.

          Show
          prasanth_j Prasanth Jayachandran added a comment - - edited Created ORC-137 to disable bloomfilter PPD for old files.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user prasanthj commented on a diff in the pull request:

          https://github.com/apache/orc/pull/87#discussion_r98298881

          — Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java —
          @@ -19,6 +19,9 @@

          import java.sql.Date;
          import java.sql.Timestamp;
          +import java.util.Calendar;
          +import java.util.GregorianCalendar;
          +import java.util.TimeZone;
          — End diff –

          These imports are not required. Will remove it in the next patch.

          Show
          githubbot ASF GitHub Bot added a comment - Github user prasanthj commented on a diff in the pull request: https://github.com/apache/orc/pull/87#discussion_r98298881 — Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java — @@ -19,6 +19,9 @@ import java.sql.Date; import java.sql.Timestamp; +import java.util.Calendar; +import java.util.GregorianCalendar; +import java.util.TimeZone; — End diff – These imports are not required. Will remove it in the next patch.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on the issue:

          https://github.com/apache/orc/pull/87

          Ok, I took a pass at this:

          https://github.com/omalley/orc/tree/orc-135

          Changes:

          • Block all use of PPD on timestamps from before ORC-135.
          • Block PPD on timestamp with numeric constraints
          • TimestampColumnStatistics.getMinimum and getMaximum return Timestamps in the local timezone.
          • Loading old TimestampColumnStatistics loads the min and max assuming local timezone.
          • Add utilities to SerializationUtils for conversion to and from utc.
          • Make BloomFilter.encoding an int so that we can detect future values.
          • Future BloomFilter encodings are not used.
          • Some more test cases
          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on the issue: https://github.com/apache/orc/pull/87 Ok, I took a pass at this: https://github.com/omalley/orc/tree/orc-135 Changes: Block all use of PPD on timestamps from before ORC-135 . Block PPD on timestamp with numeric constraints TimestampColumnStatistics.getMinimum and getMaximum return Timestamps in the local timezone. Loading old TimestampColumnStatistics loads the min and max assuming local timezone. Add utilities to SerializationUtils for conversion to and from utc. Make BloomFilter.encoding an int so that we can detect future values. Future BloomFilter encodings are not used. Some more test cases
          Hide
          prasanth_j Prasanth Jayachandran added a comment -

          Owen O'Malley Can you give a new PR with the new changes?

          Show
          prasanth_j Prasanth Jayachandran added a comment - Owen O'Malley Can you give a new PR with the new changes?
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user omalley opened a pull request:

          https://github.com/apache/orc/pull/90

          ORC-135. Fixing PPD for timestamps across timezones.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/omalley/orc orc-135

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/orc/pull/90.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #90


          commit a8ec096dd97483cca7a94aa291477bd5df4f5784
          Author: Prasanth Jayachandran <prasanthj@apache.org>
          Date: 2017-01-27T01:48:25Z

          ORC-135: PPD for timestamp is wrong when reader and writer timezones are different

          commit 356687c59298ea74448a4e0b292022821b30f518
          Author: Owen O'Malley <omalley@apache.org>
          Date: 2017-01-30T17:04:30Z

          ORC-135. Changes for moving the encoding to the ColumnEncoding.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/90 ORC-135 . Fixing PPD for timestamps across timezones. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-135 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/90.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #90 commit a8ec096dd97483cca7a94aa291477bd5df4f5784 Author: Prasanth Jayachandran <prasanthj@apache.org> Date: 2017-01-27T01:48:25Z ORC-135 : PPD for timestamp is wrong when reader and writer timezones are different commit 356687c59298ea74448a4e0b292022821b30f518 Author: Owen O'Malley <omalley@apache.org> Date: 2017-01-30T17:04:30Z ORC-135 . Changes for moving the encoding to the ColumnEncoding.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on the issue:

          https://github.com/apache/orc/pull/90

          Of course we should rebase & squash before committing.

          Show
          githubbot ASF GitHub Bot added a comment - Github user omalley commented on the issue: https://github.com/apache/orc/pull/90 Of course we should rebase & squash before committing.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/orc/pull/90

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/orc/pull/90
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/orc/pull/87

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/orc/pull/87
          Hide
          owen.omalley Owen O'Malley added a comment -

          I committed this. Thanks, Prasanth!

          Show
          owen.omalley Owen O'Malley added a comment - I committed this. Thanks, Prasanth!
          Hide
          owen.omalley Owen O'Malley added a comment -

          Released as part of ORC 1.3.2.

          Show
          owen.omalley Owen O'Malley added a comment - Released as part of ORC 1.3.2.

            People

            • Assignee:
              prasanth_j Prasanth Jayachandran
              Reporter:
              prasanth_j Prasanth Jayachandran
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development