ORC-611

Incorrect min-max stats for sub-millisecond timestamps

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.4, 1.7.0
    • Component/s: C++, Java
    • Labels: None

    Description

      The issue is related to the precision of storing timestamps:

      • nanoseconds for the data itself
      • only milliseconds for min-max statistics

      Both min and max are rounded to the same value, while min should be rounded down and max should be rounded up to ensure that the values are actually within that range.
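The rounding rule described above can be sketched as follows. This is only an illustration of floor-for-min and ceil-for-max on integer nanosecond values, not the code of the ORC writers; the helper names (`floor_to_millis`, `ceil_to_millis`, `millis_stats`) are hypothetical.

```python
NANOS_PER_MILLI = 1_000_000


def floor_to_millis(nanos):
    """Round toward negative infinity, which is safe for a minimum."""
    return nanos // NANOS_PER_MILLI


def ceil_to_millis(nanos):
    """Round toward positive infinity, which is safe for a maximum."""
    return -(-nanos // NANOS_PER_MILLI)


def millis_stats(values_nanos):
    """Return (min_ms, max_ms) bounds that contain every input value."""
    return floor_to_millis(min(values_nanos)), ceil_to_millis(max(values_nanos))


# The repro value 1970-01-01 00:00:00.0005 is 500_000 ns after the epoch.
min_ms, max_ms = millis_stats([500_000])

# The original value lies inside the millisecond range only because
# the two bounds were rounded in opposite directions.
inside = min_ms * NANOS_PER_MILLI <= 500_000 <= max_ms * NANOS_PER_MILLI
```

With a single rounding direction for both bounds, `min_ms` and `max_ms` would be equal and the sub-millisecond value would fall outside the recorded range.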

      Repro in Hive:

      create table tsstat (ts timestamp) stored as orc;
      insert into tsstat values ("1970-01-01 00:00:00.0005");
      select * from tsstat where ts = "1970-01-01 00:00:00.0005";
      -- returned 0 rows
      

      Both the Java and the C++ writers have this issue (thanks to Quanlong Huang for looking them up):
      https://github.com/apache/orc/blob/fea154436c37c81a16b13d879b510096cfaa2946/java/core/src/java/org/apache/orc/impl/writer/TimestampTreeWriter.java#L108
      https://github.com/apache/orc/blob/fea154436c37c81a16b13d879b510096cfaa2946/c%2B%2B/src/ColumnWriter.cc#L1800

      I guess that there are already files with this issue in production, so I think that the only way to fix this is to hack the reader:

      • decrease the min and increase the max stats by 1 ms after reading them from the file
      • also be careful about the values pushed down, as the same precision loss can occur there too, e.g. "WHERE ts < '1970-01-01 00:00:00.0005' AND ts > '1970-01-01 00:00:00.0004'" shouldn't be converted to ts < "1970-01-01" AND ts > "1970-01-01"
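A minimal sketch of the two reader-side precautions listed above, assuming the stats are stored as integer milliseconds and predicate literals are kept as integer nanoseconds. The function names are hypothetical and do not correspond to any ORC, Hive, or Impala API.

```python
NANOS_PER_MILLI = 1_000_000


def widen_stats(min_ms, max_ms):
    """Pad the stored millisecond stats by 1 ms on each side."""
    return min_ms - 1, max_ms + 1


def may_match_open_range(min_ms, max_ms, lower_nanos, upper_nanos):
    """Return True if a value v with lower_nanos < v < upper_nanos
    could lie within [min_ms, max_ms]. The comparison is done in
    nanoseconds so the literals keep their full precision."""
    lo = min_ms * NANOS_PER_MILLI
    hi = max_ms * NANOS_PER_MILLI
    return lower_nanos < hi and upper_nanos > lo


# Predicate from the example: ts > 00:00:00.0004 AND ts < 00:00:00.0005.
lower, upper = 400_000, 500_000

# Assumed stored stats for a file holding a value such as 450_000 ns,
# with both bounds rounded to the same millisecond (0).
stored = (0, 0)

pruned_without_fix = not may_match_open_range(*stored, lower, upper)
kept_with_fix = may_match_open_range(*widen_stats(*stored), lower, upper)

# Truncating the literals to milliseconds collapses the open range
# (0, 0), which no value can satisfy.
collapsed = (lower // NANOS_PER_MILLI, upper // NANOS_PER_MILLI)
```

Padding by a full millisecond is conservative: it can only cause extra data to be read, never cause matching rows to be skipped.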

      The issue was discovered during an Impala review: https://gerrit.cloudera.org/#/c/15403/1/be/src/exec/hdfs-orc-scanner.cc@875

          People

            Assignee: Panagiotis Garefalakis (pgaref)
            Reporter: Csaba Ringhofer (csringhofer)
            Votes: 0
            Watchers: 11