Apache Arrow / ARROW-5618

[C++] [Parquet] Using deprecated Int96 storage for timestamps triggers integer overflow in some cases

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++

    Description

      When storing Arrow timestamps in Parquet files using the Int96 storage format, certain combinations of array lengths and validity bitmasks cause an integer overflow error on read. It's not immediately clear whether the Arrow/Parquet writer is storing zeroes where it should be storing positive values, or the reader is calculating a nanoseconds value from zeroed inputs that it should have skipped (perhaps because it misses the null bit flag). It's also not immediately clear why only columns of certain lengths seem to be affected.

      Probably the quickest way to reproduce this undefined behavior is to alter the existing unit test UseDeprecatedInt96 (in file .../arrow/cpp/src/parquet/arrow/arrow-reader-writer-test.cc) by quadrupling its column lengths (repeating the same values), and then to run 'make unittest' in a build made with clang-7 with sanitizers enabled. (Here's a patch applicable to current master that changes the test as described: [1]; I used the following cmake command to build my environment: [2].) You should get a log similar to [3]. If requested, I'll see if I can put together a stand-alone minimal test case that induces the behavior.

      The quick hack at [4] prevents the integer overflows, but it is included only to confirm the proximate cause of the bug: the Julian days field of the Int96 appears to be zero, when a strictly positive number is expected.
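
      To illustrate why a zeroed Julian days field overflows, here is a minimal stand-alone sketch of an overflow-checked Int96-to-nanoseconds conversion. It is not the Arrow/Parquet reader code and not the patch at [4]; the names Int96Timestamp and Int96ToUnixNanos are hypothetical, and the layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number) is assumed from the usual Int96 timestamp convention. It requires C++17 for std::optional.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical stand-in for a decoded Int96 timestamp (assumed layout:
// nanoseconds within the day, then the Julian day number).
struct Int96Timestamp {
  std::uint64_t nanos_of_day;
  std::uint32_t julian_day;
};

// Julian day number of 1970-01-01, and nanoseconds in one day.
constexpr std::int64_t kJulianDayOfUnixEpoch = 2440588;
constexpr std::int64_t kNanosPerDay = 86400LL * 1000 * 1000 * 1000;

// Returns nanoseconds since the Unix epoch, or std::nullopt when the
// result would not fit in a signed 64-bit integer.
std::optional<std::int64_t> Int96ToUnixNanos(const Int96Timestamp& ts) {
  constexpr std::int64_t kMax = std::numeric_limits<std::int64_t>::max();
  constexpr std::int64_t kMin = std::numeric_limits<std::int64_t>::min();

  // A zero Julian day yields days == -2440588, far outside the roughly
  // +/-106751 days that fit in int64 nanoseconds.
  const std::int64_t days =
      static_cast<std::int64_t>(ts.julian_day) - kJulianDayOfUnixEpoch;
  if (days > kMax / kNanosPerDay || days < kMin / kNanosPerDay) {
    return std::nullopt;  // days * kNanosPerDay would overflow
  }
  const std::int64_t day_nanos = days * kNanosPerDay;

  if (ts.nanos_of_day > static_cast<std::uint64_t>(kMax)) {
    return std::nullopt;  // nanosecond field itself does not fit
  }
  const std::int64_t nanos = static_cast<std::int64_t>(ts.nanos_of_day);

  // nanos is non-negative, so the sum can only overflow upward.
  if (day_nanos > kMax - nanos) {
    return std::nullopt;
  }
  return day_nanos + nanos;
}
```

      With this sketch, Int96ToUnixNanos({0, 0}) returns std::nullopt, because -2440588 days is about -2.1e20 nanoseconds, well beyond the int64 range; an unchecked multiplication on the same input is the signed overflow that the sanitizer reports. A checked conversion like this only masks the symptom; it does not explain why the Julian days field is zero in the first place.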

      I've assigned the issue to myself and I'll start looking into the root cause of this.

      [1] https://gist.github.com/tpboudreau/b6610c13cbfede4d6b171da681d1f94e
      [2] https://gist.github.com/tpboudreau/59178ca8cb50a935aab7477805aa32b9
      [3] https://gist.github.com/tpboudreau/0c2d0a18960c1aa04c838fa5c2ac7d2d
      [4] https://gist.github.com/tpboudreau/0993beb5c8c1488028e76fb2ca179b7f

            People

              Assignee: TP Boudreau (tpboudreau)
              Reporter: TP Boudreau (tpboudreau)
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m