ORC-1087: Seek overflow in an uncompressed chunk

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.7.0, 1.7.1, 1.7.2
    • Fix Version/s: 1.7.3
    • Component/s: C++
    • Labels: None

    Description

      Reading the attached ORC file with the SearchArgument "sr_return_amt > 10000" using the C++ reader fails with:

      Corrupt PATCHED_BASE encoded data (pl==0)!

      The file reads fine without the SearchArgument, and the Java reader reads it correctly with the same SearchArgument.

      The attached source file (scan_with_sarg.cc) reproduces the issue. Build the ORC library, then compile the file with:

      g++ scan_with_sarg.cc -o scan_with_sarg -I../c++/include -Ic++/include -Lc++/src/ -Lsnappy_ep-prefix/src/snappy_ep-build/ -Llz4_ep-prefix/src/lz4_ep-build/ -Lzlib_ep-prefix/src/zlib_ep-build/ -Lzstd_ep-prefix/src/zstd_ep-build/lib/ -Lprotobuf_ep-prefix/src/protobuf_ep-build/ -lorc -lz -lsnappy -llz4 -lzstd -lprotobuf
      

      Run it as

      $ LD_LIBRARY_PATH="$LD_LIBRARY_PATH:zstd_ep-prefix/src/zstd_ep-build/lib/" ./scan_with_sarg 
      leaf-0 = (column(id=17) <= 10000), expr = (not leaf-0)
      terminate called after throwing an instance of 'orc::ParseError'
        what():  Corrupt PATCHED_BASE encoded data (pl==0)!
      Aborted (core dumped)
      

      RCA

      The SearchArgument triggers a seek to RowGroup 42. The following code in DecompressionStream::seek does not handle the case where uncompressedBufferLength < posInChunk, so it seeks to an illegal position and the length subtraction overflows.

      if (headerPosition == seekedPosition
          && inputBufferStartPosition <= headerPosition + 3 && inputBufferStart) {
        position.next(); // Skip the input level position.
        size_t posInChunk = position.next(); // Chunk level position.
        // Overflow here! uncompressedBufferLength=30950, posInChunk=39498
        outputBufferLength = uncompressedBufferLength - posInChunk;
        outputBuffer = outputBufferStart + posInChunk;
        return;
      }
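
To make the arithmetic concrete, here is a minimal standalone sketch of that subtraction using the two values from the comment above. It assumes both operands are std::size_t (the snippet only shows this for posInChunk), and the names remainingAfterSeek and checkReportedValues are hypothetical helpers, not part of the ORC code base.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Same subtraction as in the snippet above, on unsigned operands.
// Assumption: both lengths are std::size_t, so a "negative" result wraps
// around modulo 2^N instead of failing.
constexpr std::size_t remainingAfterSeek(std::size_t uncompressedBufferLength,
                                         std::size_t posInChunk) {
  return uncompressedBufferLength - posInChunk;
}

// Hypothetical helper: checks the values reported in this issue.
inline void checkReportedValues() {
  constexpr std::size_t wrapped = remainingAfterSeek(30950, 39498);
  // The target is 39498 - 30950 = 8548 bytes past the buffered data, so the
  // wrapped result is 2^N - 8548, i.e. max() - 8547.
  assert(wrapped == std::numeric_limits<std::size_t>::max() - 8547);
  (void)wrapped;  // keeps builds with NDEBUG free of unused-variable warnings
}
```

The resulting outputBufferLength is therefore a huge value rather than an error, which would explain why the failure only surfaces later as corrupt PATCHED_BASE data.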

      The chunk is uncompressed, so it is read in pieces rather than as a whole. The target position (posInChunk) lies in a part of the chunk that has not been read yet. We need to handle this case.

      I think this only happens with uncompressed chunks. Compressed chunks are decompressed as a whole, so posInChunk is always valid within the output buffer.
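
One possible way to handle the case can be sketched with a toy model. This is not the actual patch: UncompressedChunk, readNextPiece, and the 65536-byte chunk length in the usage note are hypothetical, and the model simply counts bytes instead of holding real buffers. The idea it illustrates is to keep reading pieces until the buffered length covers posInChunk, and only then subtract.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical toy model of an uncompressed chunk that is read in pieces.
struct UncompressedChunk {
  std::size_t chunkLength;     // total bytes in the chunk
  std::size_t pieceSize;       // bytes made available per read
  std::size_t bufferedLength;  // bytes read so far (uncompressedBufferLength analogue)

  // Reads one more piece; returns false once the whole chunk has been read.
  bool readNextPiece() {
    assert(pieceSize > 0);
    if (bufferedLength >= chunkLength) return false;
    bufferedLength += std::min(pieceSize, chunkLength - bufferedLength);
    return true;
  }

  // Positions at posInChunk and returns the bytes buffered after it.
  std::size_t seek(std::size_t posInChunk) {
    // Guard: read forward until the target is covered by buffered data.
    while (bufferedLength < posInChunk) {
      if (!readNextPiece()) {
        throw std::out_of_range("seek position is beyond the end of the chunk");
      }
    }
    return bufferedLength - posInChunk;  // cannot wrap: bufferedLength >= posInChunk
  }
};
```

With the values from this issue, `UncompressedChunk c{65536, 30950, 30950}; c.seek(39498);` reads one more piece and returns 22402 instead of wrapping around. A compressed chunk would never enter the loop, since it is decompressed as a whole.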

      Attachments

        1. seek-issue-snappy-500k.orc
          23.78 MB
          Quanlong Huang
        2. scan_with_sarg.cc
          1 kB
          Quanlong Huang

            People

              Assignee: Quanlong Huang (stigahuang)
              Reporter: Quanlong Huang (stigahuang)