Parquet / PARQUET-1438

[C++] corrupted files produced on 32-bit architecture (i686)

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12.0
    • Component/s: None
    • Labels: None

    Description

      I'm using the C++ API to convert some data to Parquet files. I noticed a regression when upgrading from arrow-cpp 0.10.0 + parquet-cpp 1.5.0 to arrow-cpp 0.11.0: the Parquet files are written without any error, but when I try to read them using pyarrow I get a segfault:

      #0  0x00007fffd17c7f0f in int arrow::util::RleDecoder::GetBatchWithDictSpaced<float>(float const*, float*, int, int, unsigned char const*, long) ()
         from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
      #1  0x00007fffd17c8025 in parquet::DictionaryDecoder<parquet::DataType<(parquet::Type::type)4> >::DecodeSpaced(float*, int, int, unsigned char const*, long) ()
         from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
      #2  0x00007fffd17bcf0f in parquet::internal::TypedRecordReader<parquet::DataType<(parquet::Type::type)4> >::ReadRecordData(long) ()
         from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
      #3  0x00007fffd17bfbea in parquet::internal::TypedRecordReader<parquet::DataType<(parquet::Type::type)4> >::ReadRecords(long) ()
         from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
      #4  0x00007fffd179d2f7 in parquet::arrow::PrimitiveImpl::NextBatch(long, std::shared_ptr<arrow::Array>*) ()
         from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
      #5  0x00007fffd1797162 in parquet::arrow::ColumnReader::NextBatch(long, std::shared_ptr<arrow::Array>*) ()
         from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
      #6  0x00007fffd179a6e5 in parquet::arrow::FileReader::Impl::ReadSchemaField(int, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Array>*) ()
         from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
      #7  0x00007fffd179aaad in parquet::arrow::FileReader::Impl::ReadTable(std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*)::{lambda(int)#1}::operator()(int) const () from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
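
Because a segfault like the one above kills the whole Python interpreter, it can help to attempt the read in a child process when checking suspect files. The following is a minimal sketch and not part of the original report: `run_isolated` and `read_in_subprocess` are hypothetical helper names, and the snippet assumes pyarrow is installed and that `pyarrow.parquet.read_table` is the call being exercised. On POSIX systems a negative return code -N means the child was terminated by signal N (for example -11 for SIGSEGV).

```python
import subprocess
import sys

# Code executed in the child process; the file path arrives as sys.argv[1].
# Assumption: pyarrow is installed in the same interpreter environment.
READ_SNIPPET = (
    "import sys, pyarrow.parquet as pq; "
    "pq.read_table(sys.argv[1])"
)


def run_isolated(snippet, *args):
    """Run a Python snippet in a child interpreter and return its exit code.

    On POSIX, a negative value -N means the child died from signal N.
    """
    proc = subprocess.run(
        [sys.executable, "-c", snippet, *args],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode


def read_in_subprocess(path):
    """Try to read a Parquet file without risking the calling process."""
    return run_isolated(READ_SNIPPET, path)


# Example usage with the attached files (not executed here):
#   for name in ("32.parquet", "64.parquet"):
#       print(name, "->", read_in_subprocess(name))
```

A return code of 0 means the read completed; any other value means the child failed, either through an ordinary Python exception (positive code) or a fatal signal (negative code on POSIX).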
      

      I have not been able to get to the bottom of the issue, but the problem seems to reproduce only when I run 32-bit binaries. After I learned that, I found that 32-bit and 64-bit builds produce very different Parquet files for the same data. The sizes of the structures are clearly different when I look at their hexdumps. I'm attaching example files. Reading "32.parquet" (produced using i686 binaries) causes a segfault on macOS and Linux, while "64.parquet" reads just fine.
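
As a complement to comparing raw hexdumps, the overall layout of the two files can be compared directly. A Parquet file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the little-endian length of the footer metadata. The sketch below only reports those sizes; it does not decode the metadata or diagnose the corruption. The function names are hypothetical and chosen for illustration.

```python
import struct

MAGIC = b"PAR1"


def footer_summary(data):
    """Return basic layout sizes for a Parquet file given as bytes."""
    # Smallest possible layout: magic + footer length + magic = 12 bytes.
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # Footer length is a 4-byte little-endian integer before the last magic.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len > len(data) - 12:
        raise ValueError("footer length exceeds file size")
    return {
        "file_size": len(data),
        "footer_len": footer_len,
        # Everything between the leading magic and the footer metadata.
        "body_len": len(data) - 12 - footer_len,
    }


def summarize_file(path):
    """Read a file from disk and summarize its layout."""
    with open(path, "rb") as f:
        return footer_summary(f.read())


# Example usage with the attached files (not executed here):
#   print(summarize_file("32.parquet"))
#   print(summarize_file("64.parquet"))
```

Comparing `footer_len` and `body_len` between the two files shows whether the size difference sits in the metadata or in the column data, which narrows down where to look in the hexdump.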

      Attachments

        1. 32.parquet
          2 kB
          Dmitry Kalinkin
        2. 64.parquet
          1.0 kB
          Dmitry Kalinkin
        3. arrow_0.10.0_i686_test_fail.log
          153 kB
          Dmitry Kalinkin
        4. arrow_0.11.0_i686_test_fail.log
          1.59 MB
          Dmitry Kalinkin
        5. parquet_1.5.0_i686_test_success.log
          2 kB
          Dmitry Kalinkin

          People

            Assignee: Unassigned
            Reporter: Dmitry Kalinkin (veprbl)
            Votes: 0
            Watchers: 3