Apache Arrow / ARROW-6861

[Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.15.0
    • Fix Version/s: 0.15.1, 0.16.0
    • Component/s: C++, Python
    • Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)

    Description

      I'll need to jump through hoops to upload the (seemingly-valid) Parquet file that triggers this bug. In the meantime, here's the error I get, reading the Parquet file with read_dictionary=true. I'll start with the stack trace:

      Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

      #0 0x0000000000b9fffd in __cxa_throw ()
      #1 0x00000000004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow (this=0x555556612e50, num_values=67339, null_count=0, valid_bits=0x7f39a764b780 '\377' <repeats 200 times>..., valid_bits_offset=748544, builder=0x555556616330) at /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886
      #2 0x000000000046d703 in parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced (this=0x555556616260, values_to_read=67339, null_count=0) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314
      #3 0x00000000004a13f8 in parquet::internal::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6> >::ReadRecordData (this=0x555556616260, num_records=67339) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096
      #4 0x0000000000493876 in parquet::internal::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6> >::ReadRecords (this=0x555556616260, num_records=815883) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875
      #5 0x0000000000413955 in parquet::arrow::LeafReader::NextBatch (this=0x555556615640, records_to_read=815883, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413
      #6 0x0000000000412081 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x5555566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218
      #7 0x00000000004121b0 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x5555566067a0, i=7, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223
      #8 0x0000000000405fbd in readParquet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
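
      For reference, here is a minimal sketch of the C++ read path that produces this trace, assuming Arrow/Parquet 0.15's Status-and-out-pointer APIs. ReadDictColumn is a placeholder for my readParquet() helper; column index 7 matches frames #6 and #7 above:

        #include <arrow/api.h>
        #include <arrow/io/file.h>
        #include <parquet/arrow/reader.h>
        #include <parquet/file_reader.h>
        #include <parquet/properties.h>

        arrow::Status ReadDictColumn(const std::string& path) {
          std::shared_ptr<arrow::io::ReadableFile> infile;
          ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &infile));

          // Request DictionaryArray output for column 7 (read_dictionary=true).
          parquet::ArrowReaderProperties props;
          props.set_read_dictionary(7, true);

          std::unique_ptr<parquet::arrow::FileReader> reader;
          ARROW_RETURN_NOT_OK(parquet::arrow::FileReader::Make(
              arrow::default_memory_pool(), parquet::ParquetFileReader::Open(infile),
              props, &reader));

          // Fails on the affected column with "Invalid: Resize cannot downsize".
          std::shared_ptr<arrow::ChunkedArray> column;
          return reader->ReadColumn(7, &column);
        }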

      And now a report of my gdb adventures:

      In Arrow 0.15.0, when reading a particular dictionary column (read_dictionary=true) with 815883 rows that was written by Arrow 0.14.1, arrow::Dictionary32Builder<arrow::BinaryType>::AppendIndices(...) is called twice (once with 493568 values, once with 254976 values), and then PlainByteArrayDecoder::DecodeArrow() is called. (I'm a novice; I don't know why this column comes in three batches.) On the first AppendIndices() call, the buffer capacity equals the number of values, 493568. On the second call that is no longer enough: the buffer grows using BufferBuilder::GrowByFactor, which doubles the capacity from 493568 to 987136.

      But there's a bug: the 987136-capacity buffer lives in Dictionary32Builder::indices_builder_, so 987136 is stored in Dictionary32Builder::indices_builder_.capacity_. Dictionary32Builder::capacity_ itself does not change when AppendIndices() is called. (Dictionary32Builder behaves like a proxy for its indices_builder_, but its capacity() method is not virtual, so the outer builder's cached capacity can fall out of sync.)

      So builder.capacity_ is still 0. Then comes the final batch of 67339 values, via DecodeArrow(), which calls builder->Reserve(num_values). Reserve() tries to grow the capacity from 0 (its wrong, cached value) to length_ + num_values = 748544 + 67339 = 815883. But indices_builder_->capacity_ is already 987136, so resizing to 815883 would be a downsize, which throws the exception above.
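
      To make that concrete, here is a condensed, hypothetical sketch of the call sequence as I understand it from gdb (not a compiling reproduction; batch1_indices and batch2_indices are placeholders, and the counts are the ones observed for this column):

        // Hypothetical sketch of the observed call sequence (Arrow 0.15.0 internals).
        arrow::Dictionary32Builder<arrow::BinaryType> builder;

        // Batch 1: indices_builder_.capacity_ becomes 493568.
        builder.AppendIndices(batch1_indices, 493568);
        // Batch 2: BufferBuilder::GrowByFactor doubles indices_builder_.capacity_ to 987136.
        builder.AppendIndices(batch2_indices, 254976);
        // builder.capacity_ is still 0: AppendIndices() never updates the outer cache.

        // Final batch, issued from PlainByteArrayDecoder::DecodeArrow():
        builder.Reserve(67339);
        // Reserve() sees capacity_ == 0, computes length_ + 67339 == 815883, and asks
        // indices_builder_ to resize from 987136 down to 815883:
        // "Invalid: Resize cannot downsize".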

      The only workaround I can find: use read_dictionary=false.

      This affects Python, too.

      I've attached a patch that fixes the issue for my file. I don't know how to formulate a reduction, though, so I haven't contributed unit tests. I'm also not certain how FinishInternal is meant to work, so this definitely needs expert review. (FinishInternal was definitely buggy before my patch; after my patch it may still be buggy, for all I know.)

      Attachments

        1. fix-dict-builder-capacity.diff
          2 kB
          Adam Hooper
        2. parquet-written-by-arrow-0-14-1.7z
          53.07 MB
          Adam Hooper


          Activity

            adamhooper Adam Hooper added a comment -

            I've attached a Parquet file, written by Arrow 0.14.1, which causes this problem. Column 8 (among others) causes this problem. Most columns work fine.

            wesm Wes McKinney added a comment -

            Thanks. This should be enough information to help write a unit test to reproduce the issue. bkietz are you interested in taking a look?

            wesm Wes McKinney added a comment -

            Seems like a good candidate for 0.15.1. Marked as such

            wesm Wes McKinney added a comment -

            I started looking at this

            wesm Wes McKinney added a comment -

            Issue resolved by pull request 5643
            https://github.com/apache/arrow/pull/5643

            rokm Rok Mihevc added a comment -

            This issue has been migrated to issue #23190 on GitHub. Please see the migration documentation for further details.


            People

              Assignee: wesm Wes McKinney
              Reporter: adamhooper Adam Hooper
              Votes: 0
              Watchers: 3


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 50m