Apache Arrow / ARROW-10493

[C++][Parquet] Writing nullable nested strings results in wrong data in file

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 2.0.0
    • Fix Version: 3.0.0
    • Component: C++
    • Environment: Python 3.6

    Description

      When I write a column of type `struct(string)` that has more elements than `write_batch_size`, the output contains only the first batch, repeated. The data in every batch after the first is never written to the output.

      I only see this behaviour with Arrow 2.0.0; with 1.0.1 the output contains all the data, as expected.
       
      This Python test case reproduces the problem. The last value in the output is "key-0" instead of the expected "key-1024":
       

      import io
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      def test_struct_array():
          # Use one row more than the writer batch size so a second batch is needed.
          default_writer_batch_size = 1024
          n_samples = default_writer_batch_size + 1
          keys = [f"key-{i}" for i in range(n_samples)]
          expected = list(keys)
      
          struct_array = pa.StructArray.from_arrays(
              [pa.array(keys, type=pa.string())],
              names=["string"],
          )
          table = pa.table({"struct": struct_array})
      
          buf = io.BytesIO()
          pq.write_table(table, buf)
      
          actual = pq.read_table(buf).flatten()[0].to_pylist()
      
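          # The first batch round-trips correctly; the value in the second batch does not.
          assert actual[:default_writer_batch_size] == expected[:default_writer_batch_size]
          assert actual[-1] == expected[-1], (actual[-1], expected[-1])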
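
To make the symptom concrete, here is a small pure-Python model of what the report describes: every batch after the first comes out as the leading values of the first batch. This is only a sketch of the observed output, not Arrow's writer code and not a claim about the root cause. The helper names `model_repeated_first_batch` and `first_mismatch` are made up for illustration.

```python
from typing import List, Sequence


def model_repeated_first_batch(values: Sequence[str], batch_size: int) -> List[str]:
    """Model the reported symptom: each batch after the first is replaced
    by the leading values of the first batch. Illustration only."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    first_batch = list(values[:batch_size])
    output: List[str] = []
    for start in range(0, len(values), batch_size):
        batch_len = min(batch_size, len(values) - start)
        # A correct writer would emit values[start:start + batch_len] here.
        output.extend(first_batch[:batch_len])
    return output


def first_mismatch(expected: Sequence[str], actual: Sequence[str]) -> int:
    """Return the index of the first differing element, or -1 if equal."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) == len(actual):
        return -1
    return min(len(expected), len(actual))


keys = [f"key-{i}" for i in range(1025)]
corrupted = model_repeated_first_batch(keys, batch_size=1024)
print(corrupted[-1])                    # key-0
print(first_mismatch(keys, corrupted))  # 1024
```

With 1025 rows and a batch size of 1024, the model gives "key-0" as the last value, which matches the failing assertion in the test case above. Something like `first_mismatch` could also be applied to `expected` and `actual` from the real test to check whether the corruption starts exactly at the batch boundary.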
      

       

            People

              chrisavl Christian Lundgren
              chrisavl Christian Lundgren
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h