Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version: 2.0.0
- Environment: Python 3.6
Description
When I write a column of type `struct(string)` that has more elements than `write_batch_size`, the output contains only the first batch, repeated; the data in batches after the first is not written to the output at all.
I am only seeing this behaviour with Arrow 2.0.0; in 1.0.1 the output contains all the data as expected.
The following Python test case reproduces the problem; the last value in the output is "key-0" instead of the expected "key-1024":
import io

import pyarrow as pa
import pyarrow.parquet as pq


def test_struct_array():
    # One more row than the writer's default batch size of 1024.
    default_writer_batch_size = 1024
    n_samples = default_writer_batch_size + 1
    keys = [f"key-{i}" for i in range(n_samples)]
    expected = list(keys)
    struct_array = pa.StructArray.from_arrays(
        [pa.array(keys, type=pa.string())],
        names=["string"],
    )
    table = pa.table({"struct": struct_array})

    # Round-trip the table through Parquet and flatten the struct column.
    buf = io.BytesIO()
    pq.write_table(table, buf)
    actual = pq.read_table(buf).flatten()[0].to_pylist()

    # The first batch comes back intact ...
    assert actual[:1024] == expected[:1024]
    # ... but the row after it reads back as "key-0" instead of "key-1024".
    assert actual[-1] == expected[-1], (actual[-1], expected[-1])
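A possible mitigation until the fix is released, sketched below: cap each row group at the writer's batch size so that no row group ever spans more than one write batch. This is a hedged, untested sketch rather than a confirmed workaround; it assumes the corruption only affects batches after the first within a row group and that the default batch size is 1024 values. `row_group_size` is a documented `pq.write_table` argument, while `SAFE_ROWS_PER_GROUP` and `write_struct_table_safely` are hypothetical names introduced for this sketch.

import io

import pyarrow as pa
import pyarrow.parquet as pq

# Assumption: only write batches after the first within a row group are
# affected, and the default writer batch size is 1024 values.
SAFE_ROWS_PER_GROUP = 1024  # hypothetical constant for this sketch


def write_struct_table_safely(table, sink):
    # Hypothetical helper: capping row_group_size at the writer batch size
    # means every row group is written in a single batch.
    pq.write_table(table, sink, row_group_size=SAFE_ROWS_PER_GROUP)


keys = [f"key-{i}" for i in range(1025)]
struct_array = pa.StructArray.from_arrays(
    [pa.array(keys, type=pa.string())], names=["string"]
)
buf = io.BytesIO()
write_struct_table_safely(pa.table({"struct": struct_array}), buf)
# If the assumption holds, the last value reads back as "key-1024".
print(pq.read_table(buf).flatten()[0].to_pylist()[-1])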
Issue Links
- duplicates
  - ARROW-11257 [C++][Parquet] PyArrow Table contains different data after writing and reloading from Parquet (Closed)
  - ARROW-11069 [C++] Parquet writer incorrect data being written when data type is struct (Closed)
- is duplicated by
  - ARROW-11024 [C++][Parquet] Writing List<Struct> to parquet sometimes writes wrong data (Resolved)