Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
- When reading a Parquet file with binary data > 2 GiB, we get an ArrowIOError because the reader does not create chunked arrays. Reading each row group individually and then concatenating the tables works, however.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

x = pa.array(list('1' * 2**30))
demo = 'demo.parquet'

def scenario():
    t = pa.Table.from_arrays([x], ['x'])
    writer = pq.ParquetWriter(demo, t.schema)
    for i in range(2):
        writer.write_table(t)
    writer.close()

    pf = pq.ParquetFile(demo)

    # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot
    # contain more than 2147483646 bytes, have 2147483647
    t2 = pf.read()

    # Works, but note, there are 32 row groups, not 2 as suggested by:
    # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    t3 = pa.concat_tables(tables)

scenario()
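For reference, the row-group workaround sidesteps the limit because each row group comes back as its own table and concatenation keeps them as separate chunks of a ChunkedArray, so no single BinaryArray has to hold more than 2**31 - 2 bytes. A minimal sketch of inspecting the result (assuming demo.parquet was already written by the reproduction above):

import pyarrow as pa
import pyarrow.parquet as pq

# Assumes 'demo.parquet' exists, written by the repro above.
pf = pq.ParquetFile('demo.parquet')
tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
t = pa.concat_tables(tables)

# The column is a ChunkedArray, typically with one chunk per row group
# read, rather than a single contiguous BinaryArray.
col = t.column('x')
print(type(col))       # <class 'pyarrow.lib.ChunkedArray'>
print(col.num_chunks)  # > 1 for data over the 2 GiB limit
print(sum(len(c) for c in col.chunks) == t.num_rows)  # True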
Issue Links
- is duplicated by:
  - ARROW-2654 [Python] Error with errno 22 when loading 3.6 GB Parquet file (Closed)
  - ARROW-3139 [Python] ArrowIOError: Arrow error: Capacity error during read (Closed)
- is related to:
  - ARROW-5030 [Python] read_row_group fails with Nested data conversions not implemented for chunked array outputs (Open)
  - ARROW-2532 [C++] Add chunked builder classes (Open)
- relates to:
  - ARROW-2227 [Python] Table.from_pandas does not create chunked_arrays. (Resolved)