[ARROW-3762] [C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.15.0
Component/s: C++, Python
Labels:
- parquet
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/20081

Description

Reading a Parquet file with more than 2 GiB of binary data fails with an ArrowIOError, because the reader tries to build a single BinaryArray instead of a chunked array. As a workaround, reading each row group individually and then concatenating the resulting tables works.

import pyarrow as pa
import pyarrow.parquet as pq


# 2**30 one-character strings: 1 GiB of string data per table.
x = pa.array(list('1' * 2**30))

demo = 'demo.parquet'


def scenario():
    t = pa.Table.from_arrays([x], ['x'])
    writer = pq.ParquetWriter(demo, t.schema)
    # Write the same 1 GiB table twice, for 2 GiB of string data in total.
    for i in range(2):
        writer.write_table(t)
    writer.close()

    pf = pq.ParquetFile(demo)

    # Fails with:
    # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647
    t2 = pf.read()

    # Workaround: read each row group separately and concatenate the results.
    # Note that the file contains 32 row groups, not the 2 suggested by:
    # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    t3 = pa.concat_tables(tables)

scenario()
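
To illustrate why reading per row group helps, here is a minimal pure-Python sketch of the chunking idea. It is not Arrow's implementation: `plan_chunks` and `MAX_BINARY_BYTES` are hypothetical names introduced for illustration, and the limit is simply the value quoted in the error message above. The sketch assumes the only constraint is that one BinaryArray addresses its data with signed 32-bit offsets, so the total bytes per array must stay under that limit.

```python
from typing import Iterable, List

# Limit quoted in the error message of the reproduction above; it sits just
# below 2**31 because BinaryArray offsets are signed 32-bit integers.
# (Hypothetical constant name, used only in this sketch.)
MAX_BINARY_BYTES = 2147483646


def plan_chunks(value_sizes: Iterable[int],
                limit: int = MAX_BINARY_BYTES) -> List[List[int]]:
    """Group consecutive value sizes so each group's total stays within limit.

    Hypothetical helper: each returned group stands for one chunk of a
    chunked array.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for size in value_sizes:
        if size > limit:
            raise ValueError(f"single value of {size} bytes exceeds limit {limit}")
        # Start a new chunk when adding this value would overflow the current one.
        if current and current_bytes + size > limit:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks


# Two 1 GiB pieces, as written by the reproduction, do not fit in one array,
# so they end up in two separate chunks.
two_gib = plan_chunks([2**30, 2**30])
```

Under these assumptions, the per-row-group workaround succeeds because each row group stays under the limit on its own, and `pa.concat_tables` keeps the pieces as separate chunks rather than merging them into one array.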

Issue Links

is duplicated by

- ARROW-2654 [Python] Error with errno 22 when loading 3.6 GB Parquet file (Closed)
- ARROW-3139 [Python] ArrowIOError: Arrow error: Capacity error during read (Closed)

is related to

- ARROW-5030 [Python] read_row_group fails with Nested data conversions not implemented for chunked array outputs (Open)
- ARROW-2532 [C++] Add chunked builder classes (Open)

relates to

- ARROW-2227 [Python] Table.from_pandas does not create chunked_arrays. (Resolved)

links to

Apache Arrow Issue 1677

GitHub Pull Request #3171

GitHub Pull Request #4695

GitHub Pull Request #5312

People

Assignee: Ben Kietzman

Reporter: Left Screen

Votes: 0

Watchers: 11

Dates

Created: 01/Mar/18 20:07

Updated: 11/Jan/23 07:29

Resolved: 09/Sep/19 21:13

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 8h 40m