[ARROW-11607] [Python] Error when reading table with list values from parquet - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0.0, 1.0.1, 2.0.0, 3.0.0
Fix Version/s: 4.0.0
Component/s: C++, Python
Labels:
- pull-request-available
Environment:
Python 3.7

External issue URL:
https://github.com/apache/arrow/issues/27474

Description

I'm getting unexpected results when reading tables containing list values and a large number of rows from a parquet file.

Example code (pyarrow 2.0.0 and 3.0.0):

from pyarrow import parquet, Table

data = [None] * (1 << 20)
data.append([1])

table = Table.from_arrays([data], ['column'])
print('Expected: %s' % table['column'][-1])

parquet.write_table(table, 'table.parquet')

table2 = parquet.read_table('table.parquet')
print('Actual:   %s' % table2['column'][-1])

Output:

Expected: [1]
Actual:   [0]

When I decrease the number of rows by 1 (by using (1 << 20) - 1), I get:

Expected: [1]
Actual:   [1]

For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15.
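The threshold described above could be located on any given pyarrow version with a small bisection, sketched here under stated assumptions. The helper names `last_value_survives` and `first_failing_count` are hypothetical and not part of pyarrow. The search also assumes that once the round trip fails for some number of leading nulls it keeps failing for larger counts, which the report only demonstrates at two points.

```python
import os
import tempfile


def last_value_survives(num_nulls):
    """Write num_nulls null rows followed by [1]; report whether [1] reads back."""
    # Imported lazily so the search helper below works without pyarrow installed.
    import pyarrow as pa
    from pyarrow import parquet

    data = [None] * num_nulls
    data.append([1])
    table = pa.Table.from_arrays([pa.array(data)], ['column'])

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'table.parquet')
        parquet.write_table(table, path)
        table2 = parquet.read_table(path)

    return table2['column'][-1].as_py() == [1]


def first_failing_count(check, lo, hi):
    """Return the smallest n in (lo, hi] for which check(n) is False.

    Assumes check(lo) is True, check(hi) is False, and that check never
    turns True again after its first False (an assumption, not a fact
    established by the report).
    """
    if not check(lo) or check(hi):
        raise ValueError('need check(lo) True and check(hi) False')
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if check(mid):
            lo = mid   # round trip still correct at mid
        else:
            hi = mid   # round trip already broken at mid
    return hi
```

On an affected version, calling `first_failing_count(last_value_survives, 0, 1 << 21)` would be expected to land on the threshold reported here if the observations above hold. On a fixed version the call raises `ValueError`, because the round trip no longer fails at the upper bound.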

This looks like an overflow that corrupts memory, because in pyarrow 3.0.0 with more complex values (a list of dictionaries holding a float and a datetime):

from datetime import datetime
data.append([{'a': 0.1, 'b': datetime.now()}])

I get this exception after calling table2.to_pandas():

/arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool

Issue Links

links to

GitHub Pull Request #9498

People

Assignee: Micah Kornfield

Reporter: Michal Glaus

Votes: 0

Watchers: 4

Dates

Created: 12/Feb/21 12:36

Updated: 11/Jan/23 08:20

Resolved: 17/Feb/21 14:52
