Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.0.0, 1.0.1, 2.0.0, 3.0.0
- Environment: Python 3.7
Description
I'm getting unexpected results when reading tables containing list values and a large number of rows from a parquet file.
Example code (pyarrow 2.0.0 and 3.0.0):
from pyarrow import parquet, Table

data = [None] * (1 << 20)  # 1,048,576 null rows
data.append([1])           # followed by a single list value
table = Table.from_arrays([data], ['column'])
print('Expected: %s' % table['column'][-1])
parquet.write_table(table, 'table.parquet')
table2 = parquet.read_table('table.parquet')
print('Actual: %s' % table2['column'][-1])
Output:
Expected: [1]
Actual: [0]
When I decrease the number of rows by 1 (by using (1 << 20) - 1), I get:
Expected: [1]
Actual: [1]
For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15.
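For reference, a minimal sketch of how the exact failing row count can be located by bisection. The roundtrip_ok helper, the bounds, and the assumption that the failure is monotonic in the row count are mine, not part of the report; it reuses the same round trip as the example above:

from pyarrow import parquet, Table

def roundtrip_ok(n_rows):
    # True if the trailing list value survives a parquet round trip.
    data = [None] * n_rows
    data.append([1])
    table = Table.from_arrays([data], ['column'])
    parquet.write_table(table, 'table.parquet')
    table2 = parquet.read_table('table.parquet')
    return table2['column'][-1].as_py() == [1]

# Bisect between a known-good and a known-bad null-row count.
lo, hi = 1, 1 << 20  # hypothetical bounds; adjust per pyarrow version
while lo + 1 < hi:
    mid = (lo + hi) // 2
    if roundtrip_ok(mid):
        lo = mid
    else:
        hi = mid
print('First failing row count: %d' % hi)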
It seems that this is caused by some overflow and memory corruption, because in pyarrow 3.0.0, with more complex values (a list of dictionaries containing a float and a datetime):
data.append([{'a': 0.1, 'b': datetime.now()}])
I'm getting this error after calling table2.to_pandas():
/arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool
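Put together, a minimal sketch of the full repro for this crashing case, assembled from the snippets above (the filename and single-column table structure mirror the first example):

from datetime import datetime
from pyarrow import parquet, Table

data = [None] * (1 << 20)
data.append([{'a': 0.1, 'b': datetime.now()}])  # list of structs: float + timestamp
table = Table.from_arrays([data], ['column'])
parquet.write_table(table, 'table.parquet')
table2 = parquet.read_table('table.parquet')
df = table2.to_pandas()  # fails here on pyarrow 3.0.0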