  Apache Arrow / ARROW-1446

Python: Writing more than 2^31 rows from pandas dataframe causes row count overflow error


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.7.0
    • Component/s: Python
    • Labels: None

    Description

      I have the following code:

      import pyarrow
      import pyarrow.parquet as pq
      
      client = pyarrow.HdfsClient("<host>", <port>, "<user>", driver='libhdfs3')
      abc_table = client.read_parquet('<source parquet>', nthreads=16)
      abc_df = abc_table.to_pandas()
      abc_table = pyarrow.Table.from_pandas(abc_df)
      with client.open('<target parquet>', 'wb') as f:
          pq.write_table(abc_table, f)
      

      <source parquet> contains 2497301128 rows.

      During the write, however, I get the following error:

      Traceback (most recent call last):
        File "pyarrow_cluster.py", line 29, in <module>
          main()
        File "pyarrow_cluster.py", line 26, in main
          pq.write_table(nmi_table, f)
        File "<home dir>/miniconda2/envs/parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 796, in write_table
          writer.write_table(table, row_group_size=row_group_size)
        File "_parquet.pyx", line 663, in pyarrow._parquet.ParquetWriter.write_table
        File "error.pxi", line 72, in pyarrow.lib.check_status
      pyarrow.lib.ArrowIOError: Written rows: -1797666168 != expected rows: 2497301128 in the current column chunk

      The negative written-row count suggests a signed 32-bit integer has overflowed: 2497301128 - 2**32 = -1797666168, which is exactly the value in the error message.
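The arithmetic above can be checked directly: wrapping the expected row count into the signed 32-bit range reproduces the negative value from the `ArrowIOError`. A minimal sketch in pure Python (no pyarrow required):

```python
expected = 2497301128  # rows in the source file, larger than 2**31 - 1

# Two's-complement wraparound: the value a C int32 row counter would hold
wrapped = (expected + 2**31) % 2**32 - 2**31
print(wrapped)  # -1797666168, matching the error message
```

Since the count exceeds the `int32` maximum by roughly 350 million rows, the counter wraps once past 2**31 - 1 into negative territory.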


            People

              Assignee: Wes McKinney (wesm)
              Reporter: James Porritt (jporritt)
              Votes: 0
              Watchers: 3
