Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version: 0.13.0
- Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64
Description
Here's a contrived example that reproduces this issue using pandas:
import numpy as np
import pandas as pd

real_usernames = np.array(['anonymize', 'me'])
usernames = pd.util.hash_array(real_usernames)
login_count = [13, 9]
df = pd.DataFrame({'user': usernames, 'logins': login_count})

# This write succeeds...
df.to_parquet('can_write.parq', partition_cols=['user'])
# ...but it cannot be read back
pd.read_parquet('can_write.parq')
Expected behaviour:
- Either the write fails
- Or the read succeeds
Actual behaviour: The read fails with the following error:
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read
    **kwargs).to_pandas()
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1152, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py", line 181, in read_parquet
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1014, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 587, in read
    dictionary = partitions.levels[i].dictionary
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 642, in dictionary
    dictionary = lib.array(integer_keys)
  File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to C long
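The bottom of the trace appears to be the real clue: when rebuilding the partition dictionary, pyarrow parses the partition directory names back into Python ints (`lib.array(integer_keys)`) and packs them into a signed C long (int64). pandas' hash_array produces unsigned 64-bit values, so roughly half of them exceed what a signed int64 can hold. A minimal sketch of the mismatch, with a made-up sample hash value standing in for a real one:

```python
C_LONG_MAX = 2**63 - 1   # largest value a signed C long (int64) can hold
UINT64_MAX = 2**64 - 1   # largest value hash_array can produce

# Hypothetical uint64 hash output, chosen to sit in the upper half of
# the uint64 range where the overflow occurs.
sample_hash = 14402189758579388990

print(sample_hash <= UINT64_MAX)  # True: a valid uint64 value
print(sample_hash > C_LONG_MAX)   # True: too large for a signed int64
```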
I set the priority to Minor because it's easy enough to work around this in user code unless you really need the 64-bit hash (and you probably shouldn't be partitioning on that anyway).
I could take a stab at writing a patch for this if there's interest?