[ARROW-5430] [Python] Can write but not read parquet partitioned on large ints - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.13.0
Fix Version/s: 0.14.0
Component/s: Python
Labels:
- parquet
- pull-request-available
Environment:
Mac OSX 10.14.4, Python 3.7.1, x86_64.

External issue URL:
https://github.com/apache/arrow/issues/21883

Description

Here's a contrived example that reproduces this issue using pandas:

import numpy as np
import pandas as pd

real_usernames = np.array(['anonymize', 'me'])
# hash_array returns uint64 values, which can exceed the signed 64-bit range
usernames = pd.util.hash_array(real_usernames)
login_count = [13, 9]
df = pd.DataFrame({'user': usernames, 'logins': login_count})
# The write succeeds...
df.to_parquet('can_write.parq', partition_cols=['user'])
# ...but the read fails
pd.read_parquet('can_write.parq')

Expected behaviour:

- Either the write fails
- Or the read succeeds

Actual behaviour: The read fails with the following error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read
    **kwargs).to_pandas()
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1152, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py", line 181, in read_parquet
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1014, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 587, in read
    dictionary = partitions.levels[i].dictionary
  File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 642, in dictionary
    dictionary = lib.array(integer_keys)
  File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to C long
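
The last frames of the traceback (`dictionary = lib.array(integer_keys)`) suggest that the partition keys are parsed back into Python integers and then converted to a C integer type. A minimal sketch of why that can fail for unsigned 64-bit hashes, using only the standard library. The helper names `fits_in_int64` and `parse_partition_dir` are hypothetical, and the `key=value` directory layout is an assumption about how the partitioned dataset is stored, not pyarrow's actual code:

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def fits_in_int64(value: int) -> bool:
    """Return True if `value` can be stored in a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def parse_partition_dir(dirname: str):
    """Split a 'key=value' directory name into (key, int(value)).

    Hypothetical helper: it only mimics the idea of reading an integer
    partition key back from a directory name.
    """
    key, _, raw = dirname.partition("=")
    return key, int(raw)


# The largest possible uint64 hash parses fine as a Python int,
# but it does not fit in a signed 64-bit integer.
key, value = parse_partition_dir("user=18446744073709551615")
print(key, value, fits_in_int64(value))
```

Python integers are arbitrary precision, so the parse itself succeeds; the failure would only appear at the point where the value has to fit a fixed-width C type.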

I set the priority to minor because it's easy enough to work around this in user code unless you really need the full 64-bit hash (and you probably shouldn't be partitioning on that anyway).
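
One possible user-code workaround, sketched under the assumption that the reader's limit is a signed 64-bit integer (as the error message suggests): drop the top bit of each hash so it always fits. The function name `fold_hash_to_int63` is hypothetical, and this is only an illustration, not the fix that was merged:

```python
INT63_MASK = (1 << 63) - 1  # keeps the low 63 bits


def fold_hash_to_int63(hash_value: int) -> int:
    """Drop the top bit of an unsigned 64-bit hash so it fits in int64."""
    if not 0 <= hash_value < (1 << 64):
        raise ValueError("expected an unsigned 64-bit value")
    return hash_value & INT63_MASK


hashes = [5, 2**63 + 5, 2**64 - 1]
folded = [fold_hash_to_int63(h) for h in hashes]
print(folded)
```

The trade-off is that two hashes differing only in the top bit (such as `5` and `2**63 + 5` here) fold to the same partition key, which is why this only suits cases where the full 64-bit hash is not required. In the reproduction, the same masking could be applied to the `usernames` array before building the DataFrame.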

I could take a stab at writing a patch for this if there's interest?

Issue Links

links to

GitHub Pull Request #4440

People

Assignee: Unassigned

Reporter: Robin Kåveland

Votes: 0

Watchers: 5

Dates

Created: 28/May/19 07:36

Updated: 11/Jan/23 07:40

Resolved: 03/Jun/19 14:12

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 50m