Apache Arrow / ARROW-5430

[Python] Can write but not read parquet partitioned on large ints


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version: 0.13.0
    • Fix Version: 0.14.0
    • Component: Python
    • Environment: Mac OS X 10.14.4, Python 3.7.1, x86_64

    Description

      Here's a contrived example that reproduces this issue using pandas:

      import numpy as np
      import pandas as pd

      real_usernames = np.array(['anonymize', 'me'])
      usernames = pd.util.hash_array(real_usernames)  # uint64 hashes
      login_count = [13, 9]
      df = pd.DataFrame({'user': usernames, 'logins': login_count})
      # The partitioned write succeeds...
      df.to_parquet('can_write.parq', partition_cols=['user'])
      # ...but reading it back fails
      pd.read_parquet('can_write.parq')
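
      The culprit is the dtype of the partition column: pd.util.hash_array returns uint64 values, and anything above the int64 maximum cannot be converted to a C long. A quick check (a sketch, not part of the original report):

      import numpy as np
      import pandas as pd

      hashes = pd.util.hash_array(np.array(['anonymize', 'me']))
      print(hashes.dtype)            # uint64
      print(np.iinfo(np.int64).max)  # 9223372036854775807
      # Hashes above that maximum are the ones that cannot round-trip
      print((hashes > np.iinfo(np.int64).max).any())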

      Expected behaviour:

      • Either the write fails
      • Or the read succeeds

      Actual behaviour: The read fails with the following error:

      Traceback (most recent call last):
        File "<stdin>", line 2, in <module>
        File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet
          return impl.read(path, columns=columns, **kwargs)
        File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read
          **kwargs).to_pandas()
        File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1152, in read_table
          use_pandas_metadata=use_pandas_metadata)
        File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py", line 181, in read_parquet
          use_pandas_metadata=use_pandas_metadata)
        File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1014, in read
          use_pandas_metadata=use_pandas_metadata)
        File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 587, in read
          dictionary = partitions.levels[i].dictionary
        File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 642, in dictionary
          dictionary = lib.array(integer_keys)
        File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
        File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
        File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
      pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to C long
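
      The error comes from the read path that rebuilds the partition dictionary (parquet.py line 642 above): the directory names are parsed back into Python ints and handed to pyarrow.lib.array, whose type inference in 0.13.0 targets a C long-backed integer type and overflows. A minimal sketch of the same conversion, assuming pyarrow 0.13:

      import pyarrow as pa

      # 2**63 is one past the int64 maximum; on pyarrow 0.13 the constructor's
      # type inference overflows with the same "Python int too large to
      # convert to C long" error as in the traceback above.
      pa.array([2**63])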

      I set the priority to Minor here because it's easy enough to work around this in user code (see the sketch below) unless you really need the 64-bit hash (and you probably shouldn't be partitioning on that anyway).
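
      One possible workaround, assuming pyarrow falls back to a string partition dictionary when the directory values do not parse as integers:

      # Hex strings fail int() parsing on the read side, so the partition keys
      # stay strings instead of overflowing a C long.
      df['user'] = df['user'].map(hex)
      df.to_parquet('can_write.parq', partition_cols=['user'])
      pd.read_parquet('can_write.parq')  # round-trips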

      I could take a stab at writing a patch for this if there's interest?

People

    Assignee: Unassigned
    Reporter: Robin Kåveland (kaaveland)
    Votes: 0
    Watchers: 5

Time Tracking

    Estimated: Not Specified
    Remaining: 0h
    Logged: 50m