  1. Apache Arrow
  2. ARROW-1446

Python: Writing more than 2^31 rows from pandas dataframe causes row count overflow error

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.7.0
    • Component/s: Python
    • Labels:
      None

      Description

      I have the following code:

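      import pyarrow
      import pyarrow.parquet as pq

      # Connect to HDFS; <host>, <port>, and <user> are placeholders.
      client = pyarrow.HdfsClient("<host>", <port>, "<user>", driver='libhdfs3')
      # Read the source file into an Arrow table.
      abc_table = client.read_parquet('<source parquet>', nthreads=16)
      # Round-trip through pandas.
      abc_df = abc_table.to_pandas()
      abc_table = pyarrow.Table.from_pandas(abc_df)
      # Write the table back to HDFS; the error below is raised here.
      with client.open('<target parquet>', 'wb') as f:
          pq.write_table(abc_table, f)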
      

      <source parquet> contains 2497301128 rows.

      During the write, however, I get the following error:

      Traceback (most recent call last):
        File "pyarrow_cluster.py", line 29, in <module>
          main()
        File "pyarrow_cluster.py", line 26, in main
          pq.write_table(nmi_table, f)
        File "<home dir>/miniconda2/envs/parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 796, in write_table
          writer.write_table(table, row_group_size=row_group_size)
        File "_parquet.pyx", line 663, in pyarrow._parquet.ParquetWriter.write_table
        File "error.pxi", line 72, in pyarrow.lib.check_status
      pyarrow.lib.ArrowIOError: Written rows: -1797666168 != expected rows: 2497301128in the current column chunk

      The reported number of written rows, -1797666168, equals 2497301128 - 2^32, which suggests the row count was stored in a 32-bit signed integer and overflowed.
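
      The arithmetic behind that observation can be checked with a small sketch. This is only an illustration of two's-complement wraparound, not the code path inside pyarrow or parquet-cpp; the helper name wrap_int32 is hypothetical and was chosen for this example.

```python
def wrap_int32(n):
    """Return the value n would take after being stored in a signed 32-bit integer.

    Hypothetical helper for illustration only; it emulates two's-complement
    wraparound with plain integer arithmetic.
    """
    return (n + 2**31) % 2**32 - 2**31


# Row count of the source file, as stated in the report.
expected_rows = 2497301128

# The same count after being truncated to 32 bits.
written_rows = wrap_int32(expected_rows)

print("expected rows:", expected_rows)
print("written rows after 32-bit wrap:", written_rows)
```

      The wrapped value matches the "Written rows" figure in the error message, which is consistent with a 32-bit counter somewhere in the write path.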


              People

              • Assignee:
                Wes McKinney (wesmckinn)
                Reporter:
                James Porritt (jporritt)
