PARQUET-1273

[Python] Error writing to partitioned Parquet dataset



    Description

      I receive the following error after upgrading to pyarrow 0.8.0 when writing to a dataset:

      • ArrowIOError: Column 3 had 187374 while previous column had 10000

      The command was:

      write_table_values = {'row_group_size': 10000}
      pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True),
                          '/logs/parsed/test',
                          partition_cols=['Product', 'year', 'month', 'day', 'hour'],
                          **write_table_values)

      I've also tried write_table_values = {'chunk_size': 10000} and received the same error.

      This same command works in pyarrow 0.7.1. I am trying to troubleshoot the problem myself but wanted to file a ticket in the meantime.
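
      For anyone trying to reproduce this, a minimal self-contained sketch of the failing call is below. The DataFrame here is made up (the real data is in the attached ARROW-1938-test-data.csv.gz) and the output path is a placeholder; only the keyword arguments and partition_cols match the report.

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      # Stand-in frame; the real data is the attached ARROW-1938-test-data.csv.gz.
      # Column names match the partition_cols from the failing command.
      df = pd.DataFrame({
          'Product': ['A', 'B'] * 50000,
          'year': 2018,
          'month': 2,
          'day': 14,
          'hour': 12,
          'value': range(100000),
      })

      write_table_values = {'row_group_size': 10000}

      # Same call shape as in the report; '/tmp/parsed/test' is a placeholder path.
      pq.write_to_dataset(
          pa.Table.from_pandas(df, preserve_index=True),
          '/tmp/parsed/test',
          partition_cols=['Product', 'year', 'month', 'day', 'hour'],
          **write_table_values)

      Since the identical call succeeds on pyarrow 0.7.1, the regression appears to be in how 0.8.0 forwards row_group_size through pq.write_to_dataset to the underlying write_table calls.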

      Attachments

        1. ARROW-1938.py
          0.7 kB
          Robert Dailey
        2. ARROW-1938-test-data.csv.gz
          2.28 MB
          Robert Dailey
        3. pyarrow_dataset_error.png
          263 kB
          Robert Dailey


            People

              Assignee: Joshua Storck
              Reporter: Robert Dailey
              Votes: 1
              Watchers: 5
