ARROW-2372: [Python] ArrowIOError: Invalid argument when reading Parquet file


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0, 0.9.0
    • Fix Version/s: 0.10.0
    • Component/s: Python
    • Labels:
      None
    • Environment:
      Ubuntu 16.04

      Description

      I get an ArrowIOError when reading a specific file that was also written by pyarrow. Specifically, the traceback is:

      >>> import pyarrow.parquet as pq
      >>> pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
      ---------------------------------------------------------------------------
      ArrowIOError                              Traceback (most recent call last)
      <ipython-input-18-149f11bf68a5> in <module>()
      ----> 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')

      ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
           62         self.reader = ParquetReader()
           63         source = _ensure_file(source)
      ---> 64         self.reader.open(source, metadata=metadata)
           65         self.common_metadata = common_metadata
           66         self._nested_paths_by_prefix = self._build_nested_paths()

      _parquet.pyx in pyarrow._parquet.ParquetReader.open()

      error.pxi in pyarrow.lib.check_status()

      ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument

      Here's a reproducible example with the specific file I'm working with. I'm converting a 34 GB CSV file to Parquet in chunks of roughly 2 GB each. To get the source data:

      wget https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
      unzip gaz2016zcta5distancemiles.csv.zip

      Then the basic idea from the pyarrow Parquet documentation is to instantiate the writer class, loop over chunks of the CSV while writing each one to Parquet, and then close the writer object.

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      from pathlib import Path
      
      zcta_file = Path('gaz2016zcta5distancemiles.csv')
      itr = pd.read_csv(
          zcta_file,
          header=0,
          dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
          engine='c',
          chunksize=64617153)
      
      schema = pa.schema([
          pa.field('zip1', pa.string()),
          pa.field('zip2', pa.string()),
          pa.field('mi_to_zcta5', pa.float64())])
      
      writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
      print(f'Starting conversion')
      
      i = 0
      for df in itr:
          i += 1
          print(f'Finished reading csv block {i}')
      
          table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
          writer.write_table(table)
      
          print(f'Finished writing parquet block {i}')
      
      writer.close()
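
      Incidentally, `writer.close()` is what writes the Parquet footer; if the loop had died partway through, the file would be unreadable for a different reason. To rule that out, the loop can be wrapped in try/finally so the footer is always written. A minimal sketch, reusing `itr` and `schema` from the script above:

      # Sketch: the same conversion loop, with try/finally guaranteeing that
      # writer.close() runs (and the footer gets written) even if a chunk fails.
      writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
      try:
          for df in itr:
              writer.write_table(pa.Table.from_pandas(df, preserve_index=False))
      finally:
          writer.close()  # writes the footer; a file missing it cannot be opened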
      

      Then running this Python script produces the file gaz2016zcta5distancemiles.parquet, but merely attempting to read the metadata with `pq.ParquetFile()` raises the exception shown above.
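
      For reference, on a file whose footer is readable, the same call succeeds and exposes the metadata that `open()` fails to load here. A small sketch of that inspection (only the filename is taken from above):

      import pyarrow.parquet as pq

      # On a readable file, ParquetFile parses the footer eagerly, so these
      # attributes are available as soon as the constructor returns.
      pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
      print(pf.metadata)                 # file-level footer: num_rows, row groups, ...
      print(pf.schema)                   # Parquet schema stored in the footer
      print(pf.metadata.num_row_groups)  # each write_table() call adds row group(s)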

      I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would complain while reading the CSV if the columns in the data were not `string`, `string`, and `float64`, so I think declaring the Parquet schema that way should be fine.
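
      That dtype assumption can also be checked directly: read a small chunk with the same dtypes and compare the Arrow schema pyarrow infers against the declared one. A sketch (the 1000-row chunk size is arbitrary):

      import numpy as np
      import pandas as pd
      import pyarrow as pa

      # Infer an Arrow schema from a small chunk read with the same dtypes,
      # then compare it against the schema declared for the writer above.
      chunk = next(pd.read_csv(
          'gaz2016zcta5distancemiles.csv',
          header=0,
          dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
          chunksize=1000))
      inferred = pa.Table.from_pandas(chunk, preserve_index=False).schema
      print(inferred.equals(schema))  # expect True; `schema` as defined above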

    People

    • Assignee: Antoine Pitrou (apitrou)
    • Reporter: Kyle Barron (kylebarron)
    • Votes: 0
    • Watchers: 6
