Apache Arrow / ARROW-6492

[Python] file written with latest fastparquet cannot be read with latest pyarrow

Details

    Description

      From a report on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/28252

      With the latest released versions of fastparquet (0.3.2) and pyarrow (0.14.1), a file written with pandas using the fastparquet engine cannot be read with the pyarrow engine:

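      import pandas as pd

      df = pd.DataFrame({'A': [1, 2, 3]})
      # Write with the fastparquet engine, then read back with the pyarrow engine
      df.to_parquet("test.parquet", engine="fastparquet", compression=None)
      pd.read_parquet("test.parquet", engine="pyarrow")  # raises TypeError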
      

      gives the following error when reading:

      ----> 1 pd.read_parquet("test.parquet", engine="pyarrow")
      
      ~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
          292 
          293     impl = get_engine(engine)
      --> 294     return impl.read(path, columns=columns, **kwargs)
      
      ~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
          123         kwargs["use_pandas_metadata"] = True
          124         result = self.api.parquet.read_table(
      --> 125             path, columns=columns, **kwargs
          126         ).to_pandas()
          127         if should_close:
      
      ~/miniconda3/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()
      
      ~/miniconda3/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()
      
      ~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata)
          642         column_indexes = pandas_metadata.get('column_indexes', [])
          643         index_descriptors = pandas_metadata['index_columns']
      --> 644         table = _add_any_metadata(table, pandas_metadata)
          645         table, index = _reconstruct_index(table, index_descriptors,
          646                                           all_columns)
      
      ~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
          965                 raw_name = 'None'
          966 
      --> 967         idx = schema.get_field_index(raw_name)
          968         if idx != -1:
          969             if col_meta['pandas_type'] == 'datetimetz':
      
      ~/miniconda3/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.get_field_index()
      
      ~/miniconda3/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()
      
      TypeError: expected bytes, dict found
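
      The traceback ends in schema.get_field_index(raw_name), which was handed a dict where a string name is expected. The sketch below shows one way such a value could be spotted in the pandas metadata before any schema lookup. It is pure Python and does not call pyarrow. The sample metadata (in particular the dict entry in "index_columns") and the helper find_non_string_names are hypothetical, chosen for illustration only; this report does not confirm which metadata entry is actually the dict.

```python
import json

# Hypothetical pandas metadata, shaped like the JSON stored with a Parquet
# file. The dict entry in "index_columns" is an assumption for illustration,
# not something confirmed by the report.
SAMPLE_METADATA = json.loads("""
{
  "index_columns": [
    {"kind": "range", "name": null, "start": 0, "stop": 3, "step": 1}
  ],
  "columns": [
    {"name": "A", "pandas_type": "int64"}
  ]
}
""")


def find_non_string_names(pandas_metadata):
    """Return (location, value) pairs for names that are not plain strings.

    A lookup of a schema field by name needs a string, so any value
    reported here would fail in such a lookup.
    """
    problems = []
    # Index entries: anything other than a string cannot be used as a name.
    for i, entry in enumerate(pandas_metadata.get("index_columns", [])):
        if not isinstance(entry, str):
            problems.append(("index_columns[%d]" % i, entry))
    # Column entries: prefer "field_name", fall back to "name".
    for i, col in enumerate(pandas_metadata.get("columns", [])):
        name = col.get("field_name", col.get("name"))
        if name is not None and not isinstance(name, str):
            problems.append(("columns[%d]" % i, name))
    return problems


problems = find_non_string_names(SAMPLE_METADATA)
for location, value in problems:
    print(location, "->", type(value).__name__)
```

      A check of this kind only locates the offending entry; how the reader should treat it (skip it, or rebuild the index from it) is a separate design decision.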
      

            People

              jorisvandenbossche Joris Van den Bossche