[ARROW-6492] [Python] file written with latest fastparquet cannot be read with latest pyarrow - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.15.0
Component/s: Python
Labels:
- parquet
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/22861

Description

From a report on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/28252

With the latest released versions of fastparquet (0.3.2) and pyarrow (0.14.1), a file written with pandas using the fastparquet engine cannot be read back with the pyarrow engine:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
df.to_parquet("test.parquet", engine="fastparquet", compression=None)  # write with fastparquet
pd.read_parquet("test.parquet", engine="pyarrow")  # read with pyarrow: raises TypeError

gives the following error when reading:

----> 1 pd.read_parquet("test.parquet", engine="pyarrow")

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
    292 
    293     impl = get_engine(engine)
--> 294     return impl.read(path, columns=columns, **kwargs)

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
    123         kwargs["use_pandas_metadata"] = True
    124         result = self.api.parquet.read_table(
--> 125             path, columns=columns, **kwargs
    126         ).to_pandas()
    127         if should_close:

~/miniconda3/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

~/miniconda3/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata)
    642         column_indexes = pandas_metadata.get('column_indexes', [])
    643         index_descriptors = pandas_metadata['index_columns']
--> 644         table = _add_any_metadata(table, pandas_metadata)
    645         table, index = _reconstruct_index(table, index_descriptors,
    646                                           all_columns)

~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
    965                 raw_name = 'None'
    966 
--> 967         idx = schema.get_field_index(raw_name)
    968         if idx != -1:
    969             if col_meta['pandas_type'] == 'datetimetz':

~/miniconda3/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.get_field_index()

~/miniconda3/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()

TypeError: expected bytes, dict found
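
The traceback shows that the value passed to schema.get_field_index() (raw_name in _add_any_metadata) was a dict rather than a string. As a rough diagnostic sketch, not the fix applied in pyarrow, the pandas metadata stored in a file can be scanned for name entries that are not plain strings. The helper find_non_string_names and the sample metadata below are hypothetical and only illustrate the idea; the sample assumes the usual "index_columns" / "columns" layout of the pandas metadata JSON and is not copied from the file in this report.

```python
import json


def find_non_string_names(pandas_metadata):
    """Return (location, value) pairs for name entries that are not strings.

    Hypothetical helper: it assumes the metadata has an "index_columns" list
    and a "columns" list whose items may carry "name" / "field_name" keys.
    """
    suspects = []
    # Index descriptors may be column names (str) or something else, e.g. a dict.
    for i, descriptor in enumerate(pandas_metadata.get("index_columns", [])):
        if not isinstance(descriptor, str):
            suspects.append((f"index_columns[{i}]", descriptor))
    # Column entries: flag any name that is present but is not a string.
    for i, col_meta in enumerate(pandas_metadata.get("columns", [])):
        for key in ("name", "field_name"):
            value = col_meta.get(key)
            if value is not None and not isinstance(value, str):
                suspects.append((f"columns[{i}].{key}", value))
    return suspects


# Illustrative metadata only (an assumption, not taken from test.parquet).
sample = json.loads("""
{
  "index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 3, "step": 1}],
  "columns": [{"name": "A", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}]
}
""")

for location, value in find_non_string_names(sample):
    print(location, "->", value)
```

For a real file, the same dictionary could be obtained with something like json.loads(pyarrow.parquet.read_schema("test.parquet").metadata[b"pandas"]), assuming the writer stored pandas metadata under that key.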

Issue Links

links to

GitHub Pull Request #5331

People

Assignee: Joris Van den Bossche
Reporter: Joris Van den Bossche
Votes: 0
Watchers: 3

Dates

Created: 09/Sep/19 12:39
Updated: 11/Jan/23 07:47
Resolved: 09/Sep/19 20:33

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 0.5h