Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 0.7.1
Description
Found this bug in the example in the pandas documentation (http://pandas-docs.github.io/pandas-docs-travis/io.html#parquet), which does:
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': list('abc'),
                       'b': list(range(1, 4)),
                       'c': np.arange(3, 6).astype('u1'),
                       'd': np.arange(4.0, 7.0, dtype='float64'),
                       'e': [True, False, True],
                       'f': pd.date_range('20130101', periods=3),
                       'g': pd.date_range('20130101', periods=3, tz='US/Eastern')})
    df.to_parquet('example_pa.parquet', engine='pyarrow')
    pd.read_parquet('example_pa.parquet', engine='pyarrow', columns=['a', 'b'])
and this raises on the last line, which reads a subset of the columns:

    ...
    /home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
        357     for i, col_meta in enumerate(pandas_metadata['columns']):
        358         if col_meta['pandas_type'] == 'datetimetz':
    --> 359             col = table[i]
        360             converted = col.to_pandas()
        361             tz = col_meta['metadata']['timezone']

    table.pxi in pyarrow.lib.Table.__getitem__()

    table.pxi in pyarrow.lib.Table.column()

    IndexError: Table column index 6 is out of range
This happens because the `pandas_metadata` is checked for all columns (here it tries to handle a datetime tz column), while the table that was actually read contains only a subset of those columns. In other words, there is a mismatch between the pandas metadata and the actual schema.
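The mismatch can be reproduced without pyarrow. The following is only a sketch: `metadata_columns` is a simplified, assumed stand-in for `pandas_metadata['columns']` of the frame above, `table_columns` is assumed to be what is left after reading with `columns=['a', 'b']`, and `positional_lookup` is a hypothetical helper that mimics the `table[i]` access in the traceback.

```python
# Simplified stand-in for pandas_metadata['columns'] of the seven-column frame;
# only the two fields used below are kept, and the values are assumed.
metadata_columns = [
    {"name": "a", "pandas_type": "unicode"},
    {"name": "b", "pandas_type": "int64"},
    {"name": "c", "pandas_type": "uint8"},
    {"name": "d", "pandas_type": "float64"},
    {"name": "e", "pandas_type": "bool"},
    {"name": "f", "pandas_type": "datetime"},
    {"name": "g", "pandas_type": "datetimetz"},
]

# Columns assumed to be present in the table after reading columns=['a', 'b'].
table_columns = ["a", "b"]


def positional_lookup(metadata_columns, table_columns):
    """Index the table by metadata position, like `table[i]` in the traceback."""
    found = []
    for i, col_meta in enumerate(metadata_columns):
        if col_meta["pandas_type"] == "datetimetz":
            # Raises IndexError when i >= len(table_columns).
            found.append(table_columns[i])
    return found


try:
    positional_lookup(metadata_columns, table_columns)
except IndexError as exc:
    print("IndexError:", exc)
```

The tz-aware column `g` sits at position 6 in the metadata, but the table holds only two columns, so the positional access fails in the same way as in the traceback.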
A smaller example without parquet:
    In [38]: df = pd.DataFrame({'a': [1, 2, 3], 'b': pd.date_range("2017-01-01", periods=3, tz='Europe/Brussels')})

    In [39]: table = pyarrow.Table.from_pandas(df)

    In [40]: table
    Out[40]:
    pyarrow.Table
    a: int64
    b: timestamp[ns, tz=Europe/Brussels]
    __index_level_0__: int64
    metadata
    --------
    {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
                b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
                b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
                b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
                b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
                b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
                b': "0.22.0.dev0+277.gd61f411"}'}

    In [41]: table.to_pandas()
    Out[41]:
       a                         b
    0  1 2017-01-01 00:00:00+01:00
    1  2 2017-01-02 00:00:00+01:00
    2  3 2017-01-03 00:00:00+01:00

    In [44]: table_without_tz = table.remove_column(1)

    In [45]: table_without_tz
    Out[45]:
    pyarrow.Table
    a: int64
    __index_level_0__: int64
    metadata
    --------
    {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
                b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
                b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
                b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
                b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
                b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
                b': "0.22.0.dev0+277.gd61f411"}'}

    In [46]: table_without_tz.to_pandas()    # <------ wrong output !
    Out[46]:
                                         a
    1970-01-01 01:00:00+01:00            1
    1970-01-01 01:00:00.000000001+01:00  2
    1970-01-01 01:00:00.000000002+01:00  3

    In [47]: table_without_tz2 = table_without_tz.remove_column(1)

    In [48]: table_without_tz2
    Out[48]:
    pyarrow.Table
    a: int64
    metadata
    --------
    {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
                b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
                b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
                b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
                b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
                b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
                b': "0.22.0.dev0+277.gd61f411"}'}

    In [49]: table_without_tz2.to_pandas()    # <------ error !
    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-49-c82f33476c6b> in <module>()
    ----> 1 table_without_tz2.to_pandas()

    table.pxi in pyarrow.lib.Table.to_pandas()

    /home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, memory_pool, nthreads)
        289 pandas_metadata = json.loads(metadata[b'pandas'].decode('utf8'))
        290 index_columns = pandas_metadata['index_columns']
    --> 291 table = _add_any_metadata(table, pandas_metadata)
        292
        293 block_table = table

    /home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
        357     for i, col_meta in enumerate(pandas_metadata['columns']):
        358         if col_meta['pandas_type'] == 'datetimetz':
    --> 359             col = table[i]
        360             converted = col.to_pandas()
        361             tz = col_meta['metadata']['timezone']

    table.pxi in pyarrow.lib.Table.__getitem__()

    table.pxi in pyarrow.lib.Table.column()

    IndexError: Table column index 1 is out of range
The reason is that `_add_any_metadata` does not check whether the column it is processing (currently only datetime tz columns need such processing) is actually present in the schema.
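One possible shape for such a check is to look columns up by name instead of by metadata position, and to skip metadata entries whose column is not in the schema. This is a minimal sketch of that idea, not the patch that was submitted: `tz_columns_in_schema` is a hypothetical helper, the schema is represented as a plain list of names, and the metadata dict is a trimmed copy of the one shown in the session above.

```python
def tz_columns_in_schema(pandas_metadata, schema_names):
    """Return (table_index, timezone) pairs for tz-aware columns that exist.

    Hypothetical helper: columns are matched by name, and metadata entries
    whose column is missing from the schema are skipped.
    """
    positions = {name: idx for idx, name in enumerate(schema_names)}
    result = []
    for col_meta in pandas_metadata["columns"]:
        if col_meta["pandas_type"] != "datetimetz":
            continue
        idx = positions.get(col_meta["name"])
        if idx is None:
            continue  # column was removed or never read, nothing to convert
        result.append((idx, col_meta["metadata"]["timezone"]))
    return result


# Trimmed copy of the metadata from the smaller example.
pandas_metadata = {
    "columns": [
        {"name": "a", "pandas_type": "int64", "metadata": None},
        {"name": "b", "pandas_type": "datetimetz",
         "metadata": {"timezone": "Europe/Brussels"}},
        {"name": "__index_level_0__", "pandas_type": "int64", "metadata": None},
    ],
    "index_columns": ["__index_level_0__"],
}

# Schemas matching table, table_without_tz and table_without_tz2.
full = tz_columns_in_schema(pandas_metadata, ["a", "b", "__index_level_0__"])
without_tz = tz_columns_in_schema(pandas_metadata, ["a", "__index_level_0__"])
only_a = tz_columns_in_schema(pandas_metadata, ["a"])
print(full, without_tz, only_a)
```

With a name-based lookup, the two reduced schemas yield no tz-aware columns to convert, so neither the wrong conversion of `__index_level_0__` nor the `IndexError` would be triggered by this step.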
Working on a fix, will submit a PR.