Details
- Bug
- Status: Closed
- Minor
- Resolution: Won't Fix
- 0.13.0
- None
- None
Description
The schema for an empty table/dataframe still includes the index as an integer column instead of serializing it solely as a metadata reference (see ARROW-1639).
In the example below, the empty dataframe still holds `__index_level_0__` as an integer column. The proper behavior would be to exclude it and reference the index information in the pandas metadata, as is the case for a non-empty dataframe.
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: non_empty = pd.DataFrame({"col": [1]})

In [4]: empty = non_empty.drop(0)

In [5]: empty
Out[5]:
Empty DataFrame
Columns: [col]
Index: []

In [6]: pa.Table.from_pandas(non_empty)
Out[6]:
pyarrow.Table
col: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": [{"kind": "range", "name": null, "start": '
              b'0, "stop": 1, "step": 1}], "column_indexes": [{"name": null,'
              b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
              b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
              b'{"name": "col", "field_name": "col", "pandas_type": "int64",'
              b' "numpy_type": "int64", "metadata": null}], "creator": {"lib'
              b'rary": "pyarrow", "version": "0.13.0"}, "pandas_version": nu'
              b'll}')])

In [7]: pa.Table.from_pandas(empty)
Out[7]:
pyarrow.Table
col: int64
__index_level_0__: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": ["__index_level_0__"], "column_indexes": ['
              b'{"name": null, "field_name": null, "pandas_type": "unicode",'
              b' "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}]'
              b', "columns": [{"name": "col", "field_name": "col", "pandas_t'
              b'ype": "int64", "numpy_type": "int64", "metadata": null}, {"n'
              b'ame": null, "field_name": "__index_level_0__", "pandas_type"'
              b': "int64", "numpy_type": "int64", "metadata": null}], "creat'
              b'or": {"library": "pyarrow", "version": "0.13.0"}, "pandas_ve'
              b'rsion": null}')])

In [8]: pa.__version__
Out[8]: '0.13.0'

In [9]: ! python --version
Python 3.6.7
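The difference between the two outputs comes down to the `index_columns` entry of the `pandas` metadata: a dict with `"kind": "range"` for the non-empty dataframe, versus a plain column name for the empty one. The following is a minimal sketch of how that entry could be inspected using only the standard library. The helper name `describe_index_columns` and the trimmed metadata blobs are illustrative assumptions, not part of pyarrow; the blobs keep only the `index_columns` portion of Out[6] and Out[7].

```python
import json


def describe_index_columns(pandas_metadata: bytes) -> list:
    """Classify each "index_columns" entry of a b'pandas' metadata blob.

    Hypothetical helper for illustration: a dict entry with kind "range"
    means the index lives only in the metadata, while a string entry names
    a real column that was written into the table.
    """
    meta = json.loads(pandas_metadata.decode("utf-8"))
    kinds = []
    for entry in meta["index_columns"]:
        if isinstance(entry, dict) and entry.get("kind") == "range":
            kinds.append("range-metadata")
        else:
            kinds.append("column:" + str(entry))
    return kinds


# Trimmed copies of the "index_columns" entries shown in Out[6] and Out[7].
NON_EMPTY_META = (
    b'{"index_columns": [{"kind": "range", "name": null, '
    b'"start": 0, "stop": 1, "step": 1}]}'
)
EMPTY_META = b'{"index_columns": ["__index_level_0__"]}'

print(describe_index_columns(NON_EMPTY_META))  # ['range-metadata']
print(describe_index_columns(EMPTY_META))      # ['column:__index_level_0__']
```

With a real table, the same blob should be available as `table.schema.metadata[b"pandas"]`, assuming the table was created by `pa.Table.from_pandas`.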
Issue Links
- is related to: ARROW-5427 [Python] RangeIndex serialization change implications (Resolved)