Apache Arrow / ARROW-6999

[Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.0, 0.13.0, 0.14.0, 0.15.0
    • Fix Version/s: 0.16.0
    • Component/s: Python
    • Environment:
      pandas==0.23.4
      pyarrow==0.15.0 # Issue also occurs with 0.14.0, 0.13.0 and 0.12.0, but not with 0.11.0

    Description

      Steps to reproduce:

      1. Generate any DataFrame's pyarrow Schema using Table.from_pandas
      2. Pass the generated schema as input into Table.from_pandas
      3. The second call raises KeyError: '__index_level_0__'

      We did not have this issue with pyarrow==0.11.0, which we used to write many partitions across years. Our goal now is to use pyarrow==0.15.0 and produce schemas going forward that are backwards compatible (i.e. that also contain '__index_level_0__'), so that we do not need to regenerate all prior years' partitions when we migrate to 0.15.0.

      We cannot set preserve_index=False, since that drops '__index_level_0__' from the schema, making it inconsistent with the earlier partitions that were written using pyarrow==0.11.0.

       

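      import pandas as pd
      import pyarrow as pa

      # An empty DataFrame is enough to trigger the error.
      df = pd.DataFrame()
      # In the reported environment, the inferred schema contains '__index_level_0__'.
      schema = pa.Table.from_pandas(df).schema
      # Passing that same schema back in raises the KeyError shown below.
      pa_table = pa.Table.from_pandas(df, schema=schema)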
      
      
      Traceback (most recent call last):
        File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
          return self._engine.get_loc(key)
        File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
        File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
        File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
        File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
      KeyError: '__index_level_0__'
      During handling of the above exception, another exception occurred:
      Traceback (most recent call last):
        File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 408, in _get_columns_to_convert_given_schema
          col = df[name]
        File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
          return self._getitem_column(key)
        File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
          return self._get_item_cache(key)
        File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
          values = self._data.get(item)
        File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
          loc = self.items.get_loc(item)
        File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
          return self._engine.get_loc(self._maybe_cast_indexer(key))
        File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
        File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
        File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
        File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
      KeyError: '__index_level_0__'
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
          exec(code_obj, self.user_global_ns, self.user_ns)
        File "<ipython-input-36-6711a2fcec96>", line 5, in <module>
          pa_table = pa.Table.from_pandas(df, schema=pa.Table.from_pandas(df).schema)
        File "pyarrow/table.pxi", line 1057, in pyarrow.lib.Table.from_pandas
        File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 517, in dataframe_to_arrays
          columns)
        File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 337, in _get_columns_to_convert
          return _get_columns_to_convert_given_schema(df, schema, preserve_index)
        File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 426, in _get_columns_to_convert_given_schema
          "in the columns or index".format(name))
      KeyError: "name '__index_level_0__' present in the specified schema is not found in the columns or index"
      

      Attachments

        1. test3.hdf
          6.43 MB
          Tom Goodman

            People

              Assignee: Joris Van den Bossche (jorisvandenbossche)
              Reporter: Tom Goodman (goodiegoodman)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 50m