Apache Arrow / ARROW-1883

[Python] BUG: Table.to_pandas metadata checking fails if columns are not present

    Description

      I found this bug while running the example from the pandas documentation (http://pandas-docs.github.io/pandas-docs-travis/io.html#parquet), which does:

      import numpy as np
      import pandas as pd

      df = pd.DataFrame({'a': list('abc'),
                         'b': list(range(1, 4)),
                         'c': np.arange(3, 6).astype('u1'),
                         'd': np.arange(4.0, 7.0, dtype='float64'),
                         'e': [True, False, True],
                         'f': pd.date_range('20130101', periods=3),
                         'g': pd.date_range('20130101', periods=3, tz='US/Eastern')})
      
      df.to_parquet('example_pa.parquet', engine='pyarrow')
      
      pd.read_parquet('example_pa.parquet', engine='pyarrow', columns=['a', 'b'])
      

      The last line, which reads only a subset of the columns, raises:

      ...
      /home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
          357     for i, col_meta in enumerate(pandas_metadata['columns']):
          358         if col_meta['pandas_type'] == 'datetimetz':
      --> 359             col = table[i]
          360             converted = col.to_pandas()
          361             tz = col_meta['metadata']['timezone']
      
      table.pxi in pyarrow.lib.Table.__getitem__()
      
      table.pxi in pyarrow.lib.Table.column()
      
      IndexError: Table column index 6 is out of range
      

      This happens because `_add_any_metadata` checks the `pandas_metadata` for every column it lists (here, while trying to handle a datetime tz column), even though only a subset of those columns is present in the table: the pandas metadata and the actual schema no longer match.
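
The mismatch can be illustrated without pyarrow. The following is a minimal sketch, assuming a hand-written approximation of the stored metadata (the `pandas_type` values are illustrative and not copied from a real file) and a hypothetical helper `tz_positions_out_of_range`. It shows that the positional index of the datetime tz column in the metadata points past the end of a table that only holds columns 'a' and 'b'.

```python
# Hand-written approximation of the stored pandas metadata for the example
# DataFrame (assumption: not copied from a real file; only the keys used
# below are included).
pandas_metadata = {
    "columns": [
        {"name": "a", "pandas_type": "unicode", "metadata": None},
        {"name": "b", "pandas_type": "int64", "metadata": None},
        {"name": "c", "pandas_type": "uint8", "metadata": None},
        {"name": "d", "pandas_type": "float64", "metadata": None},
        {"name": "e", "pandas_type": "bool", "metadata": None},
        {"name": "f", "pandas_type": "datetime", "metadata": None},
        {"name": "g", "pandas_type": "datetimetz",
         "metadata": {"timezone": "US/Eastern"}},
    ]
}

# Columns actually present after reading with columns=['a', 'b'].
schema_names = ["a", "b"]


def tz_positions_out_of_range(schema_names, pandas_metadata):
    """Hypothetical helper: list (position, name) for datetimetz metadata
    entries whose position in the metadata does not exist in the table."""
    return [
        (i, col_meta["name"])
        for i, col_meta in enumerate(pandas_metadata["columns"])
        if col_meta["pandas_type"] == "datetimetz" and i >= len(schema_names)
    ]


# Metadata entries that have no matching column in the table at all.
missing = [
    col_meta["name"]
    for col_meta in pandas_metadata["columns"]
    if col_meta["name"] not in schema_names
]

out_of_range = tz_positions_out_of_range(schema_names, pandas_metadata)
print(missing)
print(out_of_range)
```

Position 6 in the result corresponds to the index reported in the `IndexError` above.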

      A smaller example without parquet:

      In [38]: df = pd.DataFrame({'a': [1, 2, 3], 'b': pd.date_range("2017-01-01", periods=3, tz='Europe/Brussels')})
      
      In [39]: table = pyarrow.Table.from_pandas(df)
      
      In [40]: table
      Out[40]: 
      pyarrow.Table
      a: int64
      b: timestamp[ns, tz=Europe/Brussels]
      __index_level_0__: int64
      metadata
      --------
      {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
                  b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
                  b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
                  b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
                  b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
                  b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
                  b': "0.22.0.dev0+277.gd61f411"}'}
      
      In [41]: table.to_pandas()
      Out[41]: 
         a                         b
      0  1 2017-01-01 00:00:00+01:00
      1  2 2017-01-02 00:00:00+01:00
      2  3 2017-01-03 00:00:00+01:00
      
      In [44]: table_without_tz = table.remove_column(1)
      
      In [45]: table_without_tz
      Out[45]: 
      pyarrow.Table
      a: int64
      __index_level_0__: int64
      metadata
      --------
      {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
                  b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
                  b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
                  b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
                  b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
                  b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
                  b': "0.22.0.dev0+277.gd61f411"}'}
      
      In [46]: table_without_tz.to_pandas()          # <------ wrong output !
      Out[46]: 
                                           a
      1970-01-01 01:00:00+01:00            1
      1970-01-01 01:00:00.000000001+01:00  2
      1970-01-01 01:00:00.000000002+01:00  3
      
      In [47]: table_without_tz2 = table_without_tz.remove_column(1)
      
      In [48]: table_without_tz2
      Out[48]: 
      pyarrow.Table
      a: int64
      metadata
      --------
      {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
                  b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
                  b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
                  b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
                  b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
                  b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
                  b': "0.22.0.dev0+277.gd61f411"}'}
      
      In [49]: table_without_tz2.to_pandas()     # <------ error !
      ---------------------------------------------------------------------------
      IndexError                                Traceback (most recent call last)
      <ipython-input-49-c82f33476c6b> in <module>()
      ----> 1 table_without_tz2.to_pandas()
      
      table.pxi in pyarrow.lib.Table.to_pandas()
      
      /home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, memory_pool, nthreads)
          289         pandas_metadata = json.loads(metadata[b'pandas'].decode('utf8'))
          290         index_columns = pandas_metadata['index_columns']
      --> 291         table = _add_any_metadata(table, pandas_metadata)
          292 
          293     block_table = table
      
      /home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
          357     for i, col_meta in enumerate(pandas_metadata['columns']):
          358         if col_meta['pandas_type'] == 'datetimetz':
      --> 359             col = table[i]
          360             converted = col.to_pandas()
          361             tz = col_meta['metadata']['timezone']
      
      table.pxi in pyarrow.lib.Table.__getitem__()
      
      table.pxi in pyarrow.lib.Table.column()
      
      IndexError: Table column index 1 is out of range
      

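      The reason is that `_add_any_metadata` does not check whether the column it is processing (currently only datetime tz columns need such processing) is actually present in the schema. It indexes the table by the column's position in the metadata, so it either picks up an unrelated column (the wrong output above, where the int64 index is converted as if it were the tz-aware column) or runs past the end of the table (the `IndexError`).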
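
One possible direction, sketched here in plain Python rather than as the actual pyarrow change, is to look columns up by name and skip metadata entries that have no counterpart in the schema. The helper name `tz_columns_present` is hypothetical; the metadata dict is the one printed in the smaller example, reduced to the keys the sketch uses.

```python
# Metadata from the smaller example above, reduced to the keys used here.
pandas_metadata = {
    "columns": [
        {"name": "a", "pandas_type": "int64", "metadata": None},
        {"name": "b", "pandas_type": "datetimetz",
         "metadata": {"timezone": "Europe/Brussels"}},
        {"name": "__index_level_0__", "pandas_type": "int64",
         "metadata": None},
    ],
    "index_columns": ["__index_level_0__"],
}


def tz_columns_present(schema_names, pandas_metadata):
    """Hypothetical helper: map each datetimetz column that is actually in
    the schema to (position in the schema, timezone)."""
    found = {}
    for col_meta in pandas_metadata["columns"]:
        if col_meta["pandas_type"] != "datetimetz":
            continue
        name = col_meta["name"]
        if name not in schema_names:
            # Column was removed or never read: nothing to convert.
            continue
        # Use the position in the schema, not the position in the metadata.
        found[name] = (schema_names.index(name),
                       col_meta["metadata"]["timezone"])
    return found


# The three tables from the session above, represented by their column names.
full = tz_columns_present(["a", "b", "__index_level_0__"], pandas_metadata)
without_tz = tz_columns_present(["a", "__index_level_0__"], pandas_metadata)
without_tz2 = tz_columns_present(["a"], pandas_metadata)

print(full)
print(without_tz)
print(without_tz2)
```

With a lookup like this, the two tables that no longer contain column 'b' yield no columns to convert, so neither the wrong index nor the `IndexError` would be triggered by this step.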

      Working on a fix, will submit a PR.

            People

              jorisvandenbossche Joris Van den Bossche