Apache Arrow / ARROW-12314

[Python] pq.read_pandas with use_legacy_dataset=False does not accept columns as a set (kartothek integration failure)


Details

    Description

      The kartothek nightly integration builds started to fail (https://github.com/ursacomputing/crossbow/runs/2303373464), presumably because of ARROW-11464 (https://github.com/apache/arrow/pull/9910).

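      It seems that the new ParquetDatasetV2 (what you get with use_legacy_dataset=False) handles the columns argument slightly differently: its read method concatenates columns with a list, which raises a TypeError when columns is a set, as kartothek passes it.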

      Example failure:

      _____________________ test_add_column_to_existing_index[4] _____________________
      
      store_factory = functools.partial(<function get_store_from_url at 0x7faf12e9d0e0>, 'hfs:///tmp/pytest-of-root/pytest-0/test_add_column_to_existing_in1/store')
      metadata_version = 4
      bound_build_dataset_indices = <function build_dataset_indices at 0x7faf0c509830>
      
          def test_add_column_to_existing_index(
              store_factory, metadata_version, bound_build_dataset_indices
          ):
              dataset_uuid = "dataset_uuid"
              partitions = [
                  pd.DataFrame({"p": [1, 2], "x": [100, 4500]}),
                  pd.DataFrame({"p": [4, 3], "x": [500, 10]}),
              ]
          
              dataset = store_dataframes_as_dataset(
                  dfs=partitions,
                  store=store_factory,
                  dataset_uuid=dataset_uuid,
                  metadata_version=metadata_version,
                  secondary_indices="p",
              )
              assert dataset.load_all_indices(store=store_factory()).indices.keys() == {"p"}
          
              # Create indices
      >       bound_build_dataset_indices(store_factory, dataset_uuid, columns=["x"])
      
      kartothek/io/testing/index.py:88: 
      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
      /opt/conda/envs/arrow/lib/python3.7/site-packages/decorator.py:231: in fun
          return caller(func, *(extras + args), **kw)
      kartothek/io_components/utils.py:277: in normalize_args
          return _wrapper(*args, **kwargs)
      kartothek/io_components/utils.py:275: in _wrapper
          return function(*args, **kwargs)
      kartothek/io/eager.py:706: in build_dataset_indices
          mp = mp.load_dataframes(store=ds_factory.store, columns=cols_to_load,)
      kartothek/io_components/metapartition.py:150: in _impl
          method_return = method(mp, *method_args, **method_kwargs)
      kartothek/io_components/metapartition.py:696: in load_dataframes
          date_as_object=dates_as_object,
      kartothek/serialization/_generic.py:122: in restore_dataframe
          date_as_object=date_as_object,
      kartothek/serialization/_parquet.py:302: in restore_dataframe
          date_as_object=date_as_object,
      kartothek/serialization/_parquet.py:249: in _restore_dataframe
          table = pq.read_pandas(reader, columns=columns)
      /opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/parquet.py:1768: in read_pandas
          source, columns=columns, use_pandas_metadata=True, **kwargs
      /opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/parquet.py:1730: in read_table
          use_pandas_metadata=use_pandas_metadata)
      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
      
      self = <pyarrow.parquet._ParquetDatasetV2 object at 0x7faee1ed9550>
      columns = {'x'}, use_threads = True, use_pandas_metadata = True
      
          def read(self, columns=None, use_threads=True, use_pandas_metadata=False):
              """
              Read (multiple) Parquet files as a single pyarrow.Table.
          
              Parameters
              ----------
              columns : List[str]
                  Names of columns to read from the dataset. The partition fields
                  are not automatically included (in contrast to when setting
                  ``use_legacy_dataset=True``).
              use_threads : bool, default True
                  Perform multi-threaded column reads.
              use_pandas_metadata : bool, default False
                  If True and file has custom pandas schema metadata, ensure that
                  index columns are also loaded.
          
              Returns
              -------
              pyarrow.Table
                  Content of the file as a table (of columns).
              """
              # if use_pandas_metadata, we need to include index columns in the
              # column selection, to be able to restore those in the pandas DataFrame
              metadata = self.schema.metadata
              if columns is not None and use_pandas_metadata:
                  if metadata and b'pandas' in metadata:
                      # RangeIndex can be represented as dict instead of column name
                      index_columns = [
                          col for col in _get_pandas_index_columns(metadata)
                          if not isinstance(col, dict)
                      ]
      >               columns = columns + list(set(index_columns) - set(columns))
      E               TypeError: unsupported operand type(s) for +: 'set' and 'list'
      
      /opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/parquet.py:1598: TypeError
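
The failing line can be reproduced without pyarrow, since it comes down to adding a set and a list. The sketch below first triggers the same TypeError and then shows one possible way to make the column merge accept any iterable by converting it to a list first. The helper name `merge_index_columns` is hypothetical and is not part of pyarrow; this is an illustration of the idea, not the fix that was actually applied upstream.

```python
def merge_index_columns(columns, index_columns):
    """Hypothetical helper, not a pyarrow function.

    Returns the requested columns as a list, followed by any pandas index
    columns that were not already requested. Accepts a list, tuple, or set.
    """
    columns = list(columns)  # normalize so that list concatenation works
    missing = [col for col in index_columns if col not in columns]
    return columns + missing


# Reproduce the error from the traceback: a set cannot be added to a list.
try:
    {"x"} + ["__index_level_0__"]
except TypeError as exc:
    error_message = str(exc)

# With normalization, a set of column names is handled like a list.
merged_from_set = merge_index_columns({"x"}, ["__index_level_0__"])
merged_from_list = merge_index_columns(["x", "p"], ["p", "__index_level_0__"])
```

Until such normalization exists in the reader, a caller could pass `list(columns)` instead of a set; note that a set has no defined order, so the resulting column order is then arbitrary when more than one column is requested.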
      

            People

              jorisvandenbossche Joris Van den Bossche

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h