Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
The kartothek nightly integration builds started to fail (https://github.com/ursacomputing/crossbow/runs/2303373464), I assume because of ARROW-11464 (https://github.com/apache/arrow/pull/9910).
It seems that in the new ParquetDatasetV2 (what you get with use_legacy_dataset=False), the handling of the columns argument is slightly different: kartothek passes a set of column names, which the legacy implementation accepted but which now raises a TypeError.
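A minimal reproduction outside of kartothek, as a sketch (the file path is hypothetical; it assumes a pyarrow build that includes ARROW-11464, so that read_pandas goes through the new dataset implementation by default):

import pandas as pd
import pyarrow.parquet as pq

# Write a small file carrying pandas schema metadata (path is hypothetical).
pd.DataFrame({"p": [1, 2], "x": [100, 4500]}).to_parquet("/tmp/part.parquet")

# A set of column names worked with use_legacy_dataset=True, but the new
# code path concatenates it with a list of index columns and raises:
# TypeError: unsupported operand type(s) for +: 'set' and 'list'
pq.read_pandas("/tmp/part.parquet", columns={"x"})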
Example failure:
_____________________ test_add_column_to_existing_index[4] _____________________

store_factory = functools.partial(<function get_store_from_url at 0x7faf12e9d0e0>, 'hfs:///tmp/pytest-of-root/pytest-0/test_add_column_to_existing_in1/store')
metadata_version = 4
bound_build_dataset_indices = <function build_dataset_indices at 0x7faf0c509830>

    def test_add_column_to_existing_index(
        store_factory, metadata_version, bound_build_dataset_indices
    ):
        dataset_uuid = "dataset_uuid"
        partitions = [
            pd.DataFrame({"p": [1, 2], "x": [100, 4500]}),
            pd.DataFrame({"p": [4, 3], "x": [500, 10]}),
        ]
        dataset = store_dataframes_as_dataset(
            dfs=partitions,
            store=store_factory,
            dataset_uuid=dataset_uuid,
            metadata_version=metadata_version,
            secondary_indices="p",
        )
        assert dataset.load_all_indices(store=store_factory()).indices.keys() == {"p"}

        # Create indices
>       bound_build_dataset_indices(store_factory, dataset_uuid, columns=["x"])

kartothek/io/testing/index.py:88:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/conda/envs/arrow/lib/python3.7/site-packages/decorator.py:231: in fun
    return caller(func, *(extras + args), **kw)
kartothek/io_components/utils.py:277: in normalize_args
    return _wrapper(*args, **kwargs)
kartothek/io_components/utils.py:275: in _wrapper
    return function(*args, **kwargs)
kartothek/io/eager.py:706: in build_dataset_indices
    mp = mp.load_dataframes(store=ds_factory.store, columns=cols_to_load,)
kartothek/io_components/metapartition.py:150: in _impl
    method_return = method(mp, *method_args, **method_kwargs)
kartothek/io_components/metapartition.py:696: in load_dataframes
    date_as_object=dates_as_object,
kartothek/serialization/_generic.py:122: in restore_dataframe
    date_as_object=date_as_object,
kartothek/serialization/_parquet.py:302: in restore_dataframe
    date_as_object=date_as_object,
kartothek/serialization/_parquet.py:249: in _restore_dataframe
    table = pq.read_pandas(reader, columns=columns)
/opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/parquet.py:1768: in read_pandas
    source, columns=columns, use_pandas_metadata=True, **kwargs
/opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/parquet.py:1730: in read_table
    use_pandas_metadata=use_pandas_metadata)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pyarrow.parquet._ParquetDatasetV2 object at 0x7faee1ed9550>
columns = {'x'}, use_threads = True, use_pandas_metadata = True

    def read(self, columns=None, use_threads=True, use_pandas_metadata=False):
        """
        Read (multiple) Parquet files as a single pyarrow.Table.

        Parameters
        ----------
        columns : List[str]
            Names of columns to read from the dataset. The partition fields
            are not automatically included (in contrast to when setting
            ``use_legacy_dataset=True``).
        use_threads : bool, default True
            Perform multi-threaded column reads.
        use_pandas_metadata : bool, default False
            If True and file has custom pandas schema metadata, ensure that
            index columns are also loaded.

        Returns
        -------
        pyarrow.Table
            Content of the file as a table (of columns).
        """
        # if use_pandas_metadata, we need to include index columns in the
        # column selection, to be able to restore those in the pandas DataFrame
        metadata = self.schema.metadata
        if columns is not None and use_pandas_metadata:
            if metadata and b'pandas' in metadata:
                # RangeIndex can be represented as dict instead of column name
                index_columns = [
                    col for col in _get_pandas_index_columns(metadata)
                    if not isinstance(col, dict)
                ]
>               columns = columns + list(set(index_columns) - set(columns))
E               TypeError: unsupported operand type(s) for +: 'set' and 'list'

/opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/parquet.py:1598: TypeError