Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
The business logic of the Python implementation for reading partitioned parquet datasets in pyarrow.parquet.ParquetDataset has been ported to C++ (ARROW-3764), and the ported implementation can optionally be enabled in ParquetDataset(..) by passing use_legacy_dataset=False (ARROW-8039).
But the question remains: what do we do with this class long term?
So for users who now do:
    dataset = pq.ParquetDataset(...)
    dataset.metadata
    table = dataset.read()
what should they do in the future?
Do we keep a class like this (but backed by the pyarrow.dataset implementation), or do we deprecate the class entirely, pointing users to `dataset = ds.dataset(..., format="parquet")`?
In any case, we should strive to delete the current custom Python implementation entirely, but we could keep a ParquetDataset class that wraps or inherits from pyarrow.dataset.FileSystemDataset and adds some parquet specifics to it (e.g. access to the parquet schema and the common metadata, and exposing the parquet-specific constructor keywords more easily).
Features the ParquetDataset currently has that are not exactly covered by pyarrow.dataset:
- Partitioning information (the `.partitions` attribute)
- Access to common metadata (the `.metadata_path`, `.common_metadata_path`, and `.metadata` attributes)
- The `ParquetSchema` of the dataset
Attachments
Issue Links
- is a parent of:
  - ARROW-16119 [Python] Deprecate the legacy ParquetDataset custom python-based implementation (Reopened)
- relates to:
  - ARROW-15725 [Python] Legacy dataset can't roundtrip Int64 with nulls if partitioned (Open)
  - ARROW-8047 [Python][Documentation] Document migration from ParquetDataset to pyarrow.datasets (Open)
  - ARROW-15868 [Python] Remove the legacy ParquetDataset custom python-based implementation (Open)