Apache Arrow / ARROW-9720

[Python] Long-term fate of pyarrow.parquet.ParquetDataset


Description

      The business logic of the Python implementation for reading partitioned Parquet datasets in pyarrow.parquet.ParquetDataset has been ported to C++ (ARROW-3764), and can now optionally be enabled in ParquetDataset(...) by passing use_legacy_dataset=False (ARROW-8039).
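
      For reference, a minimal sketch of opting in to the new implementation (the path is a placeholder for illustration):

      import pyarrow.parquet as pq

      # Opt in to the C++ datasets-based implementation (ARROW-8039).
      dataset = pq.ParquetDataset("path/to/dataset", use_legacy_dataset=False)
      table = dataset.read()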

      But the question remains: what do we do with this class in the long term?

      So for users who now do:

      import pyarrow.parquet as pq

      dataset = pq.ParquetDataset(...)   # current Python-based implementation
      dataset.metadata                   # parquet metadata of the dataset
      table = dataset.read()
      

      what should they do in the future?
      Do we keep a class like this (but backed by the pyarrow.dataset implementation), or do we deprecate the class entirely, pointing users to `dataset = ds.dataset(..., format="parquet")`?
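
      For comparison, a minimal sketch of the pyarrow.dataset route (the path and partitioning flavour are assumptions for illustration):

      import pyarrow.dataset as ds

      # Discover a (possibly partitioned) parquet dataset and read it into a table.
      dataset = ds.dataset("path/to/dataset", format="parquet", partitioning="hive")
      table = dataset.to_table()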

      In any case, we should strive to entirely delete the current custom Python implementation, but we could keep a ParquetDataset class that wraps or inherits from pyarrow.dataset.FileSystemDataset and adds some parquet specifics to it (e.g. access to the parquet schema, the common metadata, exposing the parquet-specific constructor keywords more easily, ...).
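
      As a purely hypothetical sketch of the wrapping approach (class name, attributes and keywords are illustrative, not a proposed API):

      import pyarrow.dataset as ds

      class ParquetDataset:
          # Hypothetical thin wrapper delegating to the new datasets implementation.

          def __init__(self, path, **kwargs):
              self._dataset = ds.dataset(path, format="parquet", **kwargs)

          @property
          def schema(self):
              # Arrow schema of the unified dataset.
              return self._dataset.schema

          def read(self, columns=None, filter=None):
              # Familiar read() entry point, backed by Dataset.to_table().
              return self._dataset.to_table(columns=columns, filter=filter)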

      Features that ParquetDataset currently has and that are not exactly covered by pyarrow.dataset (see the sketch after this list):

      • Partitioning information (the .partitions attribute)
      • Access to common metadata (.metadata_path, .common_metadata_path and .metadata attributes)
      • ParquetSchema of the dataset
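
      For context, a sketch of how these attributes are accessed with the current implementation (the path is a placeholder):

      import pyarrow.parquet as pq

      dataset = pq.ParquetDataset("path/to/dataset")
      dataset.partitions             # discovered partition keys/levels
      dataset.metadata_path          # path to the _metadata file, if present
      dataset.common_metadata_path   # path to the _common_metadata file, if present
      dataset.schema                 # ParquetSchema of the dataset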

            People

              Assignee: Unassigned
              Reporter: Joris Van den Bossche