[ARROW-10131] [C++][Dataset] Lazily parse parquet metadata / statistics in ParquetDatasetFactory and ParquetFileFragment - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0
Component/s: C++
Labels:

External issue URL:
https://github.com/apache/arrow/issues/26143

Description

Related to ~~ARROW-9730~~: parsing the statistics in Parquet metadata is expensive and should therefore be avoided when possible.

For example, ParquetDatasetFactory (ds.parquet_dataset() in Python) parses all statistics of all files and all columns. For a filtered read, however, you might need the statistics of only certain files (e.g. if a filter on a partition field has already excluded many files) and only certain columns (e.g. the columns you are actually filtering on).
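To give a rough sense of the gap, here is a minimal pure-Python sketch that counts how many per-column statistics entries are touched in each case. It does not use Arrow at all: `FileInfo`, `eager_stat_count` and `selective_stat_count` are hypothetical names made up for illustration, and it simplifies by counting one statistics entry per file and column (real Parquet files store statistics per row group).

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    """Hypothetical stand-in for one Parquet file in a partitioned dataset."""
    path: str
    partition: dict   # partition field -> value, known from the path alone
    columns: tuple    # names of the columns stored in the file


def eager_stat_count(files):
    """Entries parsed when every column of every file is parsed up front."""
    return sum(len(f.columns) for f in files)


def selective_stat_count(files, partition_filter, filter_columns):
    """Entries needed when partition pruning runs first and only the
    columns used in the filter have their statistics parsed."""
    survivors = [f for f in files if partition_filter(f.partition)]
    wanted = set(filter_columns)
    return sum(len(wanted & set(f.columns)) for f in survivors)


# 100 files partitioned by year (2000..2099), 50 columns each.
columns = tuple(f"c{i}" for i in range(50))
files = [
    FileInfo(f"year={2000 + i}/part-0.parquet", {"year": 2000 + i}, columns)
    for i in range(100)
]

eager = eager_stat_count(files)
# Partition filter keeps 10 files; the row filter only touches column "c3".
selective = selective_stat_count(files, lambda p: p["year"] >= 2090, {"c3"})
print(eager, selective)  # 5000 10
```

The numbers are invented, but the shape of the problem is the same: the eager cost grows with files times columns, while the filtered read only needs the surviving files times the filtered columns.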

The current API is somewhat all-or-nothing: both ParquetDatasetFactory and a later EnsureCompleteMetadata parse all statistics, and neither allows parsing a subset or parsing only the other (non-statistics) metadata. I think we should try to come up with better abstractions.
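One possible shape for such an abstraction, sketched in pure Python just to make the idea concrete: a fragment that keeps the undecoded statistics around and only parses a column the first time it is asked for. `LazyFragment`, `statistics()` and `may_match()` are hypothetical names, not the Arrow C++ or pyarrow API, and the "raw" statistics are plain strings standing in for the encoded footer.

```python
class LazyFragment:
    """Hypothetical fragment that parses column statistics on demand."""

    def __init__(self, path, raw_stats):
        self.path = path
        # column name -> (min, max) as strings; stands in for undecoded metadata
        self._raw_stats = raw_stats
        self._parsed = {}      # cache of already parsed columns
        self.parse_calls = 0   # how many columns were actually parsed

    def statistics(self, column):
        """Parse (and cache) the statistics of a single column."""
        if column not in self._parsed:
            self.parse_calls += 1
            lo, hi = self._raw_stats[column]
            self._parsed[column] = (int(lo), int(hi))
        return self._parsed[column]

    def may_match(self, column, lower_bound):
        """Return False only if the statistics prove that no row
        satisfies `column >= lower_bound`."""
        _, hi = self.statistics(column)
        return hi >= lower_bound


frag = LazyFragment("a.parquet", {"x": ("0", "9"), "y": ("100", "200")})
keep = frag.may_match("x", 5)    # max of x is 9, so rows may match
skip = frag.may_match("x", 10)   # max of x is 9, so no row can match
```

After both calls only column `x` has been parsed, and only once; `y` is never touched. A real design would also need to decide where the non-statistics metadata (schema, row group layout) lives and when that gets parsed.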

cc rjzamora bkietz

Issue Links

links to

GitHub Pull Request #8507

People

Assignee: Ben Kietzman

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 2

Dates

Created: 29/Sep/20 12:32

Updated: 11/Jan/23 08:11

Resolved: 29/Oct/20 20:51

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2h 50m