Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Labels: None
Description
Partitioned Parquet datasets sometimes come with `_metadata` / `_common_metadata` files. Those files include information about the schema of the full dataset and, in the case of `_metadata`, potentially all RowGroup metadata as well.
Using those files during the creation of a parquet Dataset can give a more efficient factory: the stored schema can be used instead of inferring the schema by unioning the schemas of all files, and the stored file paths can be used instead of crawling the directory.
Basically, from those files, the schema, the list of file paths, and the partition expressions (the information needed to create a Dataset) could be constructed.
Such logic could be put in a separate factory class, e.g. `ParquetManifestFactory` (as suggested by fsaintjacques).
Issue Links
- relates to
  - ARROW-8446 [Python][Dataset] Detect and use _metadata file in a list of file paths (Open)
  - ARROW-8874 [C++][Dataset] Scanner::ToTable race when ScanTask exit early with an error (Closed)
  - ARROW-2079 [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't available (Open)
  - ARROW-3244 [Python] Multi-file parquet loading without scan (Resolved)
  - ARROW-8733 [C++][Dataset][Python] ParquetFileFragment should provide access to parquet FileMetadata (Resolved)