[ARROW-3244] [Python] Multi-file parquet loading without scan - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.0
Component/s: Python
Labels:

External issue URL:
https://github.com/apache/arrow/issues/19586

Description

A number of mechanism are possible to avoid having to access and read the parquet footers in a data set consisting of a number of files. In the case of a large number of data files (perhaps split with directory partitioning) and remote storage, this can be a significant overhead. This is significant from the point of view of Dask, which must have the metadata available in the client before setting up computational graphs.

Here are some suggestions of what could be done.

some parquet writing frameworks include a `_metadata` file, which contains all the information from the footers of the various files. If this file is present, then this data can be read from one place, with a single file access. For a large number of files, parsing the thrift information may, by itself, be a non-negligible overhead≥
the schema (dtypes) can be found in a `_common_metadata`, or from any one of the data-files, then the schema could be assumed (perhaps at the user's option) to be the same for all of the files. However, the information about the directory partitioning would not be available. Although Dask may infer the information from the filenames, it would be preferable to go through the machinery with parquet-cpp, and view the whole data-set as a single object. Note that the files will still need to have the footer read to access the data, for the bytes offsets, but from Dask's point of view, this would be deferred to tasks running in parallel.

(please forgive that some of this has already been mentioned elsewhere; this is one of the entries in the list at https://github.com/dask/fastparquet/issues/374 as a feature that is useful in fastparquet)

Attachments

Issue Links

is related to

ARROW-8062 [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Martin Durant

Votes:: 1 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 16/Sep/18 23:48

Updated:: 11/Jan/23 07:26

Resolved:: 29/Aug/22 14:13