Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
From https://github.com/dask/dask/pull/6047#discussion_r402391318
When specifying a directory to ParquetDataset, we detect whether a _metadata file is present in the directory and use it to populate the metadata attribute (without including this file in the list of "pieces", since it does not contain any data).
However, when passing a list of files to ParquetDataset, one of which is "_metadata", the metadata attribute is not populated, and the "_metadata" path is instead included as one of the ParquetDatasetPiece objects (which leads to an ArrowIOError when that piece is read).
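To illustrate the asymmetry, a minimal reproduction sketch (the directory layout and file names are hypothetical):

{code:python}
import pyarrow.parquet as pq

# Case 1: passing the directory. The _metadata file is detected,
# used to populate the metadata attribute, and excluded from pieces.
dataset = pq.ParquetDataset("data/")
print(dataset.metadata)      # FileMetaData read from data/_metadata
print(len(dataset.pieces))   # counts only the actual data files

# Case 2: passing an explicit list of paths. The _metadata file is
# treated as a regular data file instead.
dataset = pq.ParquetDataset(["data/part-0.parquet", "data/_metadata"])
print(dataset.metadata)      # None -- not populated
dataset.read()               # ArrowIOError when reading the _metadata piece
{code}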
We could detect the _metadata file in a list of paths as well, as sketched below.
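A rough sketch of what that detection could look like (split_metadata_paths is a hypothetical helper for illustration, not an existing API):

{code:python}
import os
import pyarrow.parquet as pq

def split_metadata_paths(paths):
    """Separate a _metadata sidecar file from the actual data files."""
    data_paths, metadata = [], None
    for path in paths:
        if os.path.basename(path) == "_metadata":
            # The sidecar holds dataset-wide FileMetaData and no row data,
            # so read it for the metadata attribute and skip it as a piece.
            metadata = pq.read_metadata(path)
        else:
            data_paths.append(path)
    return data_paths, metadata
{code}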
Note: I mentioned ParquetDataset above, but whoever works on this should probably implement it directly in the datasets-API-based version.
Also, I labeled this as Python and not C++ for now, since this might be something that can be handled on the Python side (once the C++ side knows how to process this kind of metadata, see ARROW-8062).
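For reference, once that lands, the datasets API could construct a dataset directly from the sidecar file. A sketch assuming the factory is exposed in Python as pyarrow.dataset.parquet_dataset() (the name that was eventually used; the path is hypothetical):

{code:python}
import pyarrow.dataset as ds

# Build the dataset from the _metadata file alone: the file listing,
# schema, and row group statistics all come from the sidecar.
dataset = ds.parquet_dataset("data/_metadata")
table = dataset.to_table()
{code}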
Issue Links
- is related to:
  - ARROW-2079 [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't available (Open)
  - ARROW-3154 [Python][C++] Document how to write _metadata, _common_metadata files with Parquet datasets (Resolved)
  - ARROW-8062 [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file (Resolved)
- relates to:
  - ARROW-2801 [Python][C++][Dataset] Implement split_row_groups for ParquetDataset (Closed)