[ARROW-15943] [C++] Filter which files to be read in as part of filesystem, filtered using a string - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: C++
Labels:
- dataset
- good-second-issue

External issue URL:
https://github.com/apache/arrow/issues/31370

Description

A user reported (see the Stack Overflow post [1]) that they used the basename_template parameter to write files to a dataset, some with the prefix "summary" and others with the prefix "prediction". The data is saved in partitioned directories. When reading the data back in, they want to choose which subset (predictions vs. summaries) to read, in addition to filtering on the dataset's partition variables.

This isn't currently possible; if they try to open a dataset with a list of files, they cannot read it in as partitioned data.

A short-term solution is to suggest they change the structure of how their data is stored, but it could be useful to be able to pass in some sort of filter to determine which files get read in as a dataset.
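As a rough illustration of the kind of filter being asked for, here is a minimal Python sketch that uses only the standard library, not the Arrow API. It assumes hive-style `key=value` directory names (the report only says "partitioned directories"), and the paths and the helper names `select_files` and `partition_values` are hypothetical. It keeps only the files whose basename starts with a given string and still recovers the partition values from each path:

```python
from pathlib import PurePosixPath


def select_files(paths, prefix):
    """Keep only the paths whose file name starts with `prefix`."""
    return [p for p in paths if PurePosixPath(p).name.startswith(prefix)]


def partition_values(path, base_dir):
    """Parse hive-style `key=value` directory segments below `base_dir`."""
    relative = PurePosixPath(path).relative_to(base_dir)
    values = {}
    for segment in relative.parent.parts:
        key, sep, value = segment.partition("=")
        if sep:  # ignore directories that are not `key=value`
            values[key] = value
    return values


# Hypothetical layout: two kinds of files share the same partition directories.
paths = [
    "data/year=2021/summary-0.parquet",
    "data/year=2021/prediction-0.parquet",
    "data/year=2022/summary-0.parquet",
    "data/year=2022/prediction-0.parquet",
]

selected = select_files(paths, "prediction")
partitions = [partition_values(p, "data") for p in selected]
```

A filter built into the dataset discovery step would do the equivalent of `select_files` before partition parsing, so the selected files would still be read as partitioned data.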

[1] https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r

People

Assignee: Unassigned

Reporter: Nicola Crane

Votes: 1

Watchers: 5

Dates

Created: 15/Mar/22 17:21

Updated: 11/Jan/23 11:40