[ARROW-9325] [C++][Dataset][Python] ParquetDataset typecast on read - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.17.1
Fix Version/s: None
Component/s: C++, Python
Labels:
- dataset

External issue URL:
https://github.com/apache/arrow/issues/25411

Description

When reading large Parquet tables, it would be useful to have the option to cast columns to a different type. Consider a large table with double-precision (64-bit) types such as float64 and int64: the user might prefer to read these in as single-precision (32-bit) types if the full width is not required.

Current behavior: One must first read the entire table with its on-disk types and then cast it.

Desired behavior: Provide an additional keyword argument that allows the user to specify a target schema. This would be propagated through to ParquetFileFragment, so that each fragment can be cast as soon as it is read.

Impact: In cases where the user wants to cast all columns to single precision and the dataset has many partitions, this feature would reduce peak memory usage by roughly 50%.
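
The rough 50% figure can be checked with back-of-the-envelope arithmetic. The model below is an assumption made for illustration, not a measurement: it counts only fixed-width value buffers (8 bytes wide, 4 bytes narrow), assumes equally sized fragments, and ignores validity bitmaps, variable-width columns, and allocator overhead. The row, column, and fragment counts are arbitrary.

```python
def peak_read_then_cast(total_rows, n_cols, wide_bytes=8):
    """Lower bound on peak memory: the full 64-bit table must exist before any cast."""
    return total_rows * n_cols * wide_bytes


def peak_cast_per_fragment(total_rows, n_cols, n_fragments,
                           wide_bytes=8, narrow_bytes=4):
    """Approximate peak: all data in narrow form plus one fragment still in 64-bit form."""
    rows_per_fragment = total_rows / n_fragments
    narrow_total = total_rows * n_cols * narrow_bytes
    one_wide_fragment = rows_per_fragment * n_cols * wide_bytes
    return narrow_total + one_wide_fragment


baseline = peak_read_then_cast(1_000_000, 10)                         # 80,000,000 bytes
streamed = peak_cast_per_fragment(1_000_000, 10, n_fragments=100)     # 40,800,000 bytes
reduction = 1 - streamed / baseline
print(f"{reduction:.0%}")  # 49%
```

Under these assumptions the saving approaches 50% as the number of fragments grows, since the single wide fragment becomes a negligible share of the total.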

--------------

I've already implemented a proof of concept using the old Dataset API. I can reimplement it using the v2 dataset API and then submit a patch.

A couple of questions:

1. Does this feature fit in with the Arrow roadmap?

2. Alternatively, is there a way to accomplish this already in v0.17 that I am missing?

People

Assignee: Unassigned

Reporter: Carson Eisenach

Votes: 0

Watchers: 3

Dates

Created: 04/Jul/20 19:38

Updated: 11/Jan/23 08:06