Details
New Feature
Status: Open
Major
Resolution: Unresolved
0.17.1
None
Description
When reading large Parquet tables, it would be useful to have the option to cast columns to a different type. Consider a large table with 64-bit types (float64 and int64): the user might prefer to read these in as 32-bit types (float32 and int32) if the extra precision or range is not required.
Current behavior: the whole table must be read first and then cast.
Desired behavior: provide an additional kwarg that allows the user to specify a target schema. This would be propagated through to ParquetFileFragment, so that each fragment can be cast as soon as it is read.
Impact: In cases where the user wants to cast all columns from 64-bit to 32-bit types and the dataset has many partitions, this feature would reduce the peak memory required by roughly 50%.
--------------
I've already implemented a POC using the old Dataset API, and can reimplement using the v2 dataset API, and then submit a patch.
A couple of questions:
1. Does this feature fit in with the Arrow roadmap?
2. Alternatively, is there a way to accomplish this already in v0.17 that I am missing?