Apache Arrow / ARROW-9325

[C++][Dataset][Python] ParquetDataset typecast on read



    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.17.1
    • Fix Version/s: None
    • Component/s: C++, Python


      When reading large Parquet tables, it would be useful to have the option to cast columns to a different type. Consider a large table with 64-bit columns (float64 and int64): the user might prefer to read them as 32-bit types (float32 and int32) if full 64-bit precision is not required.

      Current behavior: one must first read the full table and then cast it.

      Desired behavior: provide an additional kwarg that lets the user specify a target schema. The schema would be propagated through to ParquetFileFragment, and each fragment would be cast as soon as it is read.

      Impact: in cases where the user casts all columns from 64-bit to 32-bit types and the dataset has many partitions, this feature would reduce the peak memory required by roughly 50%.


      I've already implemented a POC using the old Dataset API, and I can reimplement it using the v2 Dataset API and then submit a patch.

      A couple questions:

      1. Does this feature fit in with the Arrow roadmap?

      2. Alternatively, is there a way to accomplish this already in v0.17 that I am missing?




            Assignee: Unassigned
            Reporter: Carson Eisenach (ceisenach)