Description
The Spark SQL filter operation is a common workload for selecting specific rows from persisted data. Spark's current implementation requires all read values to be materialized (i.e., decompressed, decoded, etc.) into memory before the filters are applied. As a result, the filters may ultimately discard many of those values, wasting the computation spent materializing them. Alternatively, evaluating the filters first and lazily materializing only the values that survive them avoids this waste and improves read performance. Lazy materialization is already employed by other distributed SQL engines such as Velox and Presto/Trino, but it has not yet been brought to Spark's Parquet read path. A toy sketch of the idea is included below, after the SPIP link.
SPIP: https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
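The following is a minimal, self-contained sketch of the idea using toy types; the names and counters are illustrative only and are not Spark or Parquet internals. The eager path decodes every column for every row before filtering, while the lazy path decodes only the filter column up front, builds a selection of surviving row indices, and decodes the other column just for those rows:

object LazyMaterializationSketch {

  // A toy "encoded" column; decode() stands in for the per-value
  // decompression/decoding work, and decodeCount makes the saved work visible.
  final class EncodedColumn(values: Array[Int]) {
    var decodeCount = 0
    def decode(row: Int): Int = { decodeCount += 1; values(row) }
  }

  def main(args: Array[String]): Unit = {
    val n = 10000
    def filterValues = Array.tabulate(n)(i => i % 100)
    def dataValues   = Array.tabulate(n)(i => i * 2)

    // Eager (current behavior): materialize both columns for all rows,
    // then apply the predicate, discarding most of the decoded values.
    val eagerFilter = new EncodedColumn(filterValues)
    val eagerData   = new EncodedColumn(dataValues)
    val eagerResult = (0 until n)
      .map(i => (eagerFilter.decode(i), eagerData.decode(i)))
      .collect { case (k, v) if k == 42 => v }

    // Lazy: evaluate the predicate on the filter column first, then
    // materialize the data column only for the rows that pass.
    val lazyFilter = new EncodedColumn(filterValues)
    val lazyData   = new EncodedColumn(dataValues)
    val selection  = (0 until n).filter(i => lazyFilter.decode(i) == 42)
    val lazyResult = selection.map(lazyData.decode)

    assert(eagerResult == lazyResult)
    // For this selective predicate the lazy path decodes ~1% of the data
    // column (100 decodes vs. 10000).
    println(s"data-column decodes: eager=${eagerData.decodeCount}, lazy=${lazyData.decodeCount}")
  }
}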
Issue Links
- is related to SPARK-36527: Implement lazy materialization for the vectorized Parquet reader (Open)