Description
I observe that certain DataFrame transformations (e.g. sample) cause the underlying relation's support for column pruning to be ignored in subsequent queries.
I encountered this issue while using an ML pipeline with a typical dataset of (label, features). For my particular data source (which implements PrunedScan), the 'features' column is expensive to compute while the 'label' column is cheap. The first stage of the pipeline, StringIndexer, operates only on the label and so should be quick. Yet I found that the 'features' column was materialized anyway. Upon investigation, I found that the issue occurs when the dataset is split into train/test sets by sampling: the sampling transformation causes the pruning optimization to be lost.
See this gist for a sample program demonstrating the issue:
https://gist.github.com/EronWright/cb5fb9af46fd810194f8
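For readers who cannot open the gist, the following is a condensed sketch of the kind of reproduction described above. It is not copied from the gist. It assumes a Spark 1.x-era data source API (SQLContext and org.apache.spark.sql.sources), and the names LoggingRelation and PruningDemo, the string values, and the sampling fraction and seed are all hypothetical. The relation prints the columns passed to buildScan so the two queries can be compared.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation: logs the columns Spark asks it to materialize.
class LoggingRelation(@transient val sqlContext: SQLContext)
    extends BaseRelation with PrunedScan {

  override def schema: StructType = StructType(Seq(
    StructField("label", StringType),
    StructField("features", StringType)))

  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    // Runs on the driver, so the output shows what the planner requested.
    println("buildScan columns: " + requiredColumns.mkString(", "))
    val cols = requiredColumns // local copy keeps `this` out of the closure
    sqlContext.sparkContext.parallelize(1 to 10).map { i =>
      Row.fromSeq(cols.map {
        case "label"    => "label-" + (i % 2)
        case "features" => "features-" + i // stands in for the expensive column
      }.toSeq)
    }
  }
}

object PruningDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("pruning-demo")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.baseRelationToDataFrame(new LoggingRelation(sqlContext))

    // Query 1: select only the label directly from the relation.
    df.select("label").collect()

    // Query 2: the same selection after sampling (illustrative fraction and seed).
    df.sample(false, 0.5, 42L).select("label").collect()

    sc.stop()
  }
}
```

If the behavior matches this report, the first query should request only the 'label' column, while the query after sample should request 'features' as well. Calling explain(true) on each DataFrame is another way to compare the plans.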