[SPARK-8794] Column pruning isn't applied beneath sample - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4.0
Fix Version/s: 1.5.0
Component/s: SQL
Labels: None
Target Version/s: 1.5.0

Description

I observe that certain DataFrame transformations (e.g. sample) cause the underlying relation's support for column pruning to be ignored in subsequent queries.

I encountered this issue while using an ML pipeline with a typical dataset of (label, features). For my particular data source (which implements PrunedScan), the 'features' column is expensive to compute while the 'label' column is cheap. The first stage of the pipeline, StringIndexer, operates only on the label and so should be quick. Yet I found that the 'features' column was still materialized. On investigation, I found that the issue occurs when the dataset is split into train/test sets with sampling: the sampling transformation causes the pruning optimization to be lost.

See this gist for a sample program demonstrating the issue:
https://gist.github.com/EronWright/cb5fb9af46fd810194f8
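
The mechanism described above can be illustrated with a small standalone model. This is only a sketch and does not use Spark: the names Relation, Sample, Project, scanned_columns and push_project_through_sample are hypothetical stand-ins invented for illustration, and the "naive planner" is an assumption about why pruning is lost, not a copy of Spark's planner. The rewrite at the end is one possible remedy and is not necessarily what the linked pull request implements.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Relation:
    """Stands in for a data source that supports column pruning."""
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Sample:
    """Row sampling; it keeps every column of its child."""
    fraction: float
    child: "Plan"


@dataclass(frozen=True)
class Project:
    """Column selection, comparable to df.select(...)."""
    columns: Tuple[str, ...]
    child: "Plan"


Plan = Union[Relation, Sample, Project]


def scanned_columns(plan: Plan) -> Tuple[str, ...]:
    """Columns the source is asked to produce under a naive planner.

    Assumption: pruning is applied only when a Project sits directly
    on top of the Relation; any other node hides the projection.
    """
    if isinstance(plan, Relation):
        return plan.columns
    if isinstance(plan, Project) and isinstance(plan.child, Relation):
        return tuple(c for c in plan.child.columns if c in plan.columns)
    return scanned_columns(plan.child)


def push_project_through_sample(plan: Plan) -> Plan:
    """Rewrite Project(Sample(x)) as Sample(Project(x)), recursively."""
    if isinstance(plan, Relation):
        return plan
    if isinstance(plan, Project) and isinstance(plan.child, Sample):
        sample = plan.child
        pushed = Project(plan.columns, sample.child)
        return Sample(sample.fraction, push_project_through_sample(pushed))
    if isinstance(plan, Project):
        return Project(plan.columns, push_project_through_sample(plan.child))
    return Sample(plan.fraction, push_project_through_sample(plan.child))


# A (label, features) source, sampled for a train split, then only
# 'label' is selected, as the first pipeline stage would do.
source = Relation(("label", "features"))
query = Project(("label",), Sample(0.8, source))

print(scanned_columns(query))  # ('label', 'features')
print(scanned_columns(push_project_through_sample(query)))  # ('label',)
```

In this model the Sample node sits between the projection and the relation, so the source is asked for both columns even though only 'label' is used. Swapping the two nodes is safe here because plain column selection does not change which rows are sampled.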

Issue Links

links to

[Github] Pull Request #7228 (viirya)

People

Assignee: L. C. Hsieh
Reporter: Eron Wright
Shepherd: Michael Armbrust
Votes: 0
Watchers: 3

Dates

Created: 02/Jul/15 15:24
Updated: 17/Jul/15 22:55
Resolved: 07/Jul/15 22:49