SPARK-8794: Column pruning isn't applied beneath sample

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None

    Description

      I observe that certain transformations on a DataFrame (e.g. sample) cause the underlying relation's support for column pruning to be disregarded in subsequent queries.

      I encountered this issue while using an ML pipeline with a typical dataset of (label, features). For my particular data source (which implements PrunedScan), the 'features' column is expensive to compute while the 'label' column is cheap. The first stage of the pipeline - StringIndexer - operates only on the label and so should be quick. Yet I found that the 'features' column would be materialized. Upon investigation, the issue occurs when the dataset is split into train/test with sampling. The sampling transformation causes the pruning optimization to be lost.
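
The behaviour described above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's Catalyst optimizer or the data sources API: the names PrunedScanRelation, build_scan, plan_scan and prune_beneath_sample are hypothetical and exist only for this illustration. The model assumes the planner passes a list of required columns down the plan, and that the list is dropped when a sample node is reached.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PrunedScanRelation:
    """Hypothetical stand-in for a data source that supports column pruning."""
    columns: List[str]
    requested: List[List[str]] = field(default_factory=list)  # log of scans

    def build_scan(self, required: List[str]) -> None:
        # Record which columns the engine asked this source to produce.
        self.requested.append(list(required))


@dataclass
class Sample:
    child: object
    fraction: float


@dataclass
class Project:
    child: object
    columns: List[str]


def plan_scan(plan, required=None, prune_beneath_sample=False):
    """Walk the plan top-down and tell the relation which columns to read.

    required=None means no pruning information is available, so the
    relation is asked for every column.
    """
    if isinstance(plan, Project):
        # A projection narrows the required columns to its own output.
        plan_scan(plan.child, plan.columns, prune_beneath_sample)
    elif isinstance(plan, Sample):
        # Assumed model of the reported behaviour: the required-column
        # list is lost at the sample node unless the flag is set.
        passed = required if prune_beneath_sample else None
        plan_scan(plan.child, passed, prune_beneath_sample)
    elif isinstance(plan, PrunedScanRelation):
        plan.build_scan(plan.columns if required is None else required)
    else:
        raise TypeError(f"unsupported plan node: {type(plan).__name__}")


relation = PrunedScanRelation(columns=["label", "features"])
query = Project(Sample(relation, fraction=0.8), ["label"])

plan_scan(query)                             # behaviour described in the report
plan_scan(query, prune_beneath_sample=True)  # behaviour the reporter expected

for scan in relation.requested:
    print(scan)
```

In this model the first call asks the relation for both columns, so the expensive 'features' column would be computed even though only 'label' is selected. The second call asks for 'label' alone.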

      See this gist for a sample program demonstrating the issue:
      https://gist.github.com/EronWright/cb5fb9af46fd810194f8

          People

            Assignee: L. C. Hsieh (viirya)
            Reporter: Eron Wright (eronwright)
            Shepherd: Michael Armbrust
