Spark / SPARK-17059

Allow FileFormat to specify partition pruning strategy

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete

    Description

      Allow Spark to support pluggable pruning of input files for FileSourceScanExec by letting each FileFormat specify a format-specific filterPartitions method.
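
      The proposed hook can be illustrated with a minimal sketch. It is written in Python for brevity and is not Spark's actual API: `DataFile`, `FileFormat`, `SkipEmptyFilesFormat`, `filter_partitions`, and `plan_scan` are hypothetical names, and the real method signature was never settled in this issue. The scan delegates pruning to the format, and the default implementation keeps every file.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    path: str
    size_bytes: int


class FileFormat:
    """Hypothetical base class; not Spark's actual FileFormat trait."""

    def filter_partitions(self, files: List[DataFile]) -> List[DataFile]:
        # Default: no format-specific knowledge, so keep every file.
        return list(files)


class SkipEmptyFilesFormat(FileFormat):
    """Hypothetical format that knows zero-length files hold no rows."""

    def filter_partitions(self, files: List[DataFile]) -> List[DataFile]:
        return [f for f in files if f.size_bytes > 0]


def plan_scan(fmt: FileFormat, files: List[DataFile]) -> List[str]:
    # The scan asks the format which files to read instead of
    # hard-coding one pruning strategy for every format.
    return [f.path for f in fmt.filter_partitions(files)]


files = [DataFile("part-0", 128), DataFile("part-1", 0)]
default_plan = plan_scan(FileFormat(), files)
pruned_plan = plan_scan(SkipEmptyFilesFormat(), files)
```

      Keeping a pass-through default means formats that have no cheaper way to prune are unaffected by the new hook.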

      This is especially useful for Parquet: Spark does not currently use the summary metadata and instead reads the footer of every part file in a Parquet data source. Pruning from the summary metadata can lead to massive speedups when reading a filtered chunk of a dataset, especially on remote storage such as S3.
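
      As a rough illustration of the Parquet case, the sketch below prunes files with per-file min/max statistics that have been read once from a summary, rather than opening each footer. It assumes the summary metadata exposes such statistics for every file. `ColumnStats`, `may_contain`, `prune_files`, and the in-memory `summary` dictionary are hypothetical stand-ins, not Parquet or Spark APIs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnStats:
    minimum: int
    maximum: int


def may_contain(stats: ColumnStats, low: int, high: int) -> bool:
    # A file can match only if its [min, max] range overlaps [low, high].
    return stats.minimum <= high and stats.maximum >= low


def prune_files(
    summary: Dict[str, Dict[str, ColumnStats]],
    column: str,
    low: int,
    high: int,
) -> List[str]:
    kept = []
    for path, columns in summary.items():
        stats = columns.get(column)
        # Missing statistics: keep the file, since it cannot be ruled out.
        if stats is None or may_contain(stats, low, high):
            kept.append(path)
    return kept


# Hypothetical summary: per-file statistics for an integer "id" column.
summary = {
    "part-0.parquet": {"id": ColumnStats(0, 99)},
    "part-1.parquet": {"id": ColumnStats(100, 199)},
    "part-2.parquet": {},
}
selected = prune_files(summary, "id", 120, 150)
```

      Only files whose range overlaps the filter, or whose statistics are unknown, remain in the scan. On remote storage, each skipped file is a footer read that never happens.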

          People

            Assignee: Unassigned
            Reporter: Andrew Duffy (andreweduffy)
            Votes: 0
            Watchers: 5
