Spark / SPARK-40955

Allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.1
    • Fix Version/s: None
    • Component/s: Input/Output, SQL
    • Labels: None

    Description

      Overall

      Allow `FileScanBuilder` to push `Predicate` instead of `Filter` for data filters pushed down to the source. This would allow new (arbitrary) DSV2 `Predicate`s to be pushed down to the file source.

      Hello Spark developers,

      Thank you in advance for reading. Please excuse any mistakes; this is my first time working on apache/spark internals. I am asking these questions to better understand whether my proposed changes fall within the intended scope of the Data Source V2 API.

      Motivation / Background:

      I am working on a branch in apache/incubator-sedona to extend its support for GeoParquet files to include predicate pushdown of PostGIS-style spatial predicates (e.g. `ST_Contains()`) that can take advantage of spatial info in file metadata. We would like to inherit as much as possible from the Parquet classes (because GeoParquet essentially just adds a binary geometry column). However, FileScanBuilder.scala appears to be missing some functionality I need for DSV2 `Predicate`s.

      My understanding of the problem so far:

      The `ST_*` expression must be detected as a pushable predicate (ParquetScanBuilder.scala:71) and passed as a pushed data filter to the `ParquetPartitionReaderFactory`, where it is translated into a (user-defined) `FilterPredicate`.

      The `Filter` class is sealed, so the Sedona package cannot define new `Filter`s; the DSV2 `Predicate` appears to be the preferred mechanism for this (referred to as a “V2 Filter” in SPARK-39966). However, `pushedDataFilters` in FileScanBuilder.scala is of type `sources.Filter`.
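      To illustrate the constraint above, here is a minimal sketch with toy stand-in types (all names here are hypothetical, not Spark's real classes): a sealed hierarchy must enumerate every filter in its defining file, while a name-based predicate in the style of the DSV2 connector `Predicate` can represent arbitrary operators without new subclasses.

```scala
// Toy stand-ins (hypothetical names, NOT Spark's real classes).

// Analogous to org.apache.spark.sql.sources.Filter: sealed, so every
// supported filter must be declared in the same file.
sealed trait ToyFilter
case class ToyEqualTo(attribute: String, value: Any) extends ToyFilter
// A third-party package cannot write
//   case class ToyStContains(...) extends ToyFilter
// outside this file -- the compiler rejects extending a sealed trait.

// Analogous to a DSV2-style predicate: identified by a name plus children,
// so new predicates need no new subclasses at all.
case class ToyPredicate(name: String, children: Seq[String])

object SealedVsNamed {
  def main(args: Array[String]): Unit = {
    // A hypothetical spatial predicate, expressed without touching ToyFilter:
    val spatial = ToyPredicate("ST_CONTAINS", Seq("geom", "POINT (1 2)"))
    println(spatial.name)
  }
}
```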

      Some recent work (SPARK-39139) added the ability to detect user-defined functions in `DataSourceV2Strategy.translateFilterV2()` > `V2ExpressionBuilder.generateExpression()`, which I think could perform the detection correctly if FileScanBuilder called `DataSourceV2Strategy.translateFilterV2()` instead of `DataSourceStrategy.translateFilter()`.
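      A rough mini-model of the difference between the two translation paths (all names below are made up for illustration; this is not Spark source): the V1-style path only maps a fixed set of operators, so anything it does not recognize is silently dropped from the pushdown, while a V2-style path can carry an unrecognized operator through as a named predicate.

```scala
// Hypothetical mini-model of the two translation paths (not Spark source).
case class ToyExpr(op: String, args: Seq[String])

object TranslationPaths {
  // V1-style translateFilter: only a fixed set of operators map into the
  // sealed Filter hierarchy; anything else is simply not pushed down.
  def translateFilterV1(e: ToyExpr): Option[String] = e.op match {
    case "=" | "<" | ">" => Some(s"Filter(${e.op})")
    case _               => None // e.g. ST_Contains: pushdown lost
  }

  // V2-style translateFilterV2: an unrecognized operator can still be
  // carried through as a name-based predicate.
  def translateFilterV2(e: ToyExpr): Option[String] =
    Some(s"Predicate(${e.op})")

  def main(args: Array[String]): Unit = {
    val spatial = ToyExpr("ST_CONTAINS", Seq("geom", "POINT (1 2)"))
    println(translateFilterV1(spatial)) // None
    println(translateFilterV2(spatial)) // Some(Predicate(ST_CONTAINS))
  }
}
```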

      However, changing FileScanBuilder to use `Predicate` instead of `Filter` would require many changes to all file-based data sources. I don't want to spend effort making sweeping changes if the current behavior of Spark is intentional.
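      For concreteness, a sketch of the kind of signature change in question, again using hypothetical stand-in types rather than Spark's real classes:

```scala
// Stand-in types; ToyFilter / ToyPredicate are hypothetical, not Spark's.
sealed trait ToyFilter
case class ToyPredicate(name: String, children: Seq[String])

// Roughly today's shape: pushed data filters held as the sealed Filter type.
abstract class ToyFileScanBuilder {
  protected var pushedDataFilters: Array[ToyFilter] = Array.empty
}

// Proposed shape: hold DSV2-style predicates instead, so arbitrary named
// predicates (e.g. ST_CONTAINS) survive all the way to the reader factory.
abstract class ToyFileScanBuilderV2 {
  protected var pushedDataPredicates: Array[ToyPredicate] = Array.empty

  def push(p: ToyPredicate): Unit =
    pushedDataPredicates = pushedDataPredicates :+ p

  def pushedCount: Int = pushedDataPredicates.length
}

object SignatureSketch {
  def main(args: Array[String]): Unit = {
    val b = new ToyFileScanBuilderV2 {}
    b.push(ToyPredicate("ST_CONTAINS", Seq("geom", "POINT (1 2)")))
    println(b.pushedCount)
  }
}
```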

       

      Concluding Questions:

      Should FileScanBuilder push `Predicate` instead of `Filter` for data filters pushed down to the source? Or perhaps in a new FileScanBuilderV2?

      If not, how can a data source developer push down a new (or user-defined) predicate to the file source?

      Thank you again for reading. Pending feedback, I will start working on a PR for this functionality.

      @beliefer, @cloud_fan, and @huaxingao have worked on DSV2-related Spark issues and I welcome your input. Please ignore this if I "@" you incorrectly.

      People

            Assignee: Unassigned
            Reporter: RJ Marcus (marcusrm)
            Wenchen Fan
            Votes: 0
            Watchers: 2
