Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-27453

DataFrameWriter.partitionBy is Silently Dropped by DSV1



    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.4.1
    • 2.4.2, 3.0.0
    • SQL
    • None


      This is a long standing quirk of the interaction between DataFrameWriter and CreatableRelationProvider (and the other forms of the DSV1 API). Users can specify columns in partitionBy and our internal data sources will use this information. Unfortunately, for external systems, this data is silently dropped with no feedback given to the user.

      In the long run, I think that DataSourceV2 is a better answer. However, I don't think we should wait for that API to stabilize before offering some kind of solution to developers of external data sources. I also do not think we should break binary compatibility of this API, but I do think that small surgical fix could alleviate the issue.

      I would propose that we could propagate partitioning information (when present) along with the other configuration options passed to the data source in the String, String map.

      I think its very unlikely that there are both data sources that validate extra options and users who are using (no-op) partitioning with them, but out of an abundance of caution we should protect the behavior change behind a legacy flag that can be turned off.




            liwensun Liwen Sun
            marmbrus Michael Armbrust
            Michael Armbrust Michael Armbrust
            0 Vote for this issue
            5 Start watching this issue