Description
I noticed that the PR for SPARK-26188 changed how mixed-case partition columns are handled when the user provides a schema.
Say I have this file structure (note that each instance of `pS` is mixed case):
bash-3.2$ find partitioned5 -type d
partitioned5
partitioned5/pi=2
partitioned5/pi=2/pS=foo
partitioned5/pi=2/pS=bar
partitioned5/pi=1
partitioned5/pi=1/pS=foo
partitioned5/pi=1/pS=bar
bash-3.2$
If I load the file with a user-provided schema in 2.4 (before the PR was committed) or 2.3, I see:
scala> val df = spark.read.schema("intField int, pi int, ps string").parquet("partitioned5")
df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]

scala> df.printSchema
root
 |-- intField: integer (nullable = true)
 |-- pi: integer (nullable = true)
 |-- ps: string (nullable = true)

scala>
However, using 2.4 after the PR was committed, I see:
scala> val df = spark.read.schema("intField int, pi int, ps string").parquet("partitioned5")
df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]

scala> df.printSchema
root
 |-- intField: integer (nullable = true)
 |-- pi: integer (nullable = true)
 |-- pS: string (nullable = true)

scala>
Spark is picking up the mixed-case column name pS from the directory name, not the lower-case ps from my specified schema.
In all tests, spark.sql.caseSensitive is set to the default (false).
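To make the difference concrete, here is a minimal sketch in plain Scala (no Spark dependency) that models the two observed outcomes. It is not Spark's actual implementation: the object name `PartitionNameResolution` and the functions `preferUserSchema` and `preferDirectoryNames` are hypothetical names made up for illustration. The only thing taken from the report is the input (the user schema and the directory-derived partition names) and the resulting column spellings.

```scala
// Hypothetical model of the two observed behaviors; this is NOT Spark's code.
object PartitionNameResolution {
  // Column names from the user-provided schema in the report.
  val userSchema: Seq[String] = Seq("intField", "pi", "ps")

  // Partition column names as they appear in directory names (pi=1/pS=foo).
  val inferredPartitionColumns: Seq[String] = Seq("pi", "pS")

  // Case-insensitive comparison, modeling spark.sql.caseSensitive=false.
  private def sameName(a: String, b: String): Boolean = a.equalsIgnoreCase(b)

  // Outcome seen in 2.3 and in 2.4 before the PR: the user's spelling is kept.
  def preferUserSchema(user: Seq[String], inferred: Seq[String]): Seq[String] =
    user

  // Outcome seen in 2.4 after the PR: for columns that match a partition
  // column case-insensitively, the directory's spelling replaces the user's.
  def preferDirectoryNames(user: Seq[String], inferred: Seq[String]): Seq[String] =
    user.map(u => inferred.find(sameName(u, _)).getOrElse(u))

  def main(args: Array[String]): Unit = {
    println(preferUserSchema(userSchema, inferredPartitionColumns).mkString(", "))
    println(preferDirectoryNames(userSchema, inferredPartitionColumns).mkString(", "))
  }
}
```

Running `main` prints `intField, pi, ps` for the first case and `intField, pi, pS` for the second, matching the two `printSchema` results above.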
Not sure if this is a bug, but it is a difference.
Issue Links
- is caused by SPARK-26188 Spark 2.4.0 Partitioning behavior breaks backwards compatibility (Resolved)