Spark / SPARK-20367

Spark silently escapes partition column names

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.0, 2.2.0
    • Fix Version/s: 2.2.0, 2.3.0
    • Component/s: SQL
    • Labels: None

    Description

      CSV files can have arbitrary column names:

      scala> spark.range(1).select(col("id").as("Column?"), col("id")).write.option("header", true).csv("/tmp/foo")
      scala> spark.read.option("header", true).csv("/tmp/foo").schema
      res1: org.apache.spark.sql.types.StructType = StructType(StructField(Column?,StringType,true), StructField(id,StringType,true))
      

      However, once a column whose name contains a character like "?" is used as a partitioning column, the name is silently escaped, and reading the schema back shows the column name with "?" turned into "%3F":

      scala> spark.range(1).select(col("id").as("Column?"), col("id")).write.partitionBy("Column?").option("header", true).csv("/tmp/bar")
      scala> spark.read.option("header", true).csv("/tmp/bar").schema
      res3: org.apache.spark.sql.types.StructType = StructType(StructField(id,StringType,true), StructField(Column%3F,IntegerType,true))
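
The "%3F" above is the percent-encoded form of "?" (code point 0x3F), which suggests that the column name is escaped when it becomes part of the partition directory name and is not unescaped when the schema is inferred from the path. A minimal sketch of that round trip in plain Scala follows. The helper names `escapeSegment` and `unescapeSegment` and the `unsafe` character set are hypothetical and chosen for illustration; they are not Spark's actual implementation or its exact list of escaped characters.

```scala
import scala.util.matching.Regex

// Illustrative assumption: characters treated as unsafe in a path segment.
// This is NOT Spark's exact list.
val unsafe: Set[Char] = Set('?', '%', '/', '=', ':', '#', '*', '\\', '"', '\'')

// Hypothetical helper: replace each unsafe character with '%' followed by
// its two-digit uppercase hexadecimal code.
def escapeSegment(s: String): String =
  s.map(c => if (unsafe(c)) f"%%${c.toInt}%02X" else c.toString).mkString

// Matches one percent-escaped character, for example "%3F".
val escapedChar: Regex = "%([0-9A-Fa-f]{2})".r

// Hypothetical helper: reverse escapeSegment by decoding every %XX sequence.
def unescapeSegment(s: String): String =
  escapedChar.replaceAllIn(
    s,
    m => Regex.quoteReplacement(Integer.parseInt(m.group(1), 16).toChar.toString)
  )

println(escapeSegment("Column?"))                  // prints Column%3F
println(unescapeSegment(escapeSegment("Column?"))) // prints Column?
```

Under these assumptions, the schema read back from a partitioned path would only match the written one if the reader applied the unescaping step to the column name as well as to the value.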
      

      The same happens for other formats, but I encountered it while working with CSV, since CSV files more often contain ugly schemas.

      Not sure if it's a bug or a feature, but it might be more intuitive to fail queries with invalid characters in the partitioning column name rather than silently escaping the name.
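
The "fail instead of escaping" alternative could be sketched as a check that runs before the write. The helpers `invalidPartitionColumns` and `requireValidPartitionColumns` and the `invalidChars` set below are hypothetical; they only illustrate the idea and are not what Spark does or which characters it considers special.

```scala
// Illustrative assumption: characters treated as invalid in a partition
// column name. This is NOT Spark's exact list.
val invalidChars: Set[Char] = Set('?', '%', '/', '=', ':', '#', '*', '\\', '"', '\'')

// Hypothetical helper: return the names that contain at least one invalid character.
def invalidPartitionColumns(names: Seq[String]): Seq[String] =
  names.filter(_.exists(invalidChars))

// Hypothetical helper: throw IllegalArgumentException instead of letting
// the name be rewritten on disk.
def requireValidPartitionColumns(names: Seq[String]): Unit = {
  val bad = invalidPartitionColumns(names)
  require(
    bad.isEmpty,
    s"Invalid characters in partition column name(s): ${bad.mkString(", ")}"
  )
}
```

A caller would run `requireValidPartitionColumns(Seq("Column?"))` before `write.partitionBy("Column?")`, so the job fails up front with a message naming the offending column.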

          People

            Assignee: Juliusz Sompolski (juliuszsompolski)
            Reporter: Juliusz Sompolski (juliuszsompolski)