Spark / SPARK-31968

write.partitionBy() creates duplicate subdirectories when user provides duplicate columns


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6
    • Fix Version/s: 2.4.7, 3.0.1, 3.1.0
    • Component/s: SQL
    • Labels:
      None

      Description

      I recently noticed that if there are duplicate elements in the argument of write.partitionBy(), the same partition subdirectory will be created multiple times.

      For example: 

      import org.apache.spark.sql.{DataFrame, SaveMode}
      import spark.implicits._
      
      val df: DataFrame = Seq(
        (1, "p1", "c1", 1L),
        (2, "p2", "c2", 2L),
        (2, "p1", "c2", 2L),
        (3, "p3", "c3", 3L),
        (3, "p2", "c3", 3L),
        (3, "p3", "c3", 3L)
      ).toDF("col1", "col2", "col3", "col4")
      
      df.write
        .partitionBy("col1", "col1")  // we have "col1" twice
        .mode(SaveMode.Overwrite)
        .csv("output_dir")

      The above code will produce an output directory with this structure:

      output_dir
        |
        |--col1=1
        |    |--col1=1
        |
        |--col1=2
        |    |--col1=2
        |
        |--col1=3
             |--col1=3

      And we won't be able to read the output back:

      spark.read.csv("output_dir").show()
      // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: `col1`;

      I am not sure whether partitioning a DataFrame twice by the same column makes sense in any real-world application, but it will cause schema inference problems in tools like the AWS Glue crawler.

      Should Spark deduplicate the partition columns, or throw an exception when duplicate columns are detected?

      If this behaviour is unexpected, I will work on a fix. 
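In the meantime, a caller-side workaround is to deduplicate the column list before handing it to partitionBy. A minimal sketch, reusing the df and output path from the example above; partitionCols is a hypothetical variable standing in for a user-supplied (possibly duplicated) column list:

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical user-supplied partition columns, possibly containing duplicates.
val partitionCols = Seq("col1", "col1")

df.write
  // Seq.distinct keeps the first occurrence of each column name,
  // so only a single level of col1=... directories is created.
  .partitionBy(partitionCols.distinct: _*)
  .mode(SaveMode.Overwrite)
  .csv("output_dir")
```

This only papers over the issue at the call site; the question above still stands as to whether Spark itself should deduplicate or fail fast.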

        Attachments

          Activity

            People

            • Assignee:
              JinxinTang
            • Reporter:
              Xuzhou Qin (qxzzxq)

