Spark / SPARK-25480

Dynamic partitioning + saveAsTable with multiple partition columns create empty directory


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      We use .saveAsTable with dynamic partitioning as our only way to write data to S3 from Spark.

      When only one partition column is defined for a table, .saveAsTable behaves as expected:

      • with Overwrite mode, it creates the table if it doesn't exist and writes the data
      • with Append mode, it appends to the given partition
      • with Overwrite mode, if the table exists, it overwrites the partition

      If two partition columns are used, however, the directory is created on S3 along with the _SUCCESS file, but no data is actually written.

      Our workaround is to check whether the table exists and, if it does not, set the partition-overwrite mode back to static before running saveAsTable:

      spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
      df.write.mode("overwrite").partitionBy("year", "month").option("path", "s3://hbc-data-warehouse/integration/users_test").saveAsTable("users_test")
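      A minimal sketch of the workaround described above, in PySpark. The helper function and the table/path names are illustrative, not part of the original report; `spark.catalog.tableExists` is assumed to be available (it is part of the Catalog API in recent Spark versions — on older versions an equivalent existence check would be needed):

      ```python
      def choose_partition_overwrite_mode(table_exists: bool) -> str:
          """Pick the partitionOverwriteMode for a saveAsTable call.

          Workaround for SPARK-25480: with multiple partition columns, a
          dynamic overwrite into a table that does not exist yet produces
          only the directory and _SUCCESS file, so fall back to static
          mode for the first write.
          """
          return "dynamic" if table_exists else "static"


      # Hypothetical usage with a live SparkSession (names are illustrative):
      #
      # mode = choose_partition_overwrite_mode(spark.catalog.tableExists("users_test"))
      # spark.conf.set("spark.sql.sources.partitionOverwriteMode", mode)
      # df.write.mode("overwrite").partitionBy("year", "month") \
      #     .option("path", "s3://hbc-data-warehouse/integration/users_test") \
      #     .saveAsTable("users_test")
      ```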
      

       

      Attachments

        1. dynamic_partitioning.json
          13 kB
          Daniel Mateus Pires

        Activity

          People

            Assignee: Unassigned
            Reporter: Daniel Mateus Pires (dmateusp)
            Votes: 1
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: