Spark / SPARK-25480

Dynamic partitioning + saveAsTable with multiple partition columns create empty directory


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:

      Description

      We use .saveAsTable and dynamic partitioning as our only way to write data to S3 from Spark.

      When only 1 partition column is defined for a table, .saveAsTable behaves as expected:

      • with Overwrite mode it will create a table if it doesn't exist and write the data
      • with Append mode it will append to a given partition
      • with Overwrite mode if the table exists it will overwrite the partition

      If two partition columns are used, however, the directory is created on S3 along with the SUCCESS file, but no data is actually written.

      Our workaround is to check whether the table exists and, if it does not, set the partitioning mode back to static before calling saveAsTable:

      spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
      df.write.mode("overwrite").partitionBy("year", "month").option("path", "s3://hbc-data-warehouse/integration/users_test").saveAsTable("users_test")
      
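The workaround described above can be sketched in PySpark. The mode-selection logic is factored into a small helper so it can be shown on its own; `spark`, `df`, and the use of `spark.catalog.listTables()` as the existence check are assumptions for illustration, not part of the original report:

```python
def overwrite_mode_for(table_exists: bool) -> str:
    """Pick a value for spark.sql.sources.partitionOverwriteMode.

    Per the issue above, dynamic overwrite with multiple partition
    columns fails on the initial write, so fall back to static mode
    when the table does not exist yet.
    """
    return "dynamic" if table_exists else "static"


# Hypothetical usage against a live SparkSession (names taken from
# the snippet above; the existence check is one possible approach):
#
# exists = any(t.name == "users_test" for t in spark.catalog.listTables())
# spark.conf.set("spark.sql.sources.partitionOverwriteMode",
#                overwrite_mode_for(exists))
# (df.write.mode("overwrite")
#    .partitionBy("year", "month")
#    .option("path", "s3://hbc-data-warehouse/integration/users_test")
#    .saveAsTable("users_test"))
```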

       

        People

        • Assignee: Unassigned
        • Reporter: Daniel Mateus Pires (dmateusp)
