Spark / SPARK-30769

insertInto() with existing column as partition key causes weird partition result


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.4
    • Fix Version/s: None
    • Component/s: SQL
    • Environment: EMR 5.29.0 with Spark 2.4.4

    Description

      // Build the fully qualified table name and a writer for the DataFrame,
      // adding the partition date as a literal string column.
      val tableName = s"${config.service}_$saveDatabase.${config.table}_partitioned"
      val writer = TableWriter.getWriter(tableDF.withColumn(config.dateColumn, typedLit[String](date.toString)))

      // If the table already exists, append a new partition; otherwise create it.
      if (xsc.tableExistIn(config.service, saveDatabase, s"${config.table}_partitioned")) writer.insertInto(tableName)
      else writer.partitionBy(config.dateColumn).saveAsTable(tableName)

      This code checks whether the table exists at the desired path (somewhere in S3 in this case). If the table already exists there, a new partition is inserted with insertInto(); otherwise the table is created with saveAsTable().
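      For reference, a rough equivalent of the existence check using Spark's public Catalog API (xsc.tableExistIn above is a project-specific helper, so this stand-in is an assumption, not the reporter's code):

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

      // tableExists(dbName, tableName) consults the catalog/metastore; the
      // database and table names mirror the snippet above.
      val exists = spark.catalog.tableExists(
        s"${config.service}_$saveDatabase",
        s"${config.table}_partitioned"
      )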

      If config.dateColumn does not exist in the table schema, no problem occurs (the new column is simply added). But if it already exists in the schema, Spark does not use the given column as the partition key; instead it creates hundreds of partitions. Below is an excerpt from the Spark logs.
      (Note that the partition column is date_ymd, which already exists in the source table; its original values are date strings like '2020-01-01'.)

      20/02/10 05:33:01 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=174 s3://{my_path_at_s3}_partitioned_test/date_ymd=174
      20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=62 s3://{my_path_at_s3}_partitioned_test/date_ymd=62
      20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=83 s3://{my_path_at_s3}_partitioned_test/date_ymd=83
      20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=231 s3://{my_path_at_s3}_partitioned_test/date_ymd=231
      20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=268 s3://{my_path_at_s3}_partitioned_test/date_ymd=268
      20/02/10 05:33:04 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=33 s3://{my_path_at_s3}_partitioned_test/date_ymd=33
      20/02/10 05:33:05 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=40 s3://{my_path_at_s3}_partitioned_test/date_ymd=40
      rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=__HIVE_DEFAULT_PARTITION__ s3://{my_path_at_s3}_partitioned_test/date_ymd=__HIVE_DEFAULT_PARTITION__
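
      These partition values look consistent with a known behavior: unlike saveAsTable(), insertInto() resolves columns by position, not by name, and withColumn() on an existing column replaces it in place rather than appending it. If the DataFrame's column order no longer matches the table schema (where the partition column is stored last), values from a different column can land in the partition slot. A minimal sketch of that positional behavior, with made-up table and column names:

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.typedLit

      val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
      import spark.implicits._

      // Source data whose schema already contains the partition column.
      val df = Seq(("2019-12-31", "a", 174), ("2019-12-31", "b", 62))
        .toDF("date_ymd", "value", "count")

      // First run: saveAsTable() stores the partition column last in the
      // table schema, i.e. (value, count, date_ymd).
      df.write.partitionBy("date_ymd").saveAsTable("default.repro_partitioned")

      // Later run: withColumn() replaces date_ymd in place, so the DataFrame
      // order is still (date_ymd, value, count). insertInto() matches columns
      // by position, so the integer `count` values end up in the date_ymd
      // partition slot, yielding partitions like date_ymd=174.
      df.withColumn("date_ymd", typedLit[String]("2020-01-01"))
        .write.insertInto("default.repro_partitioned")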

      When I use a different partition key that is not in the table schema, such as 'stamp_date', everything works fine. I'm not sure whether this is a Spark bug; I just wrote the report. (I think it is related to Hive...)
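
      A commonly suggested workaround (an assumption on my part, not something verified in this report) is to reorder the DataFrame's columns to match the target table's schema before calling insertInto(), since insertInto() resolves columns by position:

      import org.apache.spark.sql.functions.{col, typedLit}

      // tableDF and config mirror the snippet at the top of this report;
      // .write stands in for the project-specific TableWriter helper.
      val withDate = tableDF.withColumn(config.dateColumn, typedLit[String](date.toString))
      val aligned = withDate.select(spark.table(tableName).columns.map(col): _*)
      aligned.write.insertInto(tableName)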

      Thanks for reading!


          People

            Assignee: Unassigned
            Reporter: Woong Seok Kang (nephtyws)
            Votes: 0
            Watchers: 1
