Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.4.4
- Fix Version/s: None
- Environment: EMR 5.29.0 with Spark 2.4.4
Description
val tableName = s"${config.service}_$saveDatabase.${config.table}_partitioned"
val writer = TableWriter.getWriter(
  tableDF.withColumn(config.dateColumn, typedLit[String](date.toString)))

if (xsc.tableExistIn(config.service, saveDatabase, s"${config.table}_partitioned"))
  writer.insertInto(tableName)
else
  writer.partitionBy(config.dateColumn).saveAsTable(tableName)
This code checks whether the table already exists at the desired path (somewhere in S3 in this case). If the table exists, it appends a new partition with insertInto(); otherwise it creates the table with saveAsTable().
If config.dateColumn does not exist in the table schema, there is no problem: the new column is simply added. But if the column already exists in the schema, Spark does not use the given column's value as the partition key; instead it creates hundreds of partitions. Below is a part of the Spark logs:
(Note that the partition column is named date_ymd, which already exists in the source table. Its original value is a date string like '2020-01-01'.)
20/02/10 05:33:01 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=174 s3://{my_path_at_s3}_partitioned_test/date_ymd=174
20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=62 s3://{my_path_at_s3}_partitioned_test/date_ymd=62
20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=83 s3://{my_path_at_s3}_partitioned_test/date_ymd=83
20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=231 s3://{my_path_at_s3}_partitioned_test/date_ymd=231
20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=268 s3://{my_path_at_s3}_partitioned_test/date_ymd=268
20/02/10 05:33:04 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=33 s3://{my_path_at_s3}_partitioned_test/date_ymd=33
20/02/10 05:33:05 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=40 s3://{my_path_at_s3}_partitioned_test/date_ymd=40
rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=__HIVE_DEFAULT_PARTITION__ s3://{my_path_at_s3}_partitioned_test/date_ymd=__HIVE_DEFAULT_PARTITION__
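A plausible explanation (my assumption, not confirmed in this report): insertInto matches columns by position, not by name, and saveAsTable with partitionBy stores the partition column as the last column of the table schema. On the next run the DataFrame still has date_ymd in its original position, so the table's last column — the partition column — receives values from a different column, producing partitions like date_ymd=174. A minimal plain-Scala sketch of that positional mismatch (all column names and values here are made up for illustration):

```scala
// Hypothetical illustration of position-based column matching; this is a
// sketch, not the report's actual data.
object PositionalInsertDemo {
  // Column order the table stores after saveAsTable: partition column last.
  val tableColumns = Seq("id", "count", "date_ymd")
  // Column order of the DataFrame on the next run: date_ymd still in the
  // middle, as in the source schema.
  val dfColumns = Seq("id", "date_ymd", "count")
  // One example row, keyed by column name.
  val row = Map("id" -> "42", "date_ymd" -> "2020-01-01", "count" -> "174")

  // Name-based matching (what one might expect): each table column gets
  // the value of the same-named DataFrame column.
  def byName: Seq[String] = tableColumns.map(row)

  // Position-based matching (insertInto's behavior): the i-th DataFrame
  // value lands in the i-th table column, so date_ymd (last table column)
  // receives the DataFrame's last value -- "count"'s 174.
  def partitionValueByPosition: String = {
    val values = dfColumns.map(row)           // values in DataFrame order
    values(tableColumns.indexOf("date_ymd"))  // the table's partition slot
  }
}
```

Under positional matching the partition value becomes "174" rather than the date string, which would match the numeric partition directories in the logs above.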
When I use a different partition key that is not in the table schema, such as 'stamp_date', everything works fine. I'm not sure whether this is a Spark bug; I just wrote the report. (I think it is related to Hive.)
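If positional matching is indeed the cause, one possible workaround (a sketch under that assumption, not a confirmed fix; insertAligned is a hypothetical helper) is to reorder the DataFrame's columns to match the target table's schema before inserting:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: align the DataFrame's column order with the target table's
// schema before insertInto, since insertInto matches columns by position.
def insertAligned(df: DataFrame, tableName: String): Unit = {
  val targetOrder = df.sparkSession.table(tableName).columns.toSeq
  df.select(targetOrder.map(col): _*).write.insertInto(tableName)
}
```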
Thanks for reading!