Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.4.4
- Fix Version/s: None
- Environment: EMR 5.29.0 with Spark 2.4.4
Description
val tableName = s"${config.service}_$saveDatabase.${config.table}_partitioned"
val writer = TableWriter.getWriter(
  tableDF.withColumn(config.dateColumn, typedLit[String](date.toString)))

if (xsc.tableExistIn(config.service, saveDatabase, s"${config.table}_partitioned"))
  writer.insertInto(tableName)
else
  writer.partitionBy(config.dateColumn).saveAsTable(tableName)
This code checks whether the table already exists at the desired path (somewhere in S3 in this case). If the table exists, it appends a new partition with insertInto(); otherwise it creates the table with saveAsTable().
If config.dateColumn does not exist in the table schema, there is no problem: the new column is simply added. But if the column already exists in the schema, Spark does not use the given column's value as the partition key; instead it creates hundreds of partitions. Below is a part of the Spark logs:
(Note that the partition column is named date_ymd, which already exists in the source table. Its original value is a date string like '2020-01-01'.)
20/02/10 05:33:01 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=174 s3://{my_path_at_s3}_partitioned_test/date_ymd=174
20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=62 s3://{my_path_at_s3}_partitioned_test/date_ymd=62
20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=83 s3://{my_path_at_s3}_partitioned_test/date_ymd=83
20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=231 s3://{my_path_at_s3}_partitioned_test/date_ymd=231
20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=268 s3://{my_path_at_s3}_partitioned_test/date_ymd=268
20/02/10 05:33:04 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=33 s3://{my_path_at_s3}_partitioned_test/date_ymd=33
20/02/10 05:33:05 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=40 s3://{my_path_at_s3}_partitioned_test/date_ymd=40
rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=__HIVE_DEFAULT_PARTITION__ s3://{my_path_at_s3}_partitioned_test/date_ymd=__HIVE_DEFAULT_PARTITION__
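A plausible explanation (my assumption, not confirmed in this report): insertInto matches columns by position, not by name, and saveAsTable with partitionBy stores the partition column as the last column of the table schema. On the next run the DataFrame still has date_ymd in its original position, so the table's last column — the partition column — receives values from a different column, producing partitions like date_ymd=174. A minimal plain-Scala sketch of that positional mismatch (all column names and values here are made up for illustration):

```scala
// Hypothetical illustration of position-based column matching; this is a
// sketch, not the report's actual data.
object PositionalInsertDemo {
  // Column order the table stores after saveAsTable: partition column last.
  val tableColumns = Seq("id", "count", "date_ymd")
  // Column order of the DataFrame on the next run: date_ymd still in the
  // middle, as in the source schema.
  val dfColumns = Seq("id", "date_ymd", "count")
  // One example row, keyed by column name.
  val row = Map("id" -> "42", "date_ymd" -> "2020-01-01", "count" -> "174")

  // Name-based matching (what one might expect): each table column gets
  // the value of the same-named DataFrame column.
  def byName: Seq[String] = tableColumns.map(row)

  // Position-based matching (insertInto's behavior): the i-th DataFrame
  // value lands in the i-th table column, so date_ymd (last table column)
  // receives the DataFrame's last value -- "count"'s 174.
  def partitionValueByPosition: String = {
    val values = dfColumns.map(row)           // values in DataFrame order
    values(tableColumns.indexOf("date_ymd"))  // the table's partition slot
  }
}
```

Under positional matching the partition value becomes "174" rather than the date string, which would match the numeric partition directories in the logs above.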
When I use a different partition key that is not in the table schema, such as 'stamp_date', everything works fine. I'm not sure whether this is a Spark bug; I just wrote the report. (I think it is related to Hive.)
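If positional matching is indeed the cause, one possible workaround (a sketch under that assumption, not a confirmed fix; insertAligned is a hypothetical helper) is to reorder the DataFrame's columns to match the target table's schema before inserting:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: align the DataFrame's column order with the target table's
// schema before insertInto, since insertInto matches columns by position.
def insertAligned(df: DataFrame, tableName: String): Unit = {
  val targetOrder = df.sparkSession.table(tableName).columns.toSeq
  df.select(targetOrder.map(col): _*).write.insertInto(tableName)
}
```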
Thanks for reading!