Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Fix Version/s: 2.0.2, 2.1.0
- Component/s: EC2, Spark Core, SQL, YARN
- Labels: None
- Flags: Patch, Important
Description
When using DataFrame write in append mode on object stores (S3 / Google Storage), writes take a long time or hit read timeouts. This is because dataframe.write lists all leaf folders in the target directory; when partitioning has produced a large number of subfolders, that listing takes forever.
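For illustration, an append of this shape is what triggers the listing; the paths and partition columns below are hypothetical:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("append-example").getOrCreate()
val df = spark.read.parquet("s3a://bucket/staging/events") // hypothetical source

// Before any data is written, DataSource.write() resolves the existing
// relation at the target path, which recursively lists every partition
// directory: one or more RPCs per directory on S3 / Google Storage.
df.write
  .mode(SaveMode.Append)
  .partitionBy("year", "month", "day")
  .parquet("s3a://bucket/warehouse/events")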
In org.apache.spark.sql.execution.datasources.DataSource.write(), the following code causes a huge number of RPC calls when the file system is an object store (S3, GS):
if (mode == SaveMode.Append) {
  // resolveRelation() rebuilds the relation for the existing data, and
  // partitionSpec() discovers its partition columns by listing every leaf
  // directory under the target path, which is what makes appends so slow
  // on object stores.
  val existingPartitionColumns = Try {
    resolveRelation()
      .asInstanceOf[HadoopFsRelation]
      .location
      .partitionSpec()
      .partitionColumns
      .fieldNames
      .toSeq
  }.getOrElse(Seq.empty[String])
  // ... (remainder of write() elided)
There should be a flag to skip the partition-match check in append mode. I can work on the patch.
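A rough sketch of what such a flag could look like, as it would sit inside DataSource.write() reusing the surrounding identifiers (the configuration key spark.sql.sources.skipPartitionMatchCheck is hypothetical, not an actual Spark option, and the fix that eventually shipped may differ):

// Hypothetical config key; name and default are illustrative only.
val skipPartitionCheck = sparkSession.conf
  .get("spark.sql.sources.skipPartitionMatchCheck", "false").toBoolean

val existingPartitionColumns =
  if (mode == SaveMode.Append && !skipPartitionCheck) {
    // Pay for the full leaf-directory listing only when the check is enabled.
    Try {
      resolveRelation()
        .asInstanceOf[HadoopFsRelation]
        .location
        .partitionSpec()
        .partitionColumns
        .fieldNames
        .toSeq
    }.getOrElse(Seq.empty[String])
  } else {
    Seq.empty[String]
  }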
Attachments
Issue Links
- is depended upon by
  - HADOOP-13525 Optimize uses of FS operations in the ASF analysis frameworks and libraries (Resolved)
- links to