Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: 2.0.2, 2.1.0
- Component/s: EC2, Spark Core, SQL, YARN
- Labels: None
- Flags: Patch, Important
Description
When using DataFrame write in append mode on object stores (S3 / Google Storage), writes take a very long time or hit read timeouts. This is because dataframe.write lists all leaf folders in the target directory; if there are many subfolders due to partitioning, this listing takes forever.
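For concreteness, a minimal sketch of the kind of write that triggers the slow path (the bucket, paths, and partition columns are illustrative):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("append-to-object-store").getOrCreate()
val df = spark.read.parquet("s3a://my-bucket/staging/events")  // illustrative input

// Appending into a heavily partitioned target: before writing, Spark lists
// all existing leaf directories under the target path to recover its
// partition columns, which is very slow on S3 / Google Storage.
df.write
  .mode(SaveMode.Append)
  .partitionBy("year", "month", "day")
  .parquet("s3a://my-bucket/events")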
The problem is in org.apache.spark.sql.execution.datasources.DataSource.write(): the following code causes a huge number of RPC calls when the file system is an object store (S3, GS), because resolving the relation recursively lists every leaf partition directory, and on an object store each listing is a separate HTTP round trip.
if (mode == SaveMode.Append) {
  val existingPartitionColumns = Try {
    resolveRelation()
      .asInstanceOf[HadoopFsRelation]
      .location
      .partitionSpec()
      .partitionColumns
      .fieldNames
      .toSeq
  }.getOrElse(Seq.empty[String])
  // ... the discovered columns are then compared against the write's partition columns
}
There should be a flag to skip the partition match check in append mode. I can work on the patch.
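A minimal sketch of what such a flag could gate (the parameter names and the boolean wiring are assumptions for illustration; the actual configuration key and plumbing would be decided in the patch):

import scala.util.Try

// Hypothetical sketch: gate the expensive partition discovery behind a flag.
// Names here are illustrative, not actual Spark configuration entries.
def existingPartitionColumns(
    skipPartitionMatchCheck: Boolean,             // would be read from a config entry
    discoverPartitionColumns: () => Seq[String]   // wraps the resolveRelation() listing
  ): Seq[String] = {
  if (skipPartitionMatchCheck) {
    // Skip the O(number-of-partitions) object-store listing entirely.
    Seq.empty[String]
  } else {
    Try(discoverPartitionColumns()).getOrElse(Seq.empty[String])
  }
}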
Issue Links
- is depended upon by: HADOOP-13525 Optimize uses of FS operations in the ASF analysis frameworks and libraries (Resolved)
- links to