Spark / SPARK-25292

Dataframe write to CSV saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDirectory

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.0.2
    • Fix Version/s: None
    • Labels: None

      Description

      https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating

      Running Spark 2.0.2 in standalone cluster mode; 2 workers and 1 master node.

      Simple test: reading a pipe-delimited file and writing the data out as CSV. The commands below are executed in spark-shell with the master URL set.

      {code:scala}
      // Read the pipe-delimited input; the null-character quote disables quoting.
      val df = spark.sqlContext.read
        .option("delimiter", "|")
        .option("quote", "\u0000")
        .csv("/home/input-files/")

      // Keep only the 'EML' rows, then write them out as CSV in 100 partitions.
      val emailDf = df.filter("_c3='EML'")
      emailDf.repartition(100).write.csv("/opt/outputFile/")
      {code}

      After executing the commands above, the two workers behave differently:

      • On worker1, each part file is created under {{/opt/outputFile/_temporary/task-xxxxx-xxx/part-xxx-xxx}}.
      • On worker2, the part files are generated directly under the output directory specified during write: {{/opt/outputFile/part-xxx}}.

      The same thing happens with coalesce(100), or without specifying repartition/coalesce at all.
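
      This behaviour is consistent with the output path being on each worker's local disk rather than on storage shared across the cluster; in that case the job-commit step on the driver cannot see the files written on the other worker. A minimal sketch of the commonly suggested workaround, pointing the same pipeline at a shared filesystem instead (the HDFS namenode address below is a placeholder, not from this report):

      {code:scala}
      // Sketch only: same pipeline as above, but reading from and writing to a
      // shared filesystem (HDFS) so every task's output is visible to the
      // driver when the job commits. "namenode:8020" is a placeholder.
      val df = spark.sqlContext.read
        .option("delimiter", "|")
        .option("quote", "\u0000")
        .csv("hdfs://namenode:8020/input-files/")

      val emailDf = df.filter("_c3='EML'")
      emailDf.repartition(100).write.csv("hdfs://namenode:8020/outputFile/")
      {code}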

      Question

      1) Why doesn't worker1's /opt/outputFile/ output directory contain part-xxxx files directly, as on worker2? Why is a _temporary directory created, with the part-xxx-xx files residing in the {{task-xxx}} directories?
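
      For reference, the {{_temporary}}/{{task-xxx}} layout comes from Hadoop's FileOutputCommitter, which Spark's file-based data sources use underneath: each task writes into a temporary attempt directory, and the files are only moved to the final output directory during job commit on the driver. A rough sketch of the two rename steps involved (paths and task IDs are illustrative, not taken from an actual run):

      {code:scala}
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}

      val output = new Path("/opt/outputFile")
      val fs = FileSystem.get(output.toUri, new Configuration())

      // Step 1 - task commit (runs on the executor): the task attempt
      // directory is renamed to a committed task directory under _temporary.
      val attemptDir = new Path(output, "_temporary/0/_temporary/attempt_000_m_000001_0")
      val taskDir    = new Path(output, "_temporary/0/task_000_m_000001")
      fs.rename(attemptDir, taskDir)

      // Step 2 - job commit (runs on the driver): committed task files are
      // moved into the final output directory and _temporary is deleted.
      // If the output path is a worker-local disk, the driver never sees
      // worker1's files, so this step is skipped for them and the task-xxx
      // directories are left behind.
      fs.rename(new Path(taskDir, "part-00001-xxx.csv"), new Path(output, "part-00001-xxx.csv"))
      fs.delete(new Path(output, "_temporary"), true)
      {code}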

    Attachments

    Issue Links

    Activity

    People

    • Assignee: Unassigned
    • Reporter: omkar999 (omkar puttagunta)
    • Votes: 0
    • Watchers: 1

    Dates

    • Created:
    • Updated:
    • Resolved: