Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Duplicate
Affects Version/s: 2.0.2
Fix Version/s: None
Component/s: None
Description
Running Spark 2.0.2 in standalone cluster mode: 2 workers and 1 master node.
Simple test: reading a pipe-delimited file and writing the data to CSV. The commands below are executed in spark-shell with the master URL set.
{code:scala}
// Read the pipe-delimited input (quote set to \u0000 to disable quoting)
val df = spark.sqlContext.read.option("delimiter", "|").option("quote", "\u0000").csv("/home/input-files/")
// Keep only the rows whose fourth column is 'EML'
val emailDf = df.filter("_c3='EML'")
// Write the filtered data back out as CSV in 100 partitions
emailDf.repartition(100).write.csv("/opt/outputFile/")
{code}
After executing the commands above:

In worker1 -> each part file is created in {{/opt/outputFile/_temporary/task-xxxxx-xxx/part-xxx-xxx}}
In worker2 -> part files are generated directly under the output directory specified in the write, i.e. {{/opt/outputFile/part-xxx}}

The same thing happens with {{coalesce(100)}}, or without specifying repartition/coalesce at all.
Question
1) Why does the output directory {{/opt/outputFile/}} on worker1 not contain part-xxxx files directly, as on worker2? Why is a {{_temporary}} directory created, with the part-xxx-xx files residing in the {{task-xxx}} directories?
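The {{_temporary/task-xxx}} layout matches the intermediate output of the Hadoop file commit protocol that Spark uses for file sinks: final {{part-xxx}} files are produced by renaming that intermediate output into the output directory at commit time. The job-level commit can only rename files on a filesystem it can reach, which would explain why output written to one worker's local disk is left under {{_temporary}}. Below is a minimal sketch of the same write against a filesystem shared by all nodes; the HDFS host, port, and paths are hypothetical placeholders, not part of this report.

{code:scala}
// Sketch only: the same commands, but targeting a filesystem that every
// node can see (hypothetical HDFS URIs; adjust to the actual cluster).
val df = spark.sqlContext.read.option("delimiter", "|").option("quote", "\u0000").csv("hdfs://namenode:8020/home/input-files/")
val emailDf = df.filter("_c3='EML'")
// On a shared filesystem the commit step can rename every task's
// _temporary/task-xxx/part-xxx output into the final directory, so the
// output path ends up holding only part-xxx files plus a _SUCCESS marker.
emailDf.repartition(100).write.csv("hdfs://namenode:8020/opt/outputFile/")
{code}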
Issue Links
- duplicates SPARK-25293: "Dataframe write to csv saves part files in outputDireotry/task-xx/part-xxx instead of directly saving in outputDir" (Resolved)