This looks like a consistency problem: S3 listings always lag the creation/deletion/update of objects.
The committer has listed the paths to merge in, then gone through each one to check its type: if an entry isn't a file, it lists the subdirectory, and, interestingly, gets an exception at that point. Perhaps the first listing found an object which was no longer there by the time the second listing ran; that is, the exception isn't a delay-on-create, it's a delay-on-delete. A simplified sketch of that walk follows.
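To make the failure point concrete, here is a hypothetical, heavily simplified version of that merge walk; this is not the actual committer code, just a structure inferred from the description above:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Simplified sketch of the merge walk described above; not the real committer code. */
public class MergeWalk {
  static void merge(FileSystem fs, Path src) throws IOException {
    for (FileStatus entry : fs.listStatus(src)) {   // first listing
      if (entry.isFile()) {
        // rename/copy the file into place (elided)
      } else {
        // the second listing happens inside the recursive call; if the entry
        // was stale (already deleted), listStatus throws FileNotFoundException there
        merge(fs, entry.getPath());
      }
    }
  }
}
```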
Create-listing delays could be handled in the committer by retrying on an FNFE (see the retry sketch below); it'd slightly increase the time before a failure, but as that's a failure path, it's not too serious. Delete delays could be addressed the opposite way: ignore the problem, on the basis that if the listing failed, there's no file to rename. That's more worrying, as it's a sign of a problem which could have implications further up the commit process: things are changing in the listing of files being renamed.
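As a rough illustration of the retry-on-FNFE idea (again a sketch, not committer code; the attempt count and sleep interval are made-up values):

```java
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical helper: bounded retry on FNFE to ride over create-listing delays. */
public class ListingWithRetry {

  static FileStatus[] listWithRetry(FileSystem fs, Path dir)
      throws IOException, InterruptedException {
    final int maxAttempts = 3;      // assumption: small, bounded attempt count
    final long sleepMillis = 1000;  // assumption: fixed sleep between attempts
    FileNotFoundException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return fs.listStatus(dir);
      } catch (FileNotFoundException e) {
        last = e;                   // entry from the parent listing isn't visible yet
        Thread.sleep(sleepMillis);
      }
    }
    throw last;                     // still missing: fail as before, just a bit later
  }
}
```

Handling a delete delay would be the inverse: catch the FNFE and skip the entry, since there's nothing left to rename.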
HADOOP-13345 is going to address list inconsistency; I'm doing a committer there which I could also try to make more robust even when not using a DynamoDB-backed bucket. The question is: what is a good retry policy here, especially given that once an inconsistency has surfaced, a large amount of the merge may already have taken place? Backing up and retrying may be dangerous in a different way.
One thing I would recommend trying: commit to HDFS, then copy. Do that and you can turn speculation on in your executors, get local virtual-HDD performance and networking, as well as a consistent view. Copy to s3a once all the work you want done is complete.
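A minimal sketch of that workflow (the paths and bucket name here are placeholders, not anything from the original setup):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

/**
 * Sketch: let the job commit its output to HDFS, then upload the
 * final result to S3 via s3a in one bulk copy afterwards.
 * Paths and bucket name are placeholders.
 */
public class CommitThenUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Path hdfsOutput = new Path("hdfs:///user/alice/job-output");  // committed job output
    Path s3Destination = new Path("s3a://my-bucket/job-output");  // final destination

    FileSystem hdfs = hdfsOutput.getFileSystem(conf);
    FileSystem s3a = s3Destination.getFileSystem(conf);

    // Only copy once the job has fully committed to HDFS; the commit itself
    // never touches S3, so listing inconsistency can't break it.
    FileUtil.copy(hdfs, hdfsOutput, s3a, s3Destination,
        false /* don't delete source */, conf);
  }
}
```

For large outputs, distcp is the usual tool for that final copy step.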