
[SPARK-18883] FileNotFoundException on _temporary directory

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.2
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None
    • Environment:
      We're on CDH 5.7, Hadoop 2.6.

Description

I'm experiencing the following exception, usually after some time under heavy load:

      16/12/15 11:25:18 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
      java.io.FileNotFoundException: File hdfs://nameservice1/user/xdstore/rfs/rfsDB/_temporary/0 does not exist.
              at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:795)
              at org.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:106)
              at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:853)
              at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:849)
              at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
              at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:860)
              at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
              at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
              at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:291)
              at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:361)
              at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
              at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
              at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
              at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
              at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
              at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
              at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
              at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
              at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
              at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
              at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
              at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
              at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
              at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
              at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
              at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
              at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
              at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
              at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
              at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:525)
              at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
              at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
              at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
              at com.bluedme.woda.ng.indexer.RfsRepository.append(RfsRepository.scala:36)
              at com.bluedme.woda.ng.indexer.RfsRepository.insert(RfsRepository.scala:23)
              at com.bluedme.woda.cmd.ShareDatasetImpl.runImmediate(ShareDatasetImpl.scala:33)
              at com.bluedme.woda.cmd.ShareDatasetImpl.runImmediate(ShareDatasetImpl.scala:13)
              at com.bluedme.woda.cmd.ImmediateCommandImpl$$anonfun$run$1.apply(CommandImpl.scala:21)
              at com.bluedme.woda.cmd.ImmediateCommandImpl$$anonfun$run$1.apply(CommandImpl.scala:21)
              at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
              at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
              at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
              at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
              at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
              at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
              at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
      

Looks similar to SPARK-18512, although it's not the same environment: no streaming and no S3 here, and the final path in the stack is different.
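For context, here is a minimal sketch (Scala; the method and parameter names are illustrative, not the reporter's actual RfsRepository code) of the write pattern at the bottom of the stack trace, with a note on how the v1 FileOutputCommitter uses _temporary. One plausible way to hit the exception, consistent with the "called twice" theory below, is two jobs committing into the same target path at once:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Sketch of the failing write path: the stack trace shows an append
    // ending in DataFrameWriter.parquet. `df` and `path` are assumed.
    def append(df: DataFrame, path: String): Unit = {
      // With the v1 committer, tasks write under <path>/_temporary/0 and
      // commitJob then renames the committed files into place and deletes
      // _temporary. If a second job commits into the same path concurrently,
      // the first commit can delete _temporary out from under it, which
      // surfaces as the FileNotFoundException above.
      df.write.mode(SaveMode.Append).parquet(path)
    }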


Activity

Mathieu DESPRIEE added a comment -

As suggested by Steve Loughran, I'm going to try mapreduce.fileoutputcommitter.algorithm.version = 2 and will update this ticket.
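A sketch of how that setting can be applied, assuming it is done at session construction. The key is a Hadoop MapReduce property, so it needs the spark.hadoop. prefix when set through the Spark config (it can equally be passed with --conf on spark-submit):

    import org.apache.spark.sql.SparkSession

    // Hypothetical session setup; the app name is illustrative.
    val spark = SparkSession.builder()
      .appName("rfs-indexer")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    // Equivalent once a session exists:
    // spark.sparkContext.hadoopConfiguration
    //   .set("mapreduce.fileoutputcommitter.algorithm.version", "2")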
Sean Owen added a comment -

I'm not clear on why it's assumed to be a separate issue. Steve Loughran?
Steve Loughran added a comment -

If it surfaces on HDFS, it's not an S3 consistency issue; more likely something is up with the commit process, like it's been called twice. It could have the same root cause, or it could be a sign of something wrong with the commit protocol. Either way, the work in HADOOP-13786 and HADOOP-13445 isn't going to fix it.

If you think it's the same, it'd be best to start at the protocol layer and work down, maybe by adding some extra logging into Spark's HadoopMapReduceCommitProtocol or Hadoop's org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter, which could do a quick listing of the parent dir if the temp path is missing. I might see if I can do something minimal there to get into Hadoop 2.8. Of course, if you were to supply a patch there, I could be the one to review it, so it'd be easier to get in.

For now, set the commit algorithm to 2: at the very least the problem will move, as the rename/merge operations get pushed out into the individual task commits.
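A minimal sketch of the kind of diagnostic listing described above (not the actual Hadoop patch): if the job attempt path under _temporary is missing at commit time, list its parent so the log shows what actually survived. The helper name and where it would be called from are assumptions:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Hypothetical helper, to be invoked just before commitJob lists the
    // job attempt path (e.g. <out>/_temporary/0).
    def logIfTempMissing(jobAttemptPath: Path, conf: Configuration): Unit = {
      val fs = jobAttemptPath.getFileSystem(conf)
      if (!fs.exists(jobAttemptPath)) {
        val parent = jobAttemptPath.getParent
        val contents =
          if (fs.exists(parent)) fs.listStatus(parent).map(_.getPath.getName).mkString(", ")
          else s"parent $parent is missing too"
        // Log rather than throw, so the committer's own error still surfaces.
        System.err.println(s"$jobAttemptPath not found at commit time; parent contains: $contents")
      }
    }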
Mathieu DESPRIEE added a comment -

The problem has not appeared with mapreduce.fileoutputcommitter.algorithm.version=2 so far.
Steve Loughran added a comment -

Thanks, good to know.

People

    • Assignee: Unassigned
    • Reporter: Mathieu DESPRIEE
    • Votes: 0
    • Watchers: 5
