Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-1098

output blocks lost when speculative execution is used

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Duplicate
    • 0.12.0
    • 0.13.0
    • None
    • None

    Description

      The Sort benchmark completes successfully for me on the latest trunk (0.12.1 candidate) with speculation turned on. Validation of the Sort benchmark output, however, is failing. I see one sort output file (part-00375) that is way smaller than all the rest. In fact, it is exactly 1 block long.

      dfs ls output:
      ...
      /user/hadoopqa/sortBenchmark100/output/part-00373 <r 3> 2971688212
      /user/hadoopqa/sortBenchmark100/output/part-00374 <r 3> 2973451660
      /user/hadoopqa/sortBenchmark100/output/part-00375 <r 3> 134217728
      /user/hadoopqa/sortBenchmark100/output/part-00376 <r 3> 2972933208
      /user/hadoopqa/sortBenchmark100/output/part-00377 <r 3> 2972309956
      ...

      During the Sort Benchmark, I see 9 AlreadyBeingCreatedExceptions in the NameNode log for this file (and more of these exceptions for other files too). I also include here the 1 PendingReplicationMonitor WARN message from the NameNode log in case it's relevant:
      ...
      2007-03-08 21:56:31,747 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_-849195508701590166
      ...
      2007-03-08 22:04:35,471 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.startFile: failed to create file /user/hadoopqa/sortBenchmark100/output/part-00375 for DFSClient_task_0002_r_000375_1 on client 72.30.38.16 because pendingCreates is non-null.
      2007-03-08 22:04:35,476 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020 call error: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /user/hadoopqa/sortBenchmark100/output/part-00375 for DFSClient_task_0002_r_000375_1 on client 72.30.38.16 because pendingCreates is non-null.
      org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /user/hadoopqa/sortBenchmark100/output/part-00375 for DFSClient_task_0002_r_000375_1 on client 72.30.38.16 because pendingCreates is non-null.
      at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:701)
      at org.apache.hadoop.dfs.NameNode.create(NameNode.java:283)
      at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:336)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:559)
      ...

      During sort validation, I get this exception in the JobTracker log:
      2007-03-08 22:51:32,017 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0003_m_001379_0: java.io.EOFException
      at java.io.DataInputStream.readFully(DataInputStream.java:180)
      at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
      at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
      at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1525)
      at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1436)
      at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1482)
      at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:72)
      at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)

      I also saw this in the DataNode log during sort validation, but it could be unrelated:
      2007-03-09 01:04:41,323 WARN org.apache.hadoop.dfs.DataNode: java.io.IOException: Unexpected error trying to delete block blk_-5047673597270588432. Block not found in blockMap.
      at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:596)
      at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:429)
      at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1053)
      at java.lang.Thread.run(Thread.java:619)

      Since the default for 0.12.1 will be for speculative execution to be turned off, I am assigning this to 0.13.0

      Attachments

        Activity

          People

            Unassigned Unassigned
            nidaley Nigel Daley
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: