Nutch / NUTCH-1315

Reduce speculation is on, but ParseOutputFormat doesn't name output files correctly?

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4
    • Fix Version/s: 1.11
    • Component/s: parser
    • Labels:
    • Environment: Ubuntu 64-bit, Hadoop 1.0.1, 3-node cluster, segment size 1.5M URLs

      Description

      From time to time the reducer log contains the following exception, and one tasktracker gets blacklisted.

      org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/test/crawl/segments/20120316065507/parse_text/part-00001/data for DFSClient_attempt_201203151054_0028_r_000001_1 on client xx.x.xx.xx.10, because this file is already being created by DFSClient_attempt_201203151054_0028_r_000001_0 on xx.xx.xx.9
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1404)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1244)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1186)
      at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
      at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

      at org.apache.hadoop.ipc.Client.call(Client.java:1066)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
      at $Proxy2.create(Unknown Source)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
      at $Proxy2.create(Unknown Source)
      at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3245)
      at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
      at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182)
      at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
      at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1132)
      at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
      at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354)
      at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476)
      at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:157)
      at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:134)
      at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:92)
      at org.apache.nutch.parse.ParseOutputFormat.getRecordWriter(ParseOutputFormat.java:110)
      at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:448)
      at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
      at org.apache.hadoop.mapred.Child.main(Child.java:249)
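
The two client names in the exception message differ only in their trailing attempt number (`_0` and `_1`), which suggests that two attempts of the same reduce task tried to create the same file. A minimal sketch of that comparison in plain Java follows; the class and method names (`AttemptIds`, `taskOf`, `sameTaskDifferentAttempt`) are made up for illustration and are not part of Hadoop or Nutch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttemptIds {
    // Layout seen in the log: attempt_<jobtracker id>_<job number>_<m|r>_<task number>_<attempt number>
    private static final Pattern ATTEMPT =
        Pattern.compile("attempt_(\\d+)_(\\d+)_([mr])_(\\d+)_(\\d+)");

    private static Matcher match(String text) {
        Matcher m = ATTEMPT.matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("no attempt ID in: " + text);
        }
        return m;
    }

    /** Returns the task part of an attempt ID (everything except the attempt number). */
    public static String taskOf(String text) {
        Matcher m = match(text);
        return "task_" + m.group(1) + "_" + m.group(2) + "_" + m.group(3) + "_" + m.group(4);
    }

    /** Returns the trailing attempt number. */
    public static int attemptNumberOf(String text) {
        return Integer.parseInt(match(text).group(5));
    }

    /** True when both strings name the same task but different attempts of it. */
    public static boolean sameTaskDifferentAttempt(String a, String b) {
        return taskOf(a).equals(taskOf(b)) && attemptNumberOf(a) != attemptNumberOf(b);
    }

    public static void main(String[] args) {
        // Client names copied from the exception message in this report.
        String holder = "DFSClient_attempt_201203151054_0028_r_000001_0";
        String failed = "DFSClient_attempt_201203151054_0028_r_000001_1";
        System.out.println(taskOf(holder));                           // task_201203151054_0028_r_000001
        System.out.println(sameTaskDifferentAttempt(holder, failed)); // true
    }
}
```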

      I asked the hdfs-user mailing list and I got the following answer:

      "Looks like you have reduce speculation turned on, but the
      ParseOutputFormat you're using doesn't properly name its output files
      distinctly based on the task attempt ID. As a workaround you can
      probably turn off speculative execution for reduces, but you should
      also probably file a Nutch bug."
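
One way to read the suggestion in the quoted answer: each attempt writes under a directory that includes its own attempt ID, so two attempts of the same task never ask for the same file. The sketch below only illustrates the naming idea with plain strings. `AttemptScopedPaths`, `sharedPath`, and `workPath` are hypothetical names, and the `_temporary/_<attempt>` layout is an assumption modeled on how Hadoop's old `mapred` API describes task work directories; it is not the actual ParseOutputFormat code. In real job code, `FileOutputFormat.getWorkOutputPath(JobConf)` would probably be the place to get such a directory.

```java
import java.util.Locale;

public class AttemptScopedPaths {
    /** Formats a partition number the way the path in this report shows it, e.g. part-00001. */
    static String partName(int partition) {
        return String.format(Locale.ROOT, "part-%05d", partition);
    }

    /** Path as seen in the exception: identical for every attempt of the same task. */
    public static String sharedPath(String segmentDir, String subDir, int partition) {
        return segmentDir + "/" + subDir + "/" + partName(partition) + "/data";
    }

    /** Hypothetical attempt-scoped path: distinct for each attempt of the same task. */
    public static String workPath(String segmentDir, String attemptId, String subDir, int partition) {
        return segmentDir + "/_temporary/_" + attemptId + "/" + subDir + "/"
            + partName(partition) + "/data";
    }

    public static void main(String[] args) {
        String segment = "/user/test/crawl/segments/20120316065507";
        String first = "attempt_201203151054_0028_r_000001_0";
        String second = "attempt_201203151054_0028_r_000001_1";

        // Both attempts would target this one file, hence the collision.
        System.out.println(sharedPath(segment, "parse_text", 1));

        // With the attempt ID in the path, the two targets differ.
        System.out.println(workPath(segment, first, "parse_text", 1));
        System.out.println(workPath(segment, second, "parse_text", 1));
    }
}
```

With a layout like this, only the files of the attempt that succeeds would be moved to the final `part-00001` location; the other attempt's directory can be discarded.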

        Activity

        Field: Original Value -> New Value

        Lewis John McGibbney made changes -
        Fix Version/s: 1.10 [ 12327187 ] -> 1.11 [ 12329358 ]
        Julien Nioche made changes -
        Fix Version/s: 1.9 [ 12324611 ] -> 1.10 [ 12327187 ]
        Lewis John McGibbney made changes -
        Fix Version/s: 1.7 [ 12323281 ] -> 1.9 [ 12324611 ]
        Lewis John McGibbney made changes -
        Fix Version/s: (none) -> 1.7 [ 12323281 ]
        Rafael created issue -

          People

          • Assignee: Unassigned
          • Reporter: Rafael
          • Votes: 0
          • Watchers: 0