Nutch
NUTCH-1315

Reduce speculation on, but ParseOutputFormat doesn't name output files correctly?

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4
    • Fix Version/s: 1.10
    • Component/s: parser
    • Labels:
    • Environment:

      Ubuntu 64-bit, Hadoop 1.0.1, 3-node cluster, segment size 1.5M URLs

      Description

      From time to time the reducer log contains the following, and one TaskTracker gets blacklisted.

      org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/test/crawl/segments/20120316065507/parse_text/part-00001/data for DFSClient_attempt_201203151054_0028_r_000001_1 on client xx.x.xx.xx.10, because this file is already being created by DFSClient_attempt_201203151054_0028_r_000001_0 on xx.xx.xx.9
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1404)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1244)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1186)
      at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
      at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

      at org.apache.hadoop.ipc.Client.call(Client.java:1066)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
      at $Proxy2.create(Unknown Source)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
      at $Proxy2.create(Unknown Source)
      at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3245)
      at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
      at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182)
      at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
      at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1132)
      at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
      at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354)
      at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476)
      at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:157)
      at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:134)
      at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:92)
      at org.apache.nutch.parse.ParseOutputFormat.getRecordWriter(ParseOutputFormat.java:110)
      at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:448)
      at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
      at org.apache.hadoop.mapred.Child.main(Child.java:249)

      I asked the hdfs-user mailing list and I got the following answer:

      "Looks like you have reduce speculation turned on, but the
      ParseOutputFormat you're using doesn't properly name its output files
      distinctly based on the task attempt ID. As a workaround you can
      probably turn off speculative execution for reduces, but you should
      also probably file a Nutch bug."
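
      A minimal sketch of that workaround, assuming the old mapred API that Nutch 1.4 runs under Hadoop 1.0.x; the job name is illustrative, and the setter maps to the Hadoop 1.x property mapred.reduce.tasks.speculative.execution:

      import org.apache.hadoop.mapred.JobConf;

      public class DisableReduceSpeculation {
        public static void main(String[] args) {
          JobConf job = new JobConf();
          job.setJobName("parse"); // illustrative job name
          // Same effect as setting mapred.reduce.tasks.speculative.execution
          // to false in mapred-site.xml; map-side speculation stays enabled.
          job.setReduceSpeculativeExecution(false);
        }
      }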

        Activity

        Lewis John McGibbney added a comment -

        Regarding your comment, e.g. that Nutch does not turn on reduce speculation, my initial thought is no. I will try to confirm/iron this out. Do you have any speculation settings configured for Hadoop at all?

        Markus Jelsma added a comment -

        Speculative task execution is enabled by default, but the fetch and index jobs disable it. We disabled speculative execution altogether at some point, only because we need those slots to be free for other jobs.

        Should extended OutputFormats take care of this? It isn't clear from the MapRed API docs whether this is a problem: the name parameter is supposed to be unique for the task's part of the output across the entire job, which it is.

        Wouldn't including a task ID in the output name cause a mess in the final output?

        In the meantime I would indeed disable speculative execution. In my opinion, and in my experience with Nutch and other jobs, it's not really worth it: it takes empty slots that you could use for other jobs, and if there are no other jobs it still costs additional CPU cycles, RAM, and disk I/O for a few seconds. I must add that our network is homogeneous (the classic fallacy notwithstanding) and all nodes have almost equal load.
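
        For reference, a minimal sketch of the attempt-scoped naming the mailing list reply suggests, written against the old mapred API on Hadoop 1.x. The parse_text layout and the ParseText class are taken from the stack trace above; the helper itself is hypothetical and is not the actual ParseOutputFormat code:

        import java.io.IOException;

        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.MapFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.nutch.parse.ParseText;

        public class AttemptScopedParseTextWriter {

          // Hypothetical helper: opens <work-dir>/parse_text/<name>, where
          // <work-dir> is the task attempt's temporary directory, i.e.
          // <output>/_temporary/_<attempt-id>. Each speculative attempt then
          // creates its own HDFS file, and the FileOutputCommitter promotes
          // only the winning attempt's files into the final segment, so two
          // attempts never race to create the same path (the
          // AlreadyBeingCreatedException above).
          static MapFile.Writer open(JobConf job, String name)
              throws IOException {
            Path workDir = FileOutputFormat.getTaskOutputPath(job, "parse_text");
            Path part = new Path(workDir, name);
            FileSystem fs = part.getFileSystem(job);
            return new MapFile.Writer(job, fs, part.toString(),
                Text.class, ParseText.class);
          }
        }

        Because the committer renames the winning attempt's directory into place, the final output keeps the plain part-NNNNN names, which also answers the concern about task IDs making a mess of the final output.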


          People

          • Assignee:
            Unassigned
          • Reporter:
            Rafael
          • Votes:
            0
          • Watchers:
            0

            Dates

            • Created:
              Updated:
