Nutch / NUTCH-2279

LinkRank fails when using Hadoop MR output compression

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12
    • Fix Version/s: 1.16
    • Component/s: webgraph

    Description

      When MapReduce job output compression is enabled, i.e. mapreduce.output.fileoutputformat.compress=true, LinkRank cannot read the output of its Counter MR job because the compression codec appends a file extension to the output file name.

      For example, using the default compression codec (which appears to be DEFLATE), the counter file is written to crawl/webgraph/_num_nodes_/part-00000.deflate. The LinkRank job then tries to read this file directly, under a hard-coded name, to obtain the number of links:

      FSDataInputStream readLinks = fs.open(new Path(numLinksPath, "part-00000"));
      

      which fails because the file part-00000 doesn't exist:

      LinkAnalysis: java.io.FileNotFoundException: File crawl/webgraph/_num_nodes_/part-00000 does not exist
              at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
              at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
              at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
              at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
              at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
              at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
              at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
              at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:124)
              at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:633)
              at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:713)
              at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
              at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:680)
      

      To reproduce, add -D mapreduce.output.fileoutputformat.compress=true to the options passed to bin/nutch linkrank ...
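
      One way the lookup could tolerate the codec extension is to search the output directory for part-00000 with or without a suffix, and to decompress when the suffix is .deflate. The following is only a sketch of that idea using the plain Java standard library on a local directory; it is not the fix that was committed to Nutch. The class PartFileReader and its methods are hypothetical, and the sketch assumes the .deflate file is a zlib stream that java.util.zip can read. With the Hadoop API, the equivalent lookup by file extension would normally go through CompressionCodecFactory.getCodec(Path).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper for illustration; not part of Nutch or Hadoop. */
public class PartFileReader {

  private static final String PART = "part-00000";

  /** Finds "part-00000" or "part-00000.<extension>" in the given directory. */
  public static Path findPartFile(Path dir) throws IOException {
    DirectoryStream.Filter<Path> filter = p -> {
      String name = p.getFileName().toString();
      return name.equals(PART) || name.startsWith(PART + ".");
    };
    try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, filter)) {
      for (Path file : files) {
        return file; // a single-reducer job is expected to write only one such file
      }
    }
    throw new NoSuchFileException(dir.toString(), null, "no " + PART + " file found");
  }

  /** Opens the file, inflating it when the name ends in ".deflate" (assumed zlib format). */
  public static InputStream open(Path file) throws IOException {
    InputStream raw = Files.newInputStream(file);
    if (file.getFileName().toString().endsWith(".deflate")) {
      return new InflaterInputStream(raw);
    }
    return raw;
  }

  /** Returns the first line of the part file, whether or not it is compressed. */
  public static String readFirstLine(Path dir) throws IOException {
    Path part = findPartFile(dir);
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(open(part), StandardCharsets.UTF_8))) {
      return reader.readLine();
    }
  }

  public static void main(String[] args) throws IOException {
    // Simulate a compressed job output in a temporary directory.
    Path dir = Files.createTempDirectory("num_nodes");
    try (OutputStream out = new DeflaterOutputStream(
        Files.newOutputStream(dir.resolve(PART + ".deflate")))) {
      out.write("example line\n".getBytes(StandardCharsets.UTF_8));
    }
    System.out.println(readFirstLine(dir)); // prints the line written above
  }
}
```

      The sketch only recognizes the .deflate extension. Other codecs add different extensions and would need their own stream wrappers, which is why a codec lookup keyed on the file name is the more general approach.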

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Joseph Naegele (naegelejd)