Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Version: 1.12
Description
When using MapReduce job output compression, i.e. mapreduce.output.fileoutputformat.compress=true, LinkRank can't read the results of its Counter MR job due to the additional, generated file extension.
For example, using the default compression codec (which appears to be DEFLATE), the counter file is written to crawl/webgraph/_num_nodes_/part-00000.deflate. The LinkRank job then tries to read this file directly, under its uncompressed name, to obtain the number of links:
FSDataInputStream readLinks = fs.open(new Path(numLinksPath, "part-00000"));
which fails because the file part-00000 doesn't exist:
LinkAnalysis: java.io.FileNotFoundException: File crawl/webgraph/_num_nodes_/part-00000 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
    at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:124)
    at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:633)
    at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:713)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:680)
To reproduce, add -D mapreduce.output.fileoutputformat.compress=true to the properties for bin/nutch linkrank ...
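One possible direction for a fix, sketched here with plain JDK classes so it runs without a Hadoop installation: look up the part file with or without a codec extension, and decompress it when the extension is .deflate. This is not the patch that resolved the issue. The class and method names (PartFileReader, findPartFile, readFirstLine) are hypothetical, and the sketch assumes the .deflate output is a zlib stream that java.util.zip.InflaterInputStream can read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of Nutch or Hadoop.
public class PartFileReader {

  /**
   * Returns the file named baseName, or baseName plus an extension
   * (for example "part-00000.deflate"), inside dir.
   */
  static Path findPartFile(Path dir, String baseName) throws IOException {
    try (DirectoryStream<Path> candidates =
        Files.newDirectoryStream(dir, baseName + "*")) {
      for (Path candidate : candidates) {
        String name = candidate.getFileName().toString();
        if (name.equals(baseName) || name.startsWith(baseName + ".")) {
          return candidate;
        }
      }
    }
    throw new NoSuchFileException(dir.resolve(baseName).toString());
  }

  /**
   * Reads the first line of the file, inflating it first when the name ends
   * in ".deflate". Assumption: the data is in zlib format.
   */
  static String readFirstLine(Path file) throws IOException {
    InputStream in = Files.newInputStream(file);
    if (file.getFileName().toString().endsWith(".deflate")) {
      in = new InflaterInputStream(in);
    }
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      return reader.readLine();
    }
  }
}
```

Inside LinkRank itself, the equivalent would more likely go through Hadoop's own classes, for instance resolving the codec from the file name with CompressionCodecFactory.getCodec(Path) and wrapping the FSDataInputStream with the codec's createInputStream, which would also cover codecs other than DEFLATE.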