[NUTCH-2279] LinkRank fails when using Hadoop MR output compression

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.12
Fix Version/s: 1.16
Component/s: webgraph
Labels:
- patch-available

Description

When MapReduce job output compression is enabled (mapreduce.output.fileoutputformat.compress=true), LinkRank cannot read the results of its Counter MR job, because the compression codec appends a file extension to the output file name.

For example, using the default compression codec (which appears to be DEFLATE), the counter file is written to crawl/webgraph/_num_nodes_/part-00000.deflate. The LinkRank job then tries to read this file directly to obtain the number of links, using the following code:

FSDataInputStream readLinks = fs.open(new Path(numLinksPath, "part-00000"));

which fails because the file part-00000 doesn't exist:

LinkAnalysis: java.io.FileNotFoundException: File crawl/webgraph/_num_nodes_/part-00000 does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
        at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:124)
        at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:633)
        at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:713)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:680)

To reproduce, add -D mapreduce.output.fileoutputformat.compress=true to the properties for bin/nutch linkrank ...
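
A possible way to avoid hard-coding the file name is to look up the part file by prefix and decompress it when it carries a codec extension. The following is only a minimal sketch using plain java.nio and java.util.zip instead of the Hadoop FileSystem API; it is not the change made in the linked pull request. The class name CounterFileReader and the method readFirstLine are hypothetical, and the sketch assumes that .deflate files are zlib-wrapped DEFLATE streams and that only the first line of the file is needed.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, for illustration only; not part of Nutch or Hadoop.
public class CounterFileReader {

  /**
   * Returns the first line of the first file in dir whose name starts with
   * prefix, e.g. "part-00000" or "part-00000.deflate".
   */
  public static String readFirstLine(Path dir, String prefix) throws IOException {
    // The glob matches the plain file name as well as any added extension.
    // Hidden checksum files such as ".part-00000.crc" do not match the prefix.
    try (DirectoryStream<Path> matches = Files.newDirectoryStream(dir, prefix + "*")) {
      for (Path file : matches) {
        InputStream in = Files.newInputStream(file);
        // Assumption: a ".deflate" extension means zlib-wrapped DEFLATE data.
        if (file.getFileName().toString().endsWith(".deflate")) {
          in = new InflaterInputStream(in);
        }
        try (BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
          return reader.readLine();
        }
      }
    }
    throw new FileNotFoundException("No file starting with " + prefix + " in " + dir);
  }
}
```

In Hadoop itself, the analogous approach would presumably list the output directory and pick a decompressor based on the file extension, so that the read works for any configured codec rather than just DEFLATE.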

Issue Links

links to: GitHub Pull Request #478

People

Assignee: Sebastian Nagel
Reporter: Joseph Naegele
Votes: 0
Watchers: 4

Dates

Created: 14/Jun/16 19:33
Updated: 28/Jan/21 13:56
Resolved: 01/Oct/19 14:24