[NUTCH-2570] Deduplication job fails to install deduplicated CrawlDb - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.15
Fix Version/s: 1.15
Component/s: crawldb
Labels: None

Description

The DeduplicationJob ("nutch dedup") fails to install the deduplicated CrawlDb and leaves only the "old" crawldb (if "db.preserve.backup" is true):

% tree crawldb
crawldb
├── current
│   └── part-r-00000
│       ├── data
│       └── index
└── old
    └── part-r-00000
        ├── data
        └── index
% bin/nutch dedup crawldb
DeduplicationJob: starting at 2018-04-22 21:48:08
Deduplication: 6 documents marked as duplicates
Deduplication: Updating status of duplicate urls into crawl db.
Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/crawldb/1742327020 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
    at org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:374)
    at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:613)
    at org.apache.nutch.util.FSUtils.replace(FSUtils.java:58)
    at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:212)
    at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:225)
    at org.apache.nutch.crawl.DeduplicationJob.run(DeduplicationJob.java:366)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.DeduplicationJob.main(DeduplicationJob.java:379)

% tree crawldb
crawldb
└── old
    └── part-r-00000
        ├── data
        └── index

In pseudo-distributed mode it's even worse: only the "old" CrawlDb is left, and no error is reported.
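
To illustrate the failure mode: the trace shows the install step failing on a rename of a temporary directory that does not exist, after "current" has already been moved to "old". Below is a minimal sketch of an install step that checks for the new directory first, so that "current" is left untouched on failure. It is not the Nutch implementation and not the fix from the linked pull request. The class CrawlDbInstallSketch and its methods are hypothetical, it uses java.nio.file on a local filesystem instead of the Hadoop FileSystem API, and it always keeps a backup, as if "db.preserve.backup" were true.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch only; not the Nutch CrawlDb.install code. */
public class CrawlDbInstallSketch {

  /**
   * Installs newDb as crawlDb/current and keeps the previous version as
   * crawlDb/old. Assumes all paths are on the same local filesystem, so
   * that moving a non-empty directory is a plain rename.
   */
  public static void install(Path crawlDb, Path newDb) throws IOException {
    if (!Files.isDirectory(newDb)) {
      // Fail before anything is moved, so "current" stays in place.
      throw new NoSuchFileException(newDb.toString());
    }
    Path current = crawlDb.resolve("current");
    Path old = crawlDb.resolve("old");
    deleteRecursively(old);
    if (Files.exists(current)) {
      Files.move(current, old);
    }
    Files.move(newDb, current);
  }

  /** Deletes a directory tree if it exists. */
  public static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(dir)) {
      // Reverse order lists children before their parent directories.
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
  }

  public static void main(String[] args) throws IOException {
    Path crawlDb = Files.createTempDirectory("crawldb");
    Files.createDirectories(crawlDb.resolve("current").resolve("part-r-00000"));
    try {
      // The temporary output directory is missing, as in the report.
      install(crawlDb, crawlDb.resolve("1742327020"));
    } catch (NoSuchFileException e) {
      System.out.println("install refused, missing: " + e.getMessage());
    }
    System.out.println("current still present: "
        + Files.isDirectory(crawlDb.resolve("current")));
  }
}
```

With this ordering a missing temporary directory still raises an error, but the CrawlDb keeps its "current" version instead of ending up with only "old".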

Issue Links

links to

GitHub Pull Request #323

People

Assignee: Sebastian Nagel

Reporter: Sebastian Nagel

Votes: 0

Watchers: 3

Dates

Created: 22/Apr/18 19:57

Updated: 01/Oct/19 14:29

Resolved: 26/Apr/18 11:14