Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
1.4, 1.5
None
None
Patch Available
Description
The -get option of the SegmentReader (nutch readseg) may show data that belongs to a different URL than the one given.
To reproduce:
% mkdir -p test_readseg/urls
% echo -e "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0" > test_readseg/urls/seeds
% nutch inject test_readseg/crawldb test_readseg/urls
Injector: starting at 2012-01-18 09:32:25
Injector: crawlDb: test_readseg/crawldb
Injector: urlDir: test_readseg/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03
% nutch generate test_readseg/crawldb test_readseg/segments/
Generator: starting at 2012-01-18 09:32:30
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: test_readseg/segments/20120118093232
Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03
% nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' -nocontent -noparse -nofetch -noparsedata -noparsetext
SegmentReader: get 'http://nutch.apache.org/'
Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Jan 18 09:32:26 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 10.0
Signature: null
Metadata: _ngt_: 1326875550401test: AbcTest
The metadata (test: AbcTest) and the score (10.0) match the values injected for http://abc.test/, not those for http://nutch.apache.org/. The CrawlDatum shown is therefore the wrong one: it belongs to http://abc.test/ rather than to the requested URL.
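
To make the mismatch easier to spot, the injected values for a URL can be read back from the seed file and compared by eye with the readseg output. The following is only a minimal sketch: the helper name seed_meta is hypothetical (it is not part of Nutch), and it assumes the tab-separated seed format used in the reproduction steps above.

```shell
# Hypothetical helper (not part of Nutch): print the metadata fields that
# were injected for one URL, one field per line.
# Assumes a tab-separated seed file: URL<TAB>key=value[<TAB>key=value...]
# Usage: seed_meta SEED_FILE URL
seed_meta() {
  awk -F '\t' -v url="$2" '
    $1 == url {
      # Fields after the URL are the injected key=value pairs.
      for (i = 2; i <= NF; i++) print $i
    }
  ' "$1"
}
```

With the seed file from the reproduction steps, seed_meta test_readseg/urls/seeds 'http://nutch.apache.org/' prints test=ApacheNutch, whereas the readseg output above shows test: AbcTest and the score 10.0 that was injected for http://abc.test/.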