
Key: NUTCH-322
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Andrzej Bialecki
Reporter: Andrzej Bialecki
Votes: 2
Watchers: 3
Nutch

Fetcher discards ProtocolStatus, doesn't store redirected pages

Created: 19/Jul/06 12:10 PM   Updated: 28/Dec/06 12:16 AM
Component/s: fetcher
Affects Version/s: 0.8
Fix Version/s: 0.9.0

Time Tracking:
Not Specified

Issue Links:
Incorporates
 

Resolution Date: 28/Dec/06 12:16 AM


Description
Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus contains important information, such as protocol-level response code, lastModified time, and possibly other messages.

I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In addition, if ProtocolStatus contains a valid lastModified time, that CrawlDatum's modified time should also be set to this value.
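The proposal above can be illustrated with a minimal, self-contained sketch. The classes `SketchProtocolStatus`, `SketchCrawlDatum` and `FetchOutputSketch`, the metadata key, and the convention that a `lastModified` of 0 means "not reported" are all assumptions made for illustration; they are simplified stand-ins, not the actual Nutch classes or the code that was eventually committed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ProtocolStatus: only the fields named in the description.
final class SketchProtocolStatus {
    final int code;           // protocol-level response code, e.g. 200 or 301
    final long lastModified;  // epoch millis; 0 is assumed to mean "not reported"
    final String message;     // optional protocol-level message

    SketchProtocolStatus(int code, long lastModified, String message) {
        this.code = code;
        this.lastModified = lastModified;
        this.message = message;
    }
}

// Hypothetical stand-in for CrawlDatum: a metadata map plus a modified time.
final class SketchCrawlDatum {
    final Map<String, Object> metaData = new HashMap<>();
    long modifiedTime;
}

final class FetchOutputSketch {
    // Assumed key name, chosen for this sketch only.
    static final String PROTOCOL_STATUS_KEY = "protocol.status";

    // Attach the protocol status to the datum before it is written to crawl_fetch,
    // and copy lastModified into the datum only when the protocol reported one.
    static void recordStatus(SketchCrawlDatum datum, SketchProtocolStatus status) {
        datum.metaData.put(PROTOCOL_STATUS_KEY, status);
        if (status.lastModified > 0) {
            datum.modifiedTime = status.lastModified;
        }
    }
}
```

In this sketch a status without a usable lastModified value leaves the datum's existing modified time untouched, which matches the "if ProtocolStatus contains a valid lastModified time" condition in the proposal.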

Additionally, Fetcher doesn't store redirected pages. Content of such pages is silently discarded. When Fetcher translates from protocol-level status to crawldb-level status it should probably store such pages with the following translation of status codes:

  • ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code indicates a transient change, so we probably shouldn't mark the initial URL as bad.
  • ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a permanent change, so the initial URL is no longer valid, i.e. it will always result in redirects.
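The translation proposed in the two bullets above could be sketched as follows. The enums `SketchProtocolCode` and `SketchDbStatus` and the class `StatusTranslationSketch` are hypothetical stand-ins for the constants on ProtocolStatus and CrawlDatum, and only the two mappings stated in this description are covered. Note that the commit log for this issue mentions new status codes for temporary and permanent redirection, so the committed behavior may differ from this proposal.

```java
// Hypothetical stand-ins for the protocol-level and crawldb-level status constants.
enum SketchProtocolCode { SUCCESS, MOVED, TEMP_MOVED }

enum SketchDbStatus { DB_RETRY, DB_GONE }

final class StatusTranslationSketch {
    // Translate a redirect status as proposed in the issue description.
    static SketchDbStatus toDbStatus(SketchProtocolCode code) {
        switch (code) {
            case TEMP_MOVED:
                // Transient change: retry the original URL later.
                return SketchDbStatus.DB_RETRY;
            case MOVED:
                // Permanent change: the original URL will always redirect.
                return SketchDbStatus.DB_GONE;
            default:
                // Other codes are outside the scope of this sketch.
                throw new IllegalArgumentException("not a redirect status: " + code);
        }
    }
}
```

For example, `StatusTranslationSketch.toDbStatus(SketchProtocolCode.TEMP_MOVED)` yields `DB_RETRY`, while a non-redirect code such as `SUCCESS` is rejected rather than guessed at.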


Subversion Commits

Repository: ASF
Revision: #490607
Date: Thu Dec 28 00:03:04 UTC 2006
User: ab
Message:
This patch addresses several issues:

* NUTCH-415 - Generator should mark selected records in CrawlDb.
  Due to increased resource consumption this step is optional.
  Application-level locking has been added to prevent concurrent
  modification of databases.

* NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is
  now possible to correctly update CrawlDb from multiple segments.
  Introduce new status codes for temporary and permanent
  redirection.

* NUTCH-322 - Fix Fetcher to store redirected pages and to store
  protocol-level status. This also should fix NUTCH-273.

* Change default Fetcher behavior not to follow redirects immediately.
  Instead Fetcher will record redirects as new pages to be added to CrawlDb.
  This also partially addresses NUTCH-273.

* Detect and report when Generator creates 0-sized segments.

* Fix Injector to preserve already existing CrawlDatum if the seed list
  being injected also contains such URL.

This development was partially supported by SiteSell Inc.
Files Changed
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/tools/compat/CrawlDbConverter.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/Indexer.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestInjector.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestGenerator.java
MODIFY /lucene/nutch/trunk/CHANGES.txt
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/metadata/Nutch.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
MODIFY /lucene/nutch/trunk/conf/nutch-default.xml
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/util/LockUtil.java