Key: NUTCH-416
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Andrzej Bialecki
Reporter: Andrzej Bialecki
Votes: 0
Watchers: 0
Nutch

CrawlDatum status and CrawlDbReducer refactoring

Created: 15/Dec/06 12:45 PM   Updated: 28/Dec/06 12:13 AM
Component/s: None
Affects Version/s: 0.9.0
Fix Version/s: 0.9.0

Time Tracking:
Not Specified

Resolution Date: 28/Dec/06 12:13 AM


Description
CrawlDatum needs more status codes, e.g. to reflect redirected pages. However, the current status code values are consecutive, which prevents us from adding new codes in their proper places. This is also related to the logic in CrawlDbReducer, which makes decisions based on the arithmetic ordering of status code values.

I propose to change the codes so that they are grouped into related values, with significant gaps between groups for adding new codes without causing significant reordering. I also propose to change the logic in CrawlDbReducer so that its operation is not so dependent on actual code values.

A mapping should also be added between old and new codes to facilitate backward-compatibility of existing data. This mapping should be applied on the fly, without requiring explicit data conversion.
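
As a rough illustration of such an on-the-fly mapping, the sketch below translates a legacy status byte into a regrouped one at read time. The new values are the ones quoted in the patch-in-progress comment in this issue. The legacy values (0 through 7, in the order shown) and the names StatusCompat and fromLegacy are assumptions made for the example, not the actual Nutch code.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: translates legacy status bytes to the regrouped values while reading. */
final class StatusCompat {
    // New grouped values, taken from the patch-in-progress comment in this issue.
    static final byte STATUS_DB_UNFETCHED = 0x01;
    static final byte STATUS_DB_FETCHED = 0x02;
    static final byte STATUS_DB_GONE = 0x03;
    static final byte STATUS_FETCH_SUCCESS = 0x21;
    static final byte STATUS_FETCH_RETRY = 0x22;
    static final byte STATUS_FETCH_GONE = 0x25;
    static final byte STATUS_SIGNATURE = 0x41;
    static final byte STATUS_LINKED = 0x43;

    // ASSUMPTION: the legacy codes were the consecutive values 0..7 in this order.
    // Verify against the release that wrote the data before relying on them.
    private static final Map<Byte, Byte> OLD_TO_NEW = new HashMap<>();
    static {
        OLD_TO_NEW.put((byte) 0, STATUS_SIGNATURE);
        OLD_TO_NEW.put((byte) 1, STATUS_DB_UNFETCHED);
        OLD_TO_NEW.put((byte) 2, STATUS_DB_FETCHED);
        OLD_TO_NEW.put((byte) 3, STATUS_DB_GONE);
        OLD_TO_NEW.put((byte) 4, STATUS_LINKED);
        OLD_TO_NEW.put((byte) 5, STATUS_FETCH_SUCCESS);
        OLD_TO_NEW.put((byte) 6, STATUS_FETCH_RETRY);
        OLD_TO_NEW.put((byte) 7, STATUS_FETCH_GONE);
    }

    /** Returns the new code for a legacy code; fails loudly on an unknown value. */
    static byte fromLegacy(byte oldStatus) {
        Byte mapped = OLD_TO_NEW.get(oldStatus);
        if (mapped == null) {
            throw new IllegalArgumentException("Unknown legacy status: " + oldStatus);
        }
        return mapped;
    }
}
```

A reader would call fromLegacy only for records it recognizes as written in the old format (for instance by a version marker in the serialized record, which is also an assumption here), so existing data never has to be rewritten.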



Doug Cook added a comment - 20/Dec/06 10:39 PM
You may also want to make the status codes ORed values, so that, for example, every kind of failure has a FAILED bit ORed in, making it clean & easy in the code to check for "any failure case" while still allowing different failure codes. So at the lowest levels, the values might be things like FAILED, FETCHED, and UNFETCHED, while REDIRECT might be (FETCHED | something), specific redirect codes would be (REDIRECT | something), specific failure codes would be (FAILED | something), etc. This way we can keep all of the specific failure codes, all the specific redirect codes, etc. while making the code cleaner and more reliable. We won't have to worry about keeping range checks or switch statements in sync if we add new codes; a statement like
if ((code & FAILED) != 0) {
}
will always tell us whether a URL fetch failed, regardless of how many codes we add. The way the code currently is, adding status codes is likely to break things if one is not careful to go through every place where status codes are examined to ensure that the new code is properly accounted for.
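
To make the ORed-value idea concrete, here is a minimal sketch. Every name and bit assignment in it is hypothetical and chosen only for illustration; it uses int constants to sidestep signed-byte arithmetic. Note that these few flags already occupy all eight bits of a byte.

```java
/** Sketch only: status codes built by ORing category bits with specific bits. */
final class StatusFlags {
    // Base categories, one bit each (hypothetical assignments).
    static final int UNFETCHED = 0x01;
    static final int FETCHED = 0x02;
    static final int FAILED = 0x04;

    // A redirect is a kind of fetched page, so it carries the FETCHED bit.
    static final int REDIRECT = FETCHED | 0x08;
    static final int REDIRECT_TEMP = REDIRECT | 0x10;
    static final int REDIRECT_PERM = REDIRECT | 0x20;

    // Specific failures all carry the FAILED bit.
    static final int FAILED_TIMEOUT = FAILED | 0x40;
    static final int FAILED_NOT_FOUND = FAILED | 0x80;

    /** True for any code that has the FAILED bit set, however many are added later. */
    static boolean isFailure(int code) {
        return (code & FAILED) != 0;
    }

    /** True only if every bit of REDIRECT is present in the code. */
    static boolean isRedirect(int code) {
        return (code & REDIRECT) == REDIRECT;
    }

    public static void main(String[] args) {
        System.out.println(isFailure(FAILED_TIMEOUT));
        System.out.println(isFailure(REDIRECT_PERM));
        System.out.println(isRedirect(REDIRECT_TEMP));
    }
}
```

The parentheses around (code & FAILED) matter in Java, because != binds more tightly than &.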

While you're changing the CrawlDatum, it might also make sense to store a second URL, e.g. that of the redirect target. I have a hunch this will be very useful.

Just some thoughts. Thanks for making this happen.

Doug


Andrzej Bialecki added a comment - 20/Dec/06 11:17 PM
There are two main distinct groups of status codes, but not along the lines of success/failure - these are DB and Fetch status codes. Additionally, the number of available bits for a bitmask is very small, because the status needs to fit in a byte.

My patch in progress contains the following now:

public static final byte STATUS_DB_UNFETCHED = 0x01;
public static final byte STATUS_DB_FETCHED = 0x02;
public static final byte STATUS_DB_GONE = 0x03;
public static final byte STATUS_DB_REDIR_TEMP = 0x04;
public static final byte STATUS_DB_REDIR_PERM = 0x05;

/** Maximum value of DB-related status. */
public static final byte STATUS_DB_MAX = 0x1f;

public static final byte STATUS_FETCH_SUCCESS = 0x21;
public static final byte STATUS_FETCH_RETRY = 0x22;
public static final byte STATUS_FETCH_REDIR_TEMP = 0x23;
public static final byte STATUS_FETCH_REDIR_PERM = 0x24;
public static final byte STATUS_FETCH_GONE = 0x25;

/** Maximum value of fetch-related status. */
public static final byte STATUS_FETCH_MAX = 0x3f;

public static final byte STATUS_SIGNATURE = 0x41;
public static final byte STATUS_INJECTED = 0x42;
public static final byte STATUS_LINKED = 0x43;

public static boolean hasDbStatus(CrawlDatum datum) {
  // Everything up to STATUS_DB_MAX belongs to the DB group.
  return datum.status <= STATUS_DB_MAX;
}

public static boolean hasFetchStatus(CrawlDatum datum) {
  // Values above the DB group and up to STATUS_FETCH_MAX belong to the fetch group.
  return datum.status > STATUS_DB_MAX && datum.status <= STATUS_FETCH_MAX;
}

... so, I went with ranges of values. The most unwieldy switch() statements in the current code were the ones that distinguished DB from Fetch status, and the above two static methods handle this and simplify the code.
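
The benefit of the range approach can be sketched as follows: a code added later inside a group's gap is classified correctly without touching the check. The StatusGroups class, the Group enum, and groupOf are hypothetical names for this example; only the two MAX constants come from the patch excerpt above.

```java
/** Sketch only: classifies a status byte by the range it falls into. */
final class StatusGroups {
    enum Group { DB, FETCH, OTHER }

    // Upper bounds of the two ranges, as quoted in the patch excerpt.
    static final byte STATUS_DB_MAX = 0x1f;
    static final byte STATUS_FETCH_MAX = 0x3f;

    static Group groupOf(byte status) {
        if (status <= STATUS_DB_MAX) {
            return Group.DB;
        }
        if (status <= STATUS_FETCH_MAX) {
            return Group.FETCH;
        }
        return Group.OTHER;
    }

    public static void main(String[] args) {
        // 0x06 is not defined in the excerpt; it stands for a DB code added later.
        System.out.println(groupOf((byte) 0x06));
        System.out.println(groupOf((byte) 0x23));
        System.out.println(groupOf((byte) 0x42));
    }
}
```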

Regarding the redirect URL - because of space constraints I'd rather use Metadata for this. We already handle metadata efficiently, so that performance doesn't suffer if we don't have any metadata to keep. It would make sense, though, to have a predefined key for this URL.
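
A minimal sketch of the metadata approach, assuming a plain java.util.Map as a stand-in for the real metadata container: the key name "_redirect_target_" and the DatumSketch class are hypothetical. The map is allocated only on first use, so records without a redirect carry no extra data.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: keeps the redirect target in optional metadata under a fixed key. */
final class DatumSketch {
    // HYPOTHETICAL predefined key; the real key name is not given in this issue.
    static final String REDIRECT_TARGET_KEY = "_redirect_target_";

    // Stays null until something is stored, so plain records pay nothing.
    private Map<String, String> metaData;

    void setRedirectTarget(String url) {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        metaData.put(REDIRECT_TARGET_KEY, url);
    }

    /** Returns the stored redirect target, or null if none was recorded. */
    String getRedirectTarget() {
        return metaData == null ? null : metaData.get(REDIRECT_TARGET_KEY);
    }

    boolean hasMetaData() {
        return metaData != null && !metaData.isEmpty();
    }
}
```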


Repository: ASF
Revision: #490607
Date: Thu Dec 28 00:03:04 UTC 2006
User: ab
Message:
This patch addresses several issues:

* NUTCH-415 - Generator should mark selected records in CrawlDb.
  Due to increased resource consumption this step is optional.
  Application-level locking has been added to prevent concurrent
  modification of databases.

* NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is
  now possible to correctly update CrawlDb from multiple segments.
  Introduce new status codes for temporary and permanent
  redirection.

* NUTCH-322 - Fix Fetcher to store redirected pages and to store
  protocol-level status. This also should fix NUTCH-273.

* Change default Fetcher behavior not to follow redirects immediately.
  Instead Fetcher will record redirects as new pages to be added to CrawlDb.
  This also partially addresses NUTCH-273.

* Detect and report when Generator creates 0-sized segments.

* Fix Injector to preserve already existing CrawlDatum if the seed list
  being injected also contains such URL.

This development was partially supported by SiteSell Inc.
Files Changed
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/tools/compat/CrawlDbConverter.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/Indexer.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestInjector.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestGenerator.java
MODIFY /lucene/nutch/trunk/CHANGES.txt
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/metadata/Nutch.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
MODIFY /lucene/nutch/trunk/conf/nutch-default.xml
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/util/LockUtil.java

Andrzej Bialecki added a comment - 28/Dec/06 12:13 AM
Fixed in trunk, rev. 490607. As a side effect it is now possible to correctly update CrawlDb from multiple segments, even if they contain duplicate pages: the code in CrawlDbReducer will correctly apply only the latest version of CrawlDatum.
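
One way to picture "apply only the latest version" is the sketch below. It assumes that "latest" means the entry with the greatest fetch time, which this issue does not state explicitly, and it uses a simplified stand-in class rather than the real CrawlDatum or the actual reducer code.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch only: picks the most recent of several entries for the same URL. */
final class LatestWins {
    /** Simplified stand-in for CrawlDatum with just the fields used here. */
    static final class Datum {
        final byte status;
        final long fetchTime;

        Datum(byte status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }
    }

    /** Returns the entry with the greatest fetchTime, or null for an empty list. */
    static Datum latest(List<Datum> values) {
        Datum newest = null;
        for (Datum d : values) {
            // ASSUMPTION: recency is decided by fetch time alone.
            if (newest == null || d.fetchTime > newest.fetchTime) {
                newest = d;
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        List<Datum> fromSegments = Arrays.asList(
            new Datum((byte) 0x22, 100L),   // STATUS_FETCH_RETRY from an older segment
            new Datum((byte) 0x21, 300L),   // STATUS_FETCH_SUCCESS from the newest segment
            new Datum((byte) 0x25, 200L));  // STATUS_FETCH_GONE from a segment in between
        System.out.println(latest(fromSegments).fetchTime);
    }
}
```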

Andrzej Bialecki made changes - 28/Dec/06 12:13 AM
Field        Original Value   New Value
Status       Open [ 1 ]       Closed [ 6 ]
Resolution                    Fixed [ 1 ]