Status codes fall into two distinct groups, though not along the lines of success/failure: DB status codes and Fetch status codes. Additionally, very few bits are available for a bitmask, because the status needs to fit in a byte.
My patch in progress now contains the following:

    public static final byte STATUS_DB_UNFETCHED = 0x01;
    /** Maximum value of DB-related status. */
    public static final byte STATUS_FETCH_SUCCESS = 0x21;
    /** Maximum value of fetch-related status. */
    public static final byte STATUS_SIGNATURE = 0x41;

    public static boolean hasDbStatus(CrawlDatum datum) {
      if (datum.status <= STATUS_DB_MAX) return true;
      return false;
    }

    public static boolean hasFetchStatus(CrawlDatum datum) {
      if (datum.status > STATUS_DB_MAX && datum.status <= STATUS_FETCH_MAX) return true;
      return false;
    }

... so, I went with ranges of values. The most unwieldy switch() statements in the current code were related to checking whether a status is a DB or a Fetch status, and the above two static methods handle this and simplify the code.

Regarding the redirect URL - because of space constraints I'd rather use Metadata for this. We already handle metadata efficiently, so performance doesn't suffer if we don't have any metadata to keep. It would make sense, though, to have a predefined key for this URL.

Fixed in trunk, rev. 490607. As a side effect it is now possible to correctly update CrawlDB from multiple segments, even if they contain duplicate pages - the code in CrawlDbReducer will correctly apply only the latest version of CrawlDatum.
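For readers who want to try the range-based check in isolation, here is a minimal self-contained sketch. The patch excerpt above does not show the values of STATUS_DB_MAX and STATUS_FETCH_MAX, so the bounds used here (0x1f and 0x3f) are assumptions chosen only to sit between the constants that are shown. The class name StatusRangeSketch is hypothetical, and the methods take a raw byte instead of a CrawlDatum so the example compiles on its own.

```java
// Sketch only, not the committed Nutch code.
final class StatusRangeSketch {
    static final byte STATUS_DB_UNFETCHED = 0x01;
    static final byte STATUS_DB_MAX = 0x1f;        // ASSUMED upper bound of the DB range
    static final byte STATUS_FETCH_SUCCESS = 0x21;
    static final byte STATUS_FETCH_MAX = 0x3f;     // ASSUMED upper bound of the fetch range
    static final byte STATUS_SIGNATURE = 0x41;

    /** True if the status falls in the DB range (at or below STATUS_DB_MAX). */
    static boolean hasDbStatus(byte status) {
        return status <= STATUS_DB_MAX;
    }

    /** True if the status falls in the fetch range (above DB max, at or below fetch max). */
    static boolean hasFetchStatus(byte status) {
        return status > STATUS_DB_MAX && status <= STATUS_FETCH_MAX;
    }

    public static void main(String[] args) {
        System.out.println(hasDbStatus(STATUS_DB_UNFETCHED));     // true
        System.out.println(hasFetchStatus(STATUS_FETCH_SUCCESS)); // true
        System.out.println(hasDbStatus(STATUS_SIGNATURE));        // false
        System.out.println(hasFetchStatus(STATUS_SIGNATURE));     // false
    }
}
```

With this layout a new DB or fetch status only needs a value inside the right range; the two predicates do not have to change.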
Andrzej Bialecki made changes - 28/Dec/06 12:13 AM
if ((code & FAILED) != 0) {
}
will always tell us whether a URL fetch failed, regardless of how many codes we add. As the code currently stands, adding a status code is likely to break things unless one carefully goes through every place where status codes are examined and makes sure the new code is properly accounted for.
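A minimal sketch of the flag-bit idea, for illustration only. The bit layout, the constant names other than FAILED, and the class name StatusFlagSketch are all hypothetical; nothing here is taken from the Nutch source. All values stay below 0x80, so they would still fit in a non-negative byte.

```java
// Sketch only: a hypothetical bit layout, not the Nutch status codes.
final class StatusFlagSketch {
    static final int FAILED = 0x40;   // flag bit: set on every failure code
    static final int FETCH  = 0x20;   // flag bit: set on every fetch-related code

    // The low bits distinguish individual codes within a group.
    static final int FETCH_SUCCESS = FETCH | 0x01;
    static final int FETCH_RETRY   = FETCH | FAILED | 0x01;
    static final int FETCH_GONE    = FETCH | FAILED | 0x02;

    /** True for any code carrying the FAILED bit, including codes added later. */
    static boolean isFailed(int code) {
        // Parenthesize the mask: in Java, != binds tighter than &.
        return (code & FAILED) != 0;
    }

    public static void main(String[] args) {
        System.out.println(isFailed(FETCH_SUCCESS)); // false
        System.out.println(isFailed(FETCH_RETRY));   // true
        System.out.println(isFailed(FETCH_GONE));    // true
    }
}
```

A failure code added later, say FETCH | FAILED | 0x03, would be recognized by isFailed without touching any existing check. The cost is that each flag uses up one of the few bits available in a byte.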
While you're changing the CrawlDatum, it might also make sense to store a second URL, e.g. that of the redirect target. I have a hunch this will be very useful.
Just some thoughts. Thanks for making this happen.
Doug