Issue Details

Key: NUTCH-416
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Andrzej Bialecki
Reporter: Andrzej Bialecki
Votes: 0
Watchers: 0

Nutch

CrawlDatum status and CrawlDbReducer refactoring

Created: 15/Dec/06 12:45 PM   Updated: 28/Dec/06 12:13 AM
Component/s: None
Affects Version/s: 0.9.0
Fix Version/s: 0.9.0

Time Tracking:
Not Specified

Resolution Date: 28/Dec/06 12:13 AM


Description
CrawlDatum needs more status codes, e.g. to reflect redirected pages. However, the current status code values are consecutive, which prevents us from inserting new codes in their proper places. This is also related to the logic in CrawlDbReducer, which makes decisions based on the arithmetic ordering of status code values.

I propose to change the codes so that they are grouped into related values, with significant gaps between groups so that new codes can be added without significant reordering. I also propose to change the logic in CrawlDbReducer so that its operation depends less on the actual code values.
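
A minimal sketch of what such grouping could look like, in Java. The class name, constant names, and numeric values here are hypothetical and chosen only for illustration; they are not taken from this issue or from the committed change. Each group occupies its own range with an upper bound, and callers ask which group a status belongs to instead of comparing raw values.

```java
final class StatusGroups {
    // Hypothetical "db" group: statuses stored in the crawl database.
    static final byte DB_UNFETCHED = 0x01;
    static final byte DB_FETCHED   = 0x02;
    static final byte DB_GONE      = 0x03;
    // Upper bound of the "db" group; values up to here are free for new db codes.
    static final byte DB_MAX       = 0x1f;

    // Hypothetical "fetch" group: statuses produced by a fetch attempt.
    static final byte FETCH_SUCCESS = 0x21;
    static final byte FETCH_RETRY   = 0x22;
    static final byte FETCH_GONE    = 0x23;
    // Upper bound of the "fetch" group.
    static final byte FETCH_MAX     = 0x3f;

    private StatusGroups() {}

    // True if the status falls in the "db" range, whatever its exact value.
    static boolean isDbStatus(byte status) {
        return status > 0 && status <= DB_MAX;
    }

    // True if the status falls in the "fetch" range.
    static boolean isFetchStatus(byte status) {
        return status > DB_MAX && status <= FETCH_MAX;
    }
}
```

With this layout, a new code such as a redirect status could take an unused value inside its group (for example 0x04 in the db range) without renumbering existing codes, and reducer-style logic that only calls isDbStatus or isFetchStatus would keep working unchanged.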

A mapping between old and new codes should also be added to facilitate backward compatibility with existing data. This mapping should be applied on the fly, without requiring explicit data conversion.
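
One way to apply such a mapping on the fly is to translate the status at read time, based on a version marker stored with each record. The sketch below is only an illustration under that assumption: the class name, the version numbers, and all old and new code values are hypothetical and not taken from this issue.

```java
import java.util.HashMap;
import java.util.Map;

final class StatusCompat {
    // Hypothetical record format versions; old records carry consecutive codes.
    static final int OLD_VERSION = 1;
    static final int CURRENT_VERSION = 2;

    // Hypothetical old (consecutive) codes.
    static final byte OLD_DB_UNFETCHED = 1;
    static final byte OLD_DB_FETCHED   = 2;
    static final byte OLD_DB_GONE      = 3;
    static final byte OLD_FETCH_SUCCESS = 5;

    // Hypothetical new (grouped) codes.
    static final byte NEW_DB_UNFETCHED = 0x01;
    static final byte NEW_DB_FETCHED   = 0x02;
    static final byte NEW_DB_GONE      = 0x03;
    static final byte NEW_FETCH_SUCCESS = 0x21;

    private static final Map<Byte, Byte> OLD_TO_NEW = new HashMap<>();
    static {
        OLD_TO_NEW.put(OLD_DB_UNFETCHED, NEW_DB_UNFETCHED);
        OLD_TO_NEW.put(OLD_DB_FETCHED, NEW_DB_FETCHED);
        OLD_TO_NEW.put(OLD_DB_GONE, NEW_DB_GONE);
        OLD_TO_NEW.put(OLD_FETCH_SUCCESS, NEW_FETCH_SUCCESS);
    }

    private StatusCompat() {}

    // Called while deserializing a record: current-format records pass through
    // unchanged, older records have their status translated to the new code.
    static byte upgrade(int recordVersion, byte status) {
        if (recordVersion >= CURRENT_VERSION) {
            return status;
        }
        Byte mapped = OLD_TO_NEW.get(status);
        if (mapped == null) {
            throw new IllegalArgumentException("Unknown old status code: " + status);
        }
        return mapped;
    }
}
```

Because the translation happens when a record is read, existing data can stay on disk in the old format and is rewritten with new codes only when it is next written out.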


