
Key: NUTCH-339
Type: Task
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Andrzej Bialecki
Reporter: Sami Siren
Votes: 2
Watchers: 5
Nutch

Refactor nutch to allow fetcher improvements

Created: 04/Aug/06 02:17 PM   Updated: 10/Apr/09 12:29 PM
Component/s: fetcher
Affects Version/s: 0.8
Fix Version/s: 1.0.0

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
  Type       Name              Date                 Uploaded by       Size
  File       Fetcher2 for .81  2007-01-25 05:52 AM  chee.wu           30 kB
  Text File  patch.txt         2006-08-04 02:58 PM  Andrzej Bialecki  39 kB
  Text File  patch2.txt        2006-08-04 11:32 PM  Andrzej Bialecki  44 kB
  Text File  patch3.txt        2006-09-08 08:16 AM  Doğacan Güney     44 kB
  Text File  patch4-fixed.txt  2006-11-25 09:41 AM  Andrzej Bialecki  41 kB
  Text File  patch4-trunk.txt  2006-11-24 06:53 PM  Andrzej Bialecki  39 kB
Environment: n/a

Resolution Date: 06/Feb/08 12:29 PM


Description
As I (and Stefan?) see it, there are two major areas in which the current
fetcher could be improved (in terms of speed):

1. The politeness code and the way it is implemented is the biggest
problem of the current fetcher (together with robots.txt handling).
A simple code change, replacing it with a PriorityQueue-based
solution, showed very promising results in increased I/O.

2. Changing the fetcher to use non-blocking I/O (this requires a great amount
of work, as we would need to implement the protocols from scratch again).
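
To make the PriorityQueue idea in item 1 concrete, here is a minimal sketch of how such a scheduler could look. It is not the code from the attached patches: the class name PoliteQueue, its methods, and the fixed per-host delay are all assumptions made for illustration. Hosts sit in a java.util.PriorityQueue ordered by the earliest time they may be contacted again, so a fetcher thread never has to block on one busy host.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch: hosts are ordered by the time they may be fetched again. */
public class PoliteQueue {
    /** A host together with the time (ms) at which it becomes eligible. */
    private static final class Slot implements Comparable<Slot> {
        final String host;
        final long readyAt;
        Slot(String host, long readyAt) { this.host = host; this.readyAt = readyAt; }
        @Override public int compareTo(Slot o) { return Long.compare(readyAt, o.readyAt); }
    }

    private final PriorityQueue<Slot> ready = new PriorityQueue<>();
    private final Map<String, ArrayDeque<String>> urlsByHost = new HashMap<>();
    // Remembers the earliest next contact per host (grows unbounded in this sketch).
    private final Map<String, Long> nextAllowed = new HashMap<>();
    private final long delayMs; // assumed fixed politeness delay between requests to one host

    public PoliteQueue(long delayMs) { this.delayMs = delayMs; }

    /** Queues a URL; a host with no pending URLs gets a new slot in the priority queue. */
    public void add(String host, String url, long now) {
        ArrayDeque<String> pending = urlsByHost.get(host);
        if (pending == null) {
            pending = new ArrayDeque<>();
            urlsByHost.put(host, pending);
            ready.add(new Slot(host, Math.max(now, nextAllowed.getOrDefault(host, 0L))));
        }
        pending.add(url);
    }

    /** Returns the next URL whose host is eligible at 'now', or null if none is. */
    public String poll(long now) {
        Slot head = ready.peek();
        if (head == null || head.readyAt > now) {
            return null; // nothing may be fetched yet; the caller can sleep or do other work
        }
        ready.poll();
        ArrayDeque<String> pending = urlsByHost.get(head.host);
        String url = pending.poll();
        nextAllowed.put(head.host, now + delayMs);
        if (pending.isEmpty()) {
            urlsByHost.remove(head.host);
        } else {
            ready.add(new Slot(head.host, now + delayMs)); // host becomes eligible after the delay
        }
        return url;
    }
}
```

The clock is passed in as a parameter so the behaviour is easy to test; a real fetcher would presumably use System.currentTimeMillis() and add synchronization for multiple fetcher threads.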

I would like to start working towards #1 by first refactoring
the current code (the plugins, actually) in the following way:

1. Move robots.txt handling out of the lib-http plugin.
Even though this is related only to HTTP, leaving it in lib-http
does not allow other kinds of scheduling strategies to be implemented
(it is hardcoded to fetch robots.txt from the same thread when requesting
a page from a site from which it hasn't yet tried to load robots.txt).

2. Move the politeness code out of the lib-http plugin.
It is really usable outside HTTP, and the current design also limits
changing the implementation (to a queue-based one).
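
As an illustration of what moving robots.txt handling out of the protocol plugin could enable, the sketch below keeps the rules in a protocol-independent cache with an injected loader. The names RobotsCache, isKnown, and rulesFor are hypothetical and do not come from Nutch or the attached patches. The point is only that a scheduler can ask whether rules for a host are already known and decide for itself when, and on which thread, to load them.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: robots.txt rules cached per host, independent of the protocol plugin. */
public class RobotsCache {
    private final Map<String, String> rulesByHost = new ConcurrentHashMap<>();
    // Assumed hook: maps a host name to the raw robots.txt text, however it is fetched.
    private final Function<String, String> loader;

    public RobotsCache(Function<String, String> loader) {
        this.loader = loader;
    }

    /** Lets a scheduler check for cached rules without triggering a fetch. */
    public boolean isKnown(String host) {
        return rulesByHost.containsKey(host);
    }

    /** Returns the cached rules, invoking the loader at most once per host. */
    public String rulesFor(String host) {
        return rulesByHost.computeIfAbsent(host, loader);
    }
}
```

With this shape, a queue-based fetcher could schedule the robots.txt request like any other item instead of fetching it inline on the thread that requests the first page.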

Where to move these? My suggestion is the Nutch core. Does anybody
see problems with this?

These refactoring activities are to be done in such a way that none
of the current functionality is (at least deliberately) changed,
thus leaving room to build the next-generation fetcher(s) without
destroying the old one at the same time.



Change History
Andrzej Bialecki made changes - 04/Aug/06 02:58 PM
Field Original Value New Value
Attachment patch.txt [ 12338155 ]
Sami Siren made changes - 04/Aug/06 04:38 PM
Affects Version/s 0.8 [ 12310224 ]
Fix Version/s 0.9.0 [ 12312013 ]
Affects Version/s 0.9.0 [ 12312013 ]
Andrzej Bialecki made changes - 04/Aug/06 11:32 PM
Attachment patch2.txt [ 12338197 ]
Doğacan Güney made changes - 08/Sep/06 08:16 AM
Attachment patch3.txt [ 12340443 ]
Andrzej Bialecki made changes - 24/Nov/06 06:53 PM
Attachment patch4-trunk.txt [ 12345638 ]
Andrzej Bialecki made changes - 24/Nov/06 07:04 PM
Assignee Sami Siren [ siren ] Andrzej Bialecki [ ab ]
Andrzej Bialecki made changes - 25/Nov/06 09:41 AM
Attachment patch4-fixed.txt [ 12345652 ]
chee.wu made changes - 25/Jan/07 05:52 AM
Attachment Fetcher2 for .81 [ 12349579 ]
Sami Siren made changes - 18/Apr/07 03:42 PM
Fix Version/s 1.0.0 [ 12312443 ]
Fix Version/s 0.9.0 [ 12312013 ]
Andrzej Bialecki made changes - 06/Feb/08 12:29 PM
Resolution Fixed [ 1 ]
Status Open [ 1 ] Closed [ 6 ]