Issue Details (XML | Word | Printable)

Key: NUTCH-339
Type: Task Task
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Andrzej Bialecki
Reporter: Sami Siren
Votes: 2
Watchers: 5
Operations

If you were logged in you would be able to see more operations.
Nutch

Refactor nutch to allow fetcher improvements

Created: 04/Aug/06 02:17 PM   Updated: 10/Apr/09 12:29 PM
Return to search
Component/s: fetcher
Affects Version/s: 0.8
Fix Version/s: 1.0.0

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works Fetcher2 for .81 2007-01-25 05:52 AM chee.wu 30 kB
Text File Licensed for inclusion in ASF works patch.txt 2006-08-04 02:58 PM Andrzej Bialecki 39 kB
Text File Licensed for inclusion in ASF works patch2.txt 2006-08-04 11:32 PM Andrzej Bialecki 44 kB
Text File Licensed for inclusion in ASF works patch3.txt 2006-09-08 08:16 AM Doğacan Güney 44 kB
Text File Licensed for inclusion in ASF works patch4-fixed.txt 2006-11-25 09:41 AM Andrzej Bialecki 41 kB
Text File Licensed for inclusion in ASF works patch4-trunk.txt 2006-11-24 06:53 PM Andrzej Bialecki 39 kB
Environment: n/a

Resolution Date: 06/Feb/08 12:29 PM


 Description  « Hide
As I (and Stefan?) see it there are two major areas the current fetcher could be
improved (as in speed)

1. Politeness code and how it is implemented is the biggest
problem of current fetcher(together with robots.txt handling).
With a simple code changes like replacing it with a PriorityQueue
based solution showed very promising results in increased IO.

2. Changing fetcher to use non blocking io (this requires great amount
of work as we need to implement the protocols from scratch again).

I would like to start with working towards #1 by first refactoring
the current code (plugins actually) in following way:

1. Move robots.txt handling away from (lib-http)plugin.
Even if this is related only to http, leaving it to lib-http
does not allow other kinds of scheduling strategies to be implemented
(it is hardcoded to fetch robots.txt from the same thread when requesting
a page from a site from witch it hasn't tried to load robots.txt)

2. Move code for politeness away from (lib-http)plugin
It is really usable outside http and also the current design limits
changing of the implementation (to queue based)

Where to move these, well my suggestion is the nutch core, does anybody
see problems with this?

These code refactoring activities are to be done in a way that none
of the current functionality is (at least deliberately) changed leaving
current functionality as is thus leaving room and possibility to build
the next generation fetcher(s) without destroying the old one at same time.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Andrzej Bialecki added a comment - 04/Aug/06 02:54 PM
Great minds think alike ... I started doing exactly this, and so far my patches seem to follow all requirements.

Here's my work-in-progress patch. Warning: not tested!


Andrzej Bialecki added a comment - 04/Aug/06 02:58 PM
Work-in-progress patch containing new Fetcher2, and supporting changes in Protocol API.

Andrzej Bialecki made changes - 04/Aug/06 02:58 PM
Field Original Value New Value
Attachment patch.txt [ 12338155 ]
Uros Gruber added a comment - 04/Aug/06 03:50 PM
I check my logs and see that the main speed issue with 0.8 is actualy MapReduce work. I takes about 3-4 seconds for one page. Fetching is done 20 maybe 30 miliseconds.

I don't know it this is right place to talk about this.


Sami Siren added a comment - 04/Aug/06 04:07 PM
I am not sure to what you refer to by this 3-4 sec but yes I agree threre are more aspects to optimize in fetcher, what I was firstly concerned was the fetching IO speed what was getting ridiculously low (not quite sure when this happened).

We should open more than one ticket to track these separate aspects. And for general discussion the mailing lista are perhaps the best place.


Sami Siren made changes - 04/Aug/06 04:38 PM
Affects Version/s 0.8 [ 12310224 ]
Fix Version/s 0.9.0 [ 12312013 ]
Affects Version/s 0.9.0 [ 12312013 ]
Andrzej Bialecki added a comment - 04/Aug/06 11:32 PM
This patch compiles and runs. Tested very lightly with a short fetchlist - please review & test.

Andrzej Bialecki made changes - 04/Aug/06 11:32 PM
Attachment patch2.txt [ 12338197 ]
Sami Siren added a comment - 07/Sep/06 06:16 PM
Andrzej,

are you still working with this or should I proceed as I originally planned


Andrzej Bialecki added a comment - 07/Sep/06 06:36 PM
By all means, if you have spare CPU cycles please go forward ... You can probably reuse parts of my patch related to Protocol API changes and robots handling, which if I'm not mistaken implement #1 from your list.

Doğacan Güney added a comment - 08/Sep/06 08:16 AM
I have made a few changes to Andrzej's latest patch. The biggest change is that BLOCKED_ADDR_QUEUE is now a priority queue and cleanExpiredServerBlocks should block threads a lot less. I am attaching this as patch3.txt.

Doğacan Güney made changes - 08/Sep/06 08:16 AM
Attachment patch3.txt [ 12340443 ]
Sami Siren added a comment - 14/Oct/06 02:45 AM

[[ Old comment, sent by email on Sun, 06 Aug 2006 08:06:13 +0300 ]]

The original Fetcher is no longer being polite?

Other than that both seem to be working ok based on a very
small crawl I did.

Some thoughts about the design (or perhaps more about how I did it

-the FetchQueue implementation could be in own class(file).

-I moved also the class that handles robots parsing to core

-I used existing FibonacciHeap.java (in org.apache.nutch.util) to back
up the fething queue the priority i used was the time(in seconds) one
can again fetch from that particular site, you can then use queue.peek
to see the highest priority site (the one that should be fetched next)
and check it's time and if needed read more records from recordreader.

-I created new Object Site that i queued, those objects contained a list
of urls from that site to be fetched from that site and a real time (in
seconds) when one can fetch again).

-Queue did hide the recordreader so fetcher threads only had to deal
with this queue

-I didn't add eny special method for robots.rules in Protocol interface
(it's just like any other resource that's going to be fetched but
instead when a http url was read from recordreader for a site that has
not earlier seen the robots.txt was put as a normal resource for that
site to be fetched earlier (Site). and when that resource was fetched it
was advertiset to FetchingQueue wich then parsed it and stored it in
FetchSite object.

  • Also by using this FetchSite object I could easily implement some
    useful methods like block all urls from this site (for example when
    hostname cannot be resolved, or connections constanlty time out etc...)

Attached you can find a simple drawing I did earlier about the new
fetcher I had in mind - just for a reference if my words are confusing


Sami Siren

[demime 1.01d removed an attachment of type image/png which had a name of fetcher.png]


Andrzej Bialecki added a comment - 24/Nov/06 06:53 PM
These patches implement a queue-based Fetcher, where fetching threads don't spin-wait for blocking entries.

A few comments on the architecture of Fetcher2:

  • per-host blocking is disabled in lib-http if plugins are used with Fetcher2, because in this case the Fetcher2 handles blocking - otherwise the plugin works as before (the end effect is that you can still use the plain Fetcher with these patches).
  • RobotRules can be obtained now from any protocol, and a default dummy implementation is provided for those protocols that normally don't define them.
  • fetchlist records are read by a separate thread (QueueFeeder), and stuffed into a set of queues, based on a combination of protocol + host name (or host address, depending on a setting); i.e. for a URL "http://www.cnn.com/SPORT/index.html" the queueID will be either "http://www.cnn.com" or "http://64.236.24.28" . QueueFeeder maintains a fixed total size of the queues (N * number of fetcher threads), until it exhausts all input records.
  • each proto/host queue keeps its own information about:
  • max number of threads (maxThreads) for this proto/host combination
  • crawlDelay (when maxThreads == 1) and minCrawlDelay (when maxThreads > 1)
  • a set of items currently being processed (inProgress)
  • time when the last fetch request was finished (endTime)

Items are picked from the queue in a FIFO fashion, if inProgress.size() < maxThreads and if endTime + crawlDelay < now. Picked items are recorded in inProgress set.

  • there is one global set of queues in the fetcher, with some utility methods to keep track of the total number of queued items, and to get the first eligible item from any queue.
  • FetcherThread-s try to pick up new work items from the queues, or spin-wait if none are available yet.
  • when both the input and the queues are exhausted fetcher will finish its map operation.

In my limited experiments I didn't notice the previous effects of thread starvation, because threads don't block if they can't process current item. However, there are still issues with very slow sites (most probably we need to terminate such threads), and in case of slow sites and many pages from the same host fetch items still tend to accumulate - so at the end of the fetch the speed may be still slightly lower.

The advantage of this new architecture is that it's much much easier to understand how blocking occurs, and also that reading from input is decoupled from further processing, which should make it easier to move later on to NIO-based processing (non-blocking).

Some open issues:

  • it was quite difficult to consistently measure the fetching speed. Due to changing network conditions results vary even for the same fetchlist, and even with the same implementation - and differences can be significant (like 15 pages/s for one run vs. 3 pages/s for another run with exactly same parameters).
  • I decided for now not to use NIO. The reason is that protocol plugins don't support it, so if we switched to select-based modus operandi we would have to rewrite all protocol plugins.

Please give it a try - comments, suggestions and patches are welcome!


Andrzej Bialecki made changes - 24/Nov/06 06:53 PM
Attachment patch4-trunk.txt [ 12345638 ]
Andrzej Bialecki made changes - 24/Nov/06 07:04 PM
Assignee Sami Siren [ siren ] Andrzej Bialecki [ ab ]
Sami Siren added a comment - 24/Nov/06 09:51 PM
patch applies ok, but there's this error when I try to compile:

compile:
[echo] Compiling plugin: lib-http
[javac] Compiling 4 source files to /home/sam/tru/nutch/build/lib-http/classes
[javac] /home/sam/tru/nutch/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:551: incompatible types
[javac] found : org.apache.nutch.protocol.http.api.RobotRulesParser.RobotRuleSet
[javac] required: org.apache.nutch.protocol.RobotRules
[javac] return robots.getRobotRulesSet(this, url);
[javac] ^
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 1 error


Andrzej Bialecki added a comment - 25/Nov/06 09:41 AM
Sorry, the patch was incomplete - please try patch4-fixed.txt instead.

Andrzej Bialecki made changes - 25/Nov/06 09:41 AM
Attachment patch4-fixed.txt [ 12345652 ]
Sami Siren added a comment - 28/Nov/06 05:14 AM
When running a test fetch with Fetcher2 I enountered this error after fetching few thousand pages (of 1 million segment):

Exception in thread "QueueFeeder" java.lang.NullPointerException at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:244)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:296)
at org.apache.hadoop.io.SequenceFile$Reader.getPosition(SequenceFile.java:1433)
at org.apache.hadoop.mapred.SequenceFileRecordReader.getPos(SequenceFileRecordReader.java:97)
at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:196)
at org.apache.nutch.fetcher.Fetcher2$QueueFeeder.run(Fetcher2.java:364)


Andrzej Bialecki added a comment - 28/Nov/06 08:26 AM
This looks weird, if anything it rather seems caused by a bug in Hadoop - are you able to run "readseg -dump" on this fetchlist?

Another idea: do you have any "lease expired" messages in your log about that time? It looks like maybe the underlying input stream has been closed.


Sami Siren added a comment - 28/Nov/06 03:53 PM
perhaps thath exception is just a consequence of something other like this:

2006-11-27 07:35:09,434 INFO fetcher.Fetcher2 - -activeThreads=296, spinWaiting=204, fetchQueues.totalSize=0
2006-11-27 07:35:09,434 WARN fetcher.Fetcher2 - Aborting with 296 hung threads.2006-11-27 07:35:09,434 INFO mapred.LocalJobRunner - 3821 pages, 207 errors, 5.5 pages/s, 780 kb/s,

and the next log entry is:

2006-11-27 07:35:15,443 INFO mapred.JobClient - map 100% reduce 0%


Andrzej Bialecki added a comment - 28/Nov/06 04:15 PM
Ah, we are getting somewhere ... fetchQueues.totalSize=0 means that all input entries from the queues have been processed. You are running with 500 threads, out of which 296 threads are still processing requests, and 204 threads are idle because they don't have anything more to do (queues are empty).

Are you running in parsing mode? Could you please kill -SIGQUIT <pid> to produce a thread dump and see why these threads are waiting? I bet they hang on regexes or pdf parsing, or a DOMFragment bug ...

Before this happened, when spinWaiting was still close to 0, what was the maximum fetching speed? Was it higher/lower/comparable to the regular fetcher? What about the CPU usage?


Sami Siren added a comment - 28/Nov/06 05:52 PM
I am running with 300 thread, and in parsing mode

thread dump shows:

191 threads waiting on condition
at java.lang.Thread.sleep(Native Method)
at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:422)

71 waiting for monitor entry
at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.getFetchItem(Fetcher2.java:306)

  • waiting to lock <0x52fa7328> (a org.apache.nutch.fetcher.Fetcher2$FetchItemQueues)
    at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:415)

rest are runnable

cpu usage starts low but very quickly in ramps up and machine gets almost unresponsive.

fetching speed is low because all cpu goes to something else.


chee.wu added a comment - 25/Jan/07 05:52 AM
Orginally, Fetcher2 can't work togehter with Nutch.81,here I provide a new portion for it.
Though some key improvement lik "Move robots.txt handling away from (lib-http)plugin" is commented out, this new portion did acheive a 80% speed increasment in my test,comprised of orginal Fetcher.java in .81.

chee.wu made changes - 25/Jan/07 05:52 AM
Attachment Fetcher2 for .81 [ 12349579 ]
Andrzej Bialecki added a comment - 25/Jan/07 07:42 AM
Well, then this version doesn't work correctly - the "performance improvement" you see is a result of violating robots.xt and politeness settings.

Sami Siren made changes - 18/Apr/07 03:42 PM
Fix Version/s 1.0.0 [ 12312443 ]
Fix Version/s 0.9.0 [ 12312013 ]
Andrzej Bialecki made changes - 06/Feb/08 12:29 PM
Resolution Fixed [ 1 ]
Status Open [ 1 ] Closed [ 6 ]
Andrzej Bialecki added a comment - 06/Feb/08 12:30 PM
Fetcher2 has been committed long ago - I'm closing this. If any remaining matters still need to be solved please create a separate issue.