Key: NUTCH-344
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Unassigned
Reporter: Greg Kim
Votes: 0
Watchers: 2
Nutch

Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

Created: 07/Aug/06 11:54 PM   Updated: 24/Sep/06 03:30 PM
Component/s: fetcher
Affects Version/s: 0.8, 0.8.1, 0.9.0
Fix Version/s: 0.8.1, 0.9.0

Time Tracking:
Not Specified

File Attachments:
cleanExpiredServerBlocks.patch (0.8 kB), 2006-08-07 11:54 PM, Greg Kim
HttpBase.patch (0.7 kB), 2006-08-10 04:12 AM, Jason Calabrese
Environment: All

Resolution Date: 08/Aug/06 07:09 PM


Description
The recent change to the following code in HttpBase.java tends to block fetcher threads while one thread busy-waits:

private static void cleanExpiredServerBlocks() {
  synchronized (BLOCKED_ADDR_TO_TIME) {
    while (!BLOCKED_ADDR_QUEUE.isEmpty()) { // <===== LINE 3
      String host = (String) BLOCKED_ADDR_QUEUE.getLast();
      long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
      if (time <= System.currentTimeMillis()) {
        BLOCKED_ADDR_TO_TIME.remove(host);
        BLOCKED_ADDR_QUEUE.removeLast();
      }
    }
  }
}

LINE 3: As long as BLOCKED_ADDR_QUEUE has any entries, the first thread to enter this block busy-waits until the queue is empty, while all other threads block on the synchronized block. This leads to extremely poor fetcher performance.

Since the check-in that respects crawlDelay in robots.txt, we are no longer guaranteed that the BLOCKED_ADDR_TO_TIME queue is a FIFO list. The simple fix is to iterate over the queue once rather than busy-waiting.
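
For illustration, the single-pass idea could look roughly like the sketch below. This is not the attached patch; the class name SinglePassCleanup, the generic field types, and the explicit now parameter (used here instead of System.currentTimeMillis() so the behavior is easy to check) are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

// Illustrative sketch only; not the committed Nutch code.
class SinglePassCleanup {
  // Hypothetical stand-ins for the static fields in HttpBase.
  static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();
  static final LinkedList<String> BLOCKED_ADDR_QUEUE = new LinkedList<>();

  // Visits each queued host exactly once and drops the ones whose block
  // has expired, so the lock is held for a single bounded pass.
  static void cleanExpiredServerBlocks(long now) {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      Iterator<String> it = BLOCKED_ADDR_QUEUE.iterator();
      while (it.hasNext()) {
        String host = it.next();
        Long time = BLOCKED_ADDR_TO_TIME.get(host);
        // A missing map entry is treated as expired (an assumption of this sketch).
        if (time == null || time <= now) {
          BLOCKED_ADDR_TO_TIME.remove(host);
          it.remove(); // removes the same host that was just examined
        }
      }
    }
  }
}
```

Because the iterator removes exactly the element it just returned, the queue does not need to be ordered by expiry time for the pass to be correct.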



Sami Siren added a comment - 08/Aug/06 07:09 PM
I just committed this to 0.8 branch and trunk, thanks Greg!

Jason Calabrese added a comment - 10/Aug/06 04:12 AM
This fix missed one small change, which caused BLOCKED_ADDR_TO_TIME and BLOCKED_ADDR_QUEUE to get out of sync.

To fix the problem you only need to change the remove on line 385 to:
BLOCKED_ADDR_QUEUE.remove;

I can report that the fetch is now much faster with both of these fixes.
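
To make "out of sync" concrete, here is a small sketch of one way the two structures could drift apart. The thread does not show line 385 as committed, so the removeLast() call below is purely an assumption chosen to illustrate the failure mode; the class and method names are hypothetical as well.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

// Illustrative sketch only; the mismatch shown here is assumed, not taken from the commit.
class OutOfSyncDemo {
  // Removes the map entry for the host at index i, but removes the tail of
  // the queue instead of that host, so the two structures can disagree.
  static void mismatchedCleanup(Map<String, Long> toTime, LinkedList<String> queue, long now) {
    for (int i = queue.size() - 1; i >= 0; i--) {
      String host = queue.get(i);
      long time = toTime.get(host); // unboxing null throws NullPointerException
      if (time <= now) {
        toTime.remove(host);
        queue.removeLast(); // wrong element whenever i is not the last index
      }
    }
  }
}
```

After one pass over a queue whose expired entries are not at the tail, a host can remain queued with no map entry, and the next pass fails on the lookup for that host.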


Jacob Brunson added a comment - 10/Aug/06 04:21 AM
I'm having problems with the patch committed in revision #429779. I used to be having the "fetch aborted with X hung threads" problem. After updating to this revision, fetching goes fine for a while, but then I get this error on just about every page fetch attempt:
2006-08-09 23:27:28,548 INFO fetcher.Fetcher - fetching http://www.xmission.com/~nelsonb/resources.htm
2006-08-09 23:27:28,549 ERROR http.Http - java.lang.NullPointerException
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.cleanExpiredServerBlocks(HttpBase.java:382)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:323)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:188)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:144)
2006-08-09 23:27:28,549 INFO fetcher.Fetcher - fetch of http://www.xmission.com/~nelsonb/resources.htm failed with: java.lang.NullPointerException

Greg Kim added a comment - 10/Aug/06 05:06 AM
Had the correct version in my workspace; botched the copy over to the vendor trunk. doh! Thanks Jason for catching it!

Jacob, your problem should be resolved w/ the one line patch that Jason provided.


Jason Calabrese added a comment - 10/Aug/06 02:35 PM
This issue is still marked as resolved; it needs to be reopened so the patch can be committed to SVN.