[NUTCH-3067] Improve performance of FetchItemQueues if error state is preserved - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.20
Fix Version/s: 1.21
Component/s: fetcher
Labels: None

Description

In certain cases the error state of a fetch queue needs to be
preserved even if the queue is (currently) empty, because there may
still be URLs in the fetcher input not yet read by the QueueFeeder,
see NUTCH-2947. Keeping the queue together with its state is necessary

- to skip a queue together with all items queued now or to be queued
  later by the QueueFeeder, if the queue exceeds the maximum configured
  number of exceptions (NUTCH-769). This is mostly a performance feature,
  but it has implications for politeness because HTTP 403 Forbidden
  (and similar responses) are also counted as "exceptions".
- to implement an exponential backoff which slows down fetching from sites
  that respond with repeated "exceptions" (NUTCH-2946).

However, there is a drawback when all "stateful" queues are preserved
until the QueueFeeder has finished reading input fetch lists: Nutch's
fetch queue implementation becomes slow if there are too many queues.
This situation / issue was observed in the first cycle of a crawl
where only the homepages of millions of sites were fetched:

- about 1 million homepages per fetcher task
- about 25% of the homepage URLs caused exceptions: the fetch lists were not filtered beforehand for whether a site is reachable and responding
- consequently, after a certain amount of time (3-4 hours), 250k queues per task were "stateful" and preserved until the fetch list input was entirely read by the QueueFeeder
- with too many queues, most of them empty (no URLs), the operations on the queues become slow and fetching almost stalls (see screenshot)
  - many queues but few URLs queued (250k vs. 25)
  - most fetcher threads (190 out of 240) waiting for the lock on one of the synchronized methods of FetchItemQueues
  - the QueueFeeder is also affected by the lock, which explains why only few URLs are queued

Important notes: this is not an issue

- if no error state is preserved, that is, if fetcher.max.exceptions.per.queue == -1 and fetcher.exceptions.per.queue.delay == 0.0
- or if the crawl isn't too "broad" in terms of the number of different hosts (domains or IPs, depending on fetcher.queue.mode)
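
The first condition can be written as a small predicate. This is only a sketch of the rule as stated in this issue: the class and method names are hypothetical, and the two arguments stand for the values of fetcher.max.exceptions.per.queue and fetcher.exceptions.per.queue.delay, however they are read from the configuration.

```java
// Hypothetical helper, not part of Nutch: restates the condition from the notes above.
class ErrorStateCheck {

  // Error state is preserved unless both settings have their "disabled" value,
  // i.e. max exceptions == -1 and exception delay == 0.0.
  static boolean preservesErrorState(int maxExceptionsPerQueue,
      double exceptionsPerQueueDelay) {
    return maxExceptionsPerQueue != -1 || exceptionsPerQueueDelay != 0.0;
  }

  public static void main(String[] args) {
    System.out.println(preservesErrorState(-1, 0.0)); // false: nothing to preserve
    System.out.println(preservesErrorState(25, 0.0)); // true: exception threshold set
    System.out.println(preservesErrorState(-1, 5.0)); // true: backoff delay set
  }
}
```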

Possible solutions:

1. do not keep every stateful queue: drop queues which have a low exception count after a configurable amount of time. If a second URL from the same host/domain/IP is fetched after a considerably long time span (e.g. 30 minutes), the effect on performance and politeness should be negligible.

2. review the implementation of FetchItemQueues and the locking (synchronized methods)

3. at least, try to prioritize QueueFeeder, for example by a method which adds multiple fetch items within one synchronized call
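
Solutions 1 and 3 could look roughly like the following sketch. It is not the Nutch code: QueueSketch, HostQueue, addAll, recordException, evictStale and the two constructor parameters are hypothetical names, and a queue is reduced to a list of pending URLs plus an exception counter and a timestamp.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for FetchItemQueues (not the Nutch class).
class QueueSketch {

  // One queue per host: pending URLs plus the preserved error state.
  static final class HostQueue {
    final Deque<String> urls = new ArrayDeque<>();
    int exceptionCount;
    long lastExceptionMillis;
  }

  private final Map<String, HostQueue> queues = new HashMap<>();
  private final int lowExceptionCount; // assumed setting: what counts as "low"
  private final long maxIdleMillis;    // assumed setting: retention time

  QueueSketch(int lowExceptionCount, long maxIdleMillis) {
    this.lowExceptionCount = lowExceptionCount;
    this.maxIdleMillis = maxIdleMillis;
  }

  // Solution 3: the feeder adds a whole batch while taking the lock once.
  // Each batch entry is a {host, url} pair.
  synchronized void addAll(List<String[]> batch) {
    for (String[] hostAndUrl : batch) {
      queues.computeIfAbsent(hostAndUrl[0], h -> new HostQueue())
          .urls.add(hostAndUrl[1]);
    }
  }

  synchronized void recordException(String host, long nowMillis) {
    HostQueue q = queues.computeIfAbsent(host, h -> new HostQueue());
    q.exceptionCount++;
    q.lastExceptionMillis = nowMillis;
  }

  // Solution 1: drop empty queues with a low exception count whose last
  // exception is at least maxIdleMillis old. Returns the number dropped.
  synchronized int evictStale(long nowMillis) {
    int before = queues.size();
    queues.values().removeIf(q -> q.urls.isEmpty()
        && q.exceptionCount <= lowExceptionCount
        && nowMillis - q.lastExceptionMillis >= maxIdleMillis);
    return before - queues.size();
  }

  synchronized int queueCount() {
    return queues.size();
  }

  public static void main(String[] args) {
    long thirtyMinutes = 30 * 60 * 1000L;
    QueueSketch sketch = new QueueSketch(1, thirtyMinutes);
    sketch.recordException("a.example", 0L);
    sketch.recordException("b.example", 0L);
    sketch.recordException("b.example", 1000L);
    sketch.addAll(Arrays.asList(
        new String[] {"c.example", "https://c.example/"},
        new String[] {"c.example", "https://c.example/about"}));
    // Only a.example is empty, has a low count, and is old enough.
    System.out.println(sketch.evictStale(thirtyMinutes)); // prints 1
    System.out.println(sketch.queueCount());              // prints 2
  }
}
```

The batch method keeps the single coarse lock but lets the feeder acquire it once per batch instead of once per URL. The eviction pass still iterates over all queues, so in this sketch it would have to run infrequently, not on every fetch.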

Details and data:

Screenshot of the Fetcher map task status in the Hadoop YARN Web UI (attached)

Counts of the top (deepest) line in the stack traces of all Fetcher threads:

120             at org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
49              at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
21              at org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
19              at java.net.PlainSocketImpl.socketConnect(java.base@11.0.24/Native Method)
18              at java.net.SocketInputStream.socketRead0(java.base@11.0.24/Native Method)
6               at java.lang.Object.wait(java.base@11.0.24/Native Method)  # waiting for HTTP/2 stream
4               at java.lang.Thread.sleep(java.base@11.0.24/Native Method)
2               at java.net.Inet4AddressImpl.lookupAllHostAddr(java.base@11.0.24/Native Method)
1               at java.util.Collections$SynchronizedCollection.size(java.base@11.0.24/Collections.java:2017)
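
Counts like these can be reproduced from a thread dump with a few lines of code. The sketch below is one possible way, not necessarily how the numbers above were produced; TopFrameCounter and countTopFrames are hypothetical names, and it assumes the dump format shown in the stack traces of this issue (a quoted thread name on the header line, followed by "at ..." frames).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts the top stack frame per thread in a thread dump.
class TopFrameCounter {

  // Returns top frame -> number of threads, for threads with the given name.
  static Map<String, Integer> countTopFrames(List<String> dumpLines,
      String threadName) {
    Map<String, Integer> counts = new HashMap<>();
    String header = "\"" + threadName + "\"";
    boolean awaitingTopFrame = false;
    for (String line : dumpLines) {
      String trimmed = line.trim();
      if (trimmed.startsWith("\"")) {
        // New thread header: only track threads with the requested name.
        awaitingTopFrame = trimmed.startsWith(header);
      } else if (awaitingTopFrame && trimmed.startsWith("at ")) {
        // First frame after the header is the top of the stack.
        counts.merge(trimmed, 1, Integer::sum);
        awaitingTopFrame = false;
      }
    }
    return counts;
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: java TopFrameCounter.java <thread-dump-file>");
      return;
    }
    List<String> lines = Files.readAllLines(Paths.get(args[0]));
    // Print frames sorted by descending count.
    countTopFrames(lines, "FetcherThread").entrySet().stream()
        .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
        .forEach(e -> System.out.printf("%-8d%s%n", e.getValue(), e.getKey()));
  }
}
```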

Full stack traces (three examples):

"FetcherThread" #38 daemon prio=5 os_prio=0 cpu=43743.17ms elapsed=15890.29s tid=0x0000752967fff800 nid=0x83a3c waiting for monitor entry  [0x000075292fcf9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
        - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:301)
"FetcherThread" #72 daemon prio=5 os_prio=0 cpu=38381.67ms elapsed=15881.02s tid=0x000075292822d000 nid=0x83a91 waiting for monitor entry  [0x0000752926cfe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
        - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
        at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:338)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:489)
"FetcherThread" #43 daemon prio=5 os_prio=0 cpu=39112.96ms elapsed=15889.09s tid=0x0000752928361000 nid=0x83a41 waiting for monitor entry  [0x000075292d65f000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
        - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:345)

Stack of the blocked QueueFeeder:

"QueueFeeder" #31 daemon prio=5 os_prio=0 cpu=19415.88ms elapsed=15926.65s tid=0x000075296780c800 nid=0x83a30 waiting for monitor entry  [0x000075292fff9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:142)
        - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
        at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:136)
        at org.apache.nutch.fetcher.QueueFeeder.run(QueueFeeder.java:141)

Flamegraph of a profiler run (async-profiler) of a "stale"/slow Fetcher map task (attached)