Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.1
    • Component/s: fetcher
    • Labels:
      None
    • Environment:

      Fedora Core r6, Kernel 2.6.22-14, jdk1.6.0_12

      Description

      Fetcher2 fetches far more slowly than Fetcher1.

      Config options:
      fetcher.threads.fetch = 80
      fetcher.threads.per.host = 80
      fetcher.server.delay = 0
      generate.max.per.host = 1

      With a queue size of ~40,000, the result is:

      activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0

      with maybe a download of 1 page per second.

      Runing with -noParse makes little difference.

      CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0

      Hosts already cached by local caching NS appear to download quickly upon a re-fetch, so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast without pre-caching hosts.

      1. crawl_generate.tar.gz
        387 kB
        Roger Dunk
      2. NUTCH-721.patch
        1 kB
        Julien Nioche
      3. nutch-site.xml
        1 kB
        Roger Dunk

        Activity

        Hide
        Doğacan Güney added a comment -

        Code committed as of rev. 807485.

        I am closing this issue. Of course, there may be other reasons why Fetcher2 is slow, so feel free to create new issues if so.

        Show
        Doğacan Güney added a comment - Code committed as of rev. 807485. I am closing this issue. Of course, there may be other reasons why Fetcher2 is slow, so feel free to create new issues if so.
        Hide
        Julien Nioche added a comment -

        Sets the default value for fetcher.threads.per.host.by.ip to false

        Show
        Julien Nioche added a comment - Sets the default value for fetcher.threads.per.host.by.ip to false
        Hide
        Doğacan Güney added a comment -

        Thanks for the analysis, Julien! Can you make a patch for the conf changes so we can commit it with your name?

        Show
        Doğacan Güney added a comment - Thanks for the analysis, Julien! Can you make a patch for the conf changes so we can commit it with your name?
        Hide
        Andrzej Bialecki added a comment -

        +1. Current defaults are sub-optimal due to backward-compatibility issues with early Nutch 0.8. This should be no longer a concern.

        Show
        Andrzej Bialecki added a comment - +1. Current defaults are sub-optimal due to backward-compatibility issues with early Nutch 0.8. This should be no longer a concern.
        Hide
        Julien Nioche added a comment -

        I had another look at this issue after applying the patch from Nutch-719. I can easily reproduce the situation from the original post by setting fetcher.threads.per.host.by.ip to true. The nutch-site file sent by Rodger does not specify it so it would rely on this value by default. Once setting it to false all threads are active and the fetching is much faster.

        I have used the first 5K URLs from the fetchlist sent by Rodger and compared the perfs with by.ip set to false :

        OldFetcher :
        real 32m26.003s
        user 1m11.768s
        sys 0m10.337s

        OldFetcher :
        real 30m52.965s
        user 1m10.696s
        sys 0m10.425s

        Fetcher :
        real 31m21.924s
        user 1m12.725s
        sys 0m10.797s

        Fetcher :
        real 30m3.017s
        user 1m15.509s
        sys 0m10.909s

        I ran each step twice and as we can see the results are comparable.

        This explanation is also compliant with Steven's observation that we get 5-7 times the rate as we would hit the DNS cache for subsequent calls for URLs from non unique sites. The IP resolution is done by the QueueFeeder which explains why it is slowing down the number of URLs being available for fetching.

        I don't think that the oldFetcher allows to group URLs by IP for politeness in which case why not making fetcher.threads.per.host.by.ip default to false in the new fetcher?

        Show
        Julien Nioche added a comment - I had another look at this issue after applying the patch from Nutch-719. I can easily reproduce the situation from the original post by setting fetcher.threads.per.host.by.ip to true. The nutch-site file sent by Rodger does not specify it so it would rely on this value by default. Once setting it to false all threads are active and the fetching is much faster. I have used the first 5K URLs from the fetchlist sent by Rodger and compared the perfs with by.ip set to false : OldFetcher : real 32m26.003s user 1m11.768s sys 0m10.337s OldFetcher : real 30m52.965s user 1m10.696s sys 0m10.425s Fetcher : real 31m21.924s user 1m12.725s sys 0m10.797s Fetcher : real 30m3.017s user 1m15.509s sys 0m10.909s I ran each step twice and as we can see the results are comparable. This explanation is also compliant with Steven's observation that we get 5-7 times the rate as we would hit the DNS cache for subsequent calls for URLs from non unique sites. The IP resolution is done by the QueueFeeder which explains why it is slowing down the number of URLs being available for fetching. I don't think that the oldFetcher allows to group URLs by IP for politeness in which case why not making fetcher.threads.per.host.by.ip default to false in the new fetcher?
        Hide
        Doğacan Güney added a comment -

        Steven, if you have time/hardware, can you retry your use-case with OIdFetcher in trunk?

        Show
        Doğacan Güney added a comment - Steven, if you have time/hardware, can you retry your use-case with OIdFetcher in trunk?
        Hide
        Steven Denny added a comment -

        I've done some testing on this and looked at the number of pages being fed, as this obvioulsy limits the number of pages you can fetch:

        2009-07-10 04:01:35,296 INFO fetcher.Fetcher - Fed 500 urls in 186 secs = 2.7url/s
        2009-07-10 04:04:18,343 INFO fetcher.Fetcher - Fed 499 urls in 163 secs = 3.1url/s
        2009-07-10 04:06:57,109 INFO fetcher.Fetcher - Fed 498 urls in 158 secs = 3.2url/s
        2009-07-10 04:10:38,282 INFO fetcher.Fetcher - Fed 499 urls in 221 secs = 2.3url/s
        2009-07-10 04:12:58,371 INFO fetcher.Fetcher - Fed 498 urls in 140 secs = 3.6url/s
        2009-07-10 04:16:12,275 INFO fetcher.Fetcher - Fed 499 urls in 193 secs = 2.6url/s
        2009-07-10 04:19:20,162 INFO fetcher.Fetcher - Fed 499 urls in 187 secs = 2.7url/s
        2009-07-10 04:21:25,846 INFO fetcher.Fetcher - Fed 499 urls in 125 secs = 4.0url/s
        2009-07-10 04:24:16,049 INFO fetcher.Fetcher - Fed 495 urls in 170 secs = 2.9url/s
        2009-07-10 04:27:01,944 INFO fetcher.Fetcher - Fed 499 urls in 165 secs = 3.0url/s
        2009-07-10 04:29:26,247 INFO fetcher.Fetcher - Fed 499 urls in 144 secs = 3.5url/s
        2009-07-10 04:32:02,590 INFO fetcher.Fetcher - Fed 499 urls in 156 secs = 3.2url/s
        2009-07-10 04:34:49,985 INFO fetcher.Fetcher - Fed 498 urls in 167 secs = 3.0url/s
        2009-07-10 04:37:28,367 INFO fetcher.Fetcher - Fed 498 urls in 158 secs = 3.2url/s
        2009-07-10 04:40:09,865 INFO fetcher.Fetcher - Fed 499 urls in 161 secs = 3.1url/s
        2009-07-10 04:42:55,203 INFO fetcher.Fetcher - Fed 499 urls in 165 secs = 3.0url/s

        That test was one a crawldb just initialised with 11,000 urls (unique sites).

        However, on the next iteration where I'm feeding urls from non-unique sites, I see 5-7 times that rate. (My test system is a vm on our network, with nothing special in terms of DNS. Someof the look ups were taking 5-6 seconds).

        Show
        Steven Denny added a comment - I've done some testing on this and looked at the number of pages being fed, as this obvioulsy limits the number of pages you can fetch: 2009-07-10 04:01:35,296 INFO fetcher.Fetcher - Fed 500 urls in 186 secs = 2.7url/s 2009-07-10 04:04:18,343 INFO fetcher.Fetcher - Fed 499 urls in 163 secs = 3.1url/s 2009-07-10 04:06:57,109 INFO fetcher.Fetcher - Fed 498 urls in 158 secs = 3.2url/s 2009-07-10 04:10:38,282 INFO fetcher.Fetcher - Fed 499 urls in 221 secs = 2.3url/s 2009-07-10 04:12:58,371 INFO fetcher.Fetcher - Fed 498 urls in 140 secs = 3.6url/s 2009-07-10 04:16:12,275 INFO fetcher.Fetcher - Fed 499 urls in 193 secs = 2.6url/s 2009-07-10 04:19:20,162 INFO fetcher.Fetcher - Fed 499 urls in 187 secs = 2.7url/s 2009-07-10 04:21:25,846 INFO fetcher.Fetcher - Fed 499 urls in 125 secs = 4.0url/s 2009-07-10 04:24:16,049 INFO fetcher.Fetcher - Fed 495 urls in 170 secs = 2.9url/s 2009-07-10 04:27:01,944 INFO fetcher.Fetcher - Fed 499 urls in 165 secs = 3.0url/s 2009-07-10 04:29:26,247 INFO fetcher.Fetcher - Fed 499 urls in 144 secs = 3.5url/s 2009-07-10 04:32:02,590 INFO fetcher.Fetcher - Fed 499 urls in 156 secs = 3.2url/s 2009-07-10 04:34:49,985 INFO fetcher.Fetcher - Fed 498 urls in 167 secs = 3.0url/s 2009-07-10 04:37:28,367 INFO fetcher.Fetcher - Fed 498 urls in 158 secs = 3.2url/s 2009-07-10 04:40:09,865 INFO fetcher.Fetcher - Fed 499 urls in 161 secs = 3.1url/s 2009-07-10 04:42:55,203 INFO fetcher.Fetcher - Fed 499 urls in 165 secs = 3.0url/s That test was one a crawldb just initialised with 11,000 urls (unique sites). However, on the next iteration where I'm feeding urls from non-unique sites, I see 5-7 times that rate. (My test system is a vm on our network, with nothing special in terms of DNS. Someof the look ups were taking 5-6 seconds).
        Hide
        Roger Dunk added a comment -

        My tests were done on a segment with only 1 URL per host (generate.max.per.host = 1), so I don't believe what Ken has to say is the reason, at least in my case, for Fetcher2 performing slowly.

        Show
        Roger Dunk added a comment - My tests were done on a segment with only 1 URL per host (generate.max.per.host = 1), so I don't believe what Ken has to say is the reason, at least in my case, for Fetcher2 performing slowly.
        Show
        Otis Gospodnetic added a comment - Ken's thoughts: http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
        Hide
        Otis Gospodnetic added a comment -

        Questions:
        Has anyone tried profiling this? (may be relevant: http://markmail.org/message/4ixrnvfycpgmkdno )

        Or maybe simply debugged/timed various blocks of code using something as simple as print statements and simple timers?

        Or maybe running just a single thread and then doing kill -QUIT a number of times to simply try and spot the method where the code seems to spend a lot of its time?

        Show
        Otis Gospodnetic added a comment - Questions: Has anyone tried profiling this? (may be relevant: http://markmail.org/message/4ixrnvfycpgmkdno ) Or maybe simply debugged/timed various blocks of code using something as simple as print statements and simple timers? Or maybe running just a single thread and then doing kill -QUIT a number of times to simply try and spot the method where the code seems to spend a lot of its time?
        Hide
        Roger Dunk added a comment -

        Julien, yes, fetcher.threads.per.host.by.ip was set to false in the above tests. I have also tried it with true, which certainly didn't help the speed issue, but I can't comment on the hung threads as I didn't bother letting the fetch complete. I'd say there are two, likely unrelated problems with Fetcher2.

        Show
        Roger Dunk added a comment - Julien, yes, fetcher.threads.per.host.by.ip was set to false in the above tests. I have also tried it with true, which certainly didn't help the speed issue, but I can't comment on the hung threads as I didn't bother letting the fetch complete. I'd say there are two, likely unrelated problems with Fetcher2.
        Hide
        Julien Nioche added a comment -

        The message about the "Aborted hung threads" looks like what I described in https://issues.apache.org/jira/browse/NUTCH-719 except that in this case there are active queues but fetchQueues.totalSize=0

        Roger : can you confirm that the parameter fetcher.threads.per.host.by.ip is set to false?

        Show
        Julien Nioche added a comment - The message about the "Aborted hung threads" looks like what I described in https://issues.apache.org/jira/browse/NUTCH-719 except that in this case there are active queues but fetchQueues.totalSize=0 Roger : can you confirm that the parameter fetcher.threads.per.host.by.ip is set to false?
        Hide
        Roger Dunk added a comment -

        I did a -"topN 5000", so only a subset of the attached, but still only 1 URL per host. The following is with 20 threads and also no parsing.

        [root@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655 -threads 20 -noParsing

        [...]

        Aborting with 20 hung threads.
        Fetcher: done

        real 60m14.926s
        user 0m38.671s
        sys 0m6.134s

        Show
        Roger Dunk added a comment - I did a -"topN 5000", so only a subset of the attached, but still only 1 URL per host. The following is with 20 threads and also no parsing. [root@server1 trunk] # time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655 -threads 20 -noParsing [...] Aborting with 20 hung threads. Fetcher: done real 60m14.926s user 0m38.671s sys 0m6.134s
        Hide
        Doğacan Güney added a comment -

        Wow, 53 min vs 3 min !?

        Thanks a lot for testing and that is indeed very worrying.

        Which 5000 url set did you use? I think the crawl_generate you attached to this issue has 13K urls?

        PS: One small thing new Fetcher requires less threads than OldFetcher. If you have time can you try with
        smaller number of threads (say, 15-20)?

        Show
        Doğacan Güney added a comment - Wow, 53 min vs 3 min !? Thanks a lot for testing and that is indeed very worrying. Which 5000 url set did you use? I think the crawl_generate you attached to this issue has 13K urls? PS: One small thing new Fetcher requires less threads than OldFetcher. If you have time can you try with smaller number of threads (say, 15-20)?
        Hide
        Hudson added a comment -

        Integrated in Nutch-trunk #772 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/772/)

        • Commit old fetcher as OldFetcher for now so that we can test Fetcher2 performance.
        Show
        Hudson added a comment - Integrated in Nutch-trunk #772 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/772/ ) Commit old fetcher as OldFetcher for now so that we can test Fetcher2 performance.
        Hide
        Roger Dunk added a comment -

        For the following tests I've used the same segment containing 5000 URLs. I cleaned the named cache before the first two tests.

        [root@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.OldFetcher newcrawl/segments/20090402130655/

        real 3m38.084s
        user 2m20.887s
        sys 0m7.470s

        [root@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655/

        [...]

        Fetcher: done

        real 53m44.800s
        user 2m20.070s
        sys 0m9.527s

        For this next test, I used the same segment but didn't clear the named cache from the previous test, so all resolvable hosts should still be cached. This appeared to help greatly, as often times out of 80 active threads, only 60 were spinwaiting (as opposed to 79 in the non-cached test), but there were still plenty of times where at least 30 consecutive log entries showed 80 threads spinwaiting. And clearly as can be seen from the times below, still nowhere in the league of OldFetcher.

        [root@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655/

        [...]

        Aborting with 80 hung threads.
        Fetcher: done

        real 22m5.420s
        user 2m39.407s
        sys 0m8.192s

        Show
        Roger Dunk added a comment - For the following tests I've used the same segment containing 5000 URLs. I cleaned the named cache before the first two tests. [root@server1 trunk] # time bin/nutch org.apache.nutch.fetcher.OldFetcher newcrawl/segments/20090402130655/ real 3m38.084s user 2m20.887s sys 0m7.470s [root@server1 trunk] # time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655/ [...] Fetcher: done real 53m44.800s user 2m20.070s sys 0m9.527s For this next test, I used the same segment but didn't clear the named cache from the previous test, so all resolvable hosts should still be cached. This appeared to help greatly, as often times out of 80 active threads, only 60 were spinwaiting (as opposed to 79 in the non-cached test), but there were still plenty of times where at least 30 consecutive log entries showed 80 threads spinwaiting. And clearly as can be seen from the times below, still nowhere in the league of OldFetcher. [root@server1 trunk] # time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655/ [...] Aborting with 80 hung threads. Fetcher: done real 22m5.420s user 2m39.407s sys 0m8.192s
        Hide
        Doğacan Güney added a comment - - edited

        I've committed nutch 0.9 fetcher as OldFetcher. So can you test with trunk and OldFetcher so that we can find out if this is related to new fetcher or is the side effect of some other change?

        Show
        Doğacan Güney added a comment - - edited I've committed nutch 0.9 fetcher as OldFetcher. So can you test with trunk and OldFetcher so that we can find out if this is related to new fetcher or is the side effect of some other change?
        Hide
        Doğacan Güney added a comment -

        OK, there is clearly a problem with the new fetcher.

        First, let's make sure that there is indeed a problem with the new fetcher and this is not the side effect of some other code we introduced between 0.9 and 1.0. So I suggest that we re-commit old fetcher back into trunk and do a side-by-side comparison to make sure that the problem is with the new fetcher.

        If it is with the new fetcher, then we may try to salvage Todd's work (I remember that he said that his fetcher was faster, right?).

        Show
        Doğacan Güney added a comment - OK, there is clearly a problem with the new fetcher. First, let's make sure that there is indeed a problem with the new fetcher and this is not the side effect of some other code we introduced between 0.9 and 1.0. So I suggest that we re-commit old fetcher back into trunk and do a side-by-side comparison to make sure that the problem is with the new fetcher. If it is with the new fetcher, then we may try to salvage Todd's work (I remember that he said that his fetcher was faster, right?).

          People

          • Assignee:
            Doğacan Güney
            Reporter:
            Roger Dunk
          • Votes:
            2 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development