NUTCH-1067

Configure minimum throughput for fetcher

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: fetcher
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Large fetches can contain a lot of URLs for the same domain. These can be very slow to crawl due to politeness from robots.txt, e.g. 10s per URL. If all other URLs have been fetched, these queues can stall the entire fetcher; 60 URLs can then take 10 minutes or even more. This can usually be dealt with using the time bomb, but the time bomb value is hard to determine.

      This patch adds a fetcher.throughput.threshold setting: the minimum number of pages per second below which the fetcher gives up. It doesn't use the global average (total pages / running time) but records the actual number of pages processed in the previous second. This value is compared with the configured threshold.

      Besides the check, the fetcher's status is also updated with the actual number of pages per second and bytes per second.
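      The per-second measurement described above (actual pages fetched in the previous second, rather than the global average of pages divided by running time) can be sketched as follows. This is a minimal illustration; ThroughputMeter is an invented name, not a class from the patch:

```java
// Minimal sketch (hypothetical class, not from the patch): report the
// pages fetched during the previous second by diffing the running total.
public class ThroughputMeter {
    private long lastTotal = 0;

    /**
     * Call once per second with the running total of fetched pages;
     * returns the number of pages fetched during the previous second.
     * This is the value compared against the configured threshold,
     * not total pages / elapsed time.
     */
    public long pagesLastSecond(long totalPages) {
        long delta = totalPages - lastTotal;
        lastTotal = totalPages;
        return delta;
    }
}
```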

      1. NUTCH-1067-1.4-1.patch
        4 kB
        Markus Jelsma
      2. NUTCH-1067-1.4-2.patch
        5 kB
        Markus Jelsma
      3. NUTCH-1067-1.4-3.patch
        7 kB
        Markus Jelsma
      4. NUTCH-1067-1.4-4.patch
        7 kB
        Markus Jelsma
      5. NUTCH-1045-1.4-v2.patch
        144 kB
        Markus Jelsma

        Activity

        behnam nikbakht added a comment -

        I can't understand why the threshold checker is disabled:
        throughputThresholdPages = -1;
        That causes this factor to be enforced only once.
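        For readers puzzling over the same question: setting the threshold to -1 after it fires is a common disable-after-first-trigger pattern, so the queue-clearing branch runs at most once per job. A rough sketch with invented names, not the actual patch code:

```java
// Hypothetical sketch of the pattern being discussed: once the check
// fires, the threshold is set to -1 so the check can never fire again.
public class DisableOnce {
    static int threshold = 1;         // configured minimum pages/sec
    static int retries = 0;
    static final int MAX_RETRIES = 2; // illustrative retries setting

    /**
     * Returns true (clear the queues) only the first time throughput
     * stays below the threshold for more than MAX_RETRIES checks;
     * afterwards threshold is -1 and the branch is dead.
     */
    static boolean check(int pagesLastSecond) {
        if (threshold != -1 && pagesLastSecond < threshold) {
            if (++retries > MAX_RETRIES) {
                threshold = -1;       // disable the checker for good
                return true;          // empty the queues exactly once
            }
        }
        return false;
    }
}
```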

        Markus Jelsma added a comment -

        Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220

        Hudson added a comment -

        Integrated in Nutch-branch-1.4 #11 (See https://builds.apache.org/job/Nutch-branch-1.4/11/)
        NUTCH-1067 Nutch-default configuration directives missing

        markus : http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1172585
        Files :

        • /nutch/branches/branch-1.4/conf/nutch-default.xml
        Markus Jelsma added a comment -

        Fixed again.

        Markus Jelsma added a comment -

        Committed fixes for NUTCH-1102 (the originating issue) for 1.4 in rev. 1170557. Everything works again with a clean checkout. My apologies for letting myself be fooled by not doing an ant clean more regularly.

        Thanks Julien for being so prompt!

        Markus Jelsma added a comment -

        Patch to fix the issues reported by Julien plus the issues found in TestFetcher.

        Markus Jelsma added a comment -
        • Crawl and Benchmark both read the value and pass it to the fetcher. It's safe to remove the argument there.
        • There's also a problem with TestFetcher.testFetch(). This, it seems, relies on the parse to work, as it passes TRUE to the fetcher but doesn't set the directive. I'll override the configuration directive to TRUE there.
        • TestFetcher.testAgentNameCheck() for some reason sets the conf directive to FALSE but passes TRUE as an argument.

        All source code and tests now compile again. The fetcher tests also pass without errors. I'll attach a patch now.
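        The compile errors and the fix can be illustrated with a toy version of the signature change. The Map-based conf and the class name below are stand-ins, not the real Hadoop/Nutch API: the boolean parsing argument was dropped from fetch(), and the flag is now read from the configuration inside the method.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration (not the actual Nutch code): callers such as Crawl
// and Benchmark used to pass the parsing flag as a third argument;
// after the fix, fetch() reads it from the configuration instead.
public class FetchSignature {
    // Stand-in for Hadoop's Configuration object.
    static Map<String, String> conf = new HashMap<>();

    static boolean isParsing(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault("fetcher.parse", "false"));
    }

    // New two-argument form; the old (segment, threads, parsing) form is
    // what Crawl.java:136 and Benchmark.java:234 were still calling.
    static String fetch(String segment, int threads) {
        boolean parsing = isParsing(conf);
        return segment + " threads=" + threads + " parsing=" + parsing;
    }
}
```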

        Markus Jelsma added a comment -

        ^#$%@ I'm on it!

        Julien Nioche added a comment -

        At revision 1170548.

        ant clean then ant =>

        compile-core:
        [javac] /data/nutch-1.4/build.xml:96: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
        [javac] Compiling 172 source files to /data/nutch-1.4/build/classes
        [javac] /data/nutch-1.4/src/java/org/apache/nutch/crawl/Crawl.java:136: fetch(org.apache.hadoop.fs.Path,int) in org.apache.nutch.fetcher.Fetcher cannot be applied to (org.apache.hadoop.fs.Path,int,boolean)
        [javac] fetcher.fetch(segs[0], threads, org.apache.nutch.fetcher.Fetcher.isParsing(getConf())); // fetch it
        [javac] ^
        [javac] /data/nutch-1.4/src/java/org/apache/nutch/tools/Benchmark.java:234: fetch(org.apache.hadoop.fs.Path,int) in org.apache.nutch.fetcher.Fetcher cannot be applied to (org.apache.hadoop.fs.Path,int,boolean)
        [javac] fetcher.fetch(segs[0], threads, org.apache.nutch.fetcher.Fetcher.isParsing(getConf())); // fetch it
        [javac] ^
        [javac] Note: Some input files use or override a deprecated API.
        [javac] Note: Recompile with -Xlint:deprecation for details.
        [javac] Note: Some input files use unchecked or unsafe operations.
        [javac] Note: Recompile with -Xlint:unchecked for details.
        [javac] 2 errors

        BUILD FAILED

        Markus Jelsma added a comment -

        Committed for 1.4 in rev. 1170526.

        Markus Jelsma added a comment -

        Thanks Julien. Depending on your new answer in NUTCH-1102, I'll put these in today.

        Julien Nioche added a comment -

        see comments on NUTCH-1102
        Patch for 1.4 looks fine
        Thanks

        Markus Jelsma added a comment -

        Julien,

        If there are no objections, I'd like to commit this issue together with NUTCH-1102 soon.

        Cheers

        Markus Jelsma added a comment -

        Agreed, a patch with the required modifications. Also moved the hasMore stuff back to its original state; it was added before I used isAlive. Conf also updated to reflect the changes.

        Julien Nioche added a comment -
        • this is going to be difficult because I measure the actual #pages/sec, that's always an integer; thoughts?

        OK, so this can't work when the number of pages per second is < 1, which is an acceptable limitation as long as it is clearly stated in the comments for the parameter.

        • the hasMore() method was added because I need to check outside the class if the feeder has more items

        I can't see in the patch where this call is made. Is it in some custom code of yours?

        Markus Jelsma added a comment -

        Thanks for your comments.

        • modified the naming to use pages in conf and code, as per your comment;
        • this is going to be difficult because I measure the actual #pages/sec, that's always an integer; thoughts?
        • the hasMore() method was added because I need to check outside the class if the feeder has more items; it was internal to the method QueueFeeder.run(). I could make it a public attribute but chose a getter instead.
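        The getter-over-public-attribute choice described in the last bullet might look roughly like this (a sketch; the real QueueFeeder internals differ):

```java
// Sketch of exposing the feeder's internal "has more items" state via a
// getter rather than a public field. The names follow the comment above,
// but the body is invented for illustration.
public class QueueFeeder extends Thread {
    private volatile boolean hasMore = true;

    @Override
    public void run() {
        // ... feed fetch items into the queues ...
        hasMore = false;   // no more items: flip the flag before exiting
    }

    /** Lets the fetcher loop check, from outside the class, whether
     *  the feeder still has items coming. */
    public boolean hasMore() {
        return hasMore;
    }
}
```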
        Julien Nioche added a comment -

        Looks good, but two comments:

        • fetcher.throughput.threshold -> rename to 'fetcher.throughput.threshold.pages'? This way we could also introduce a threshold based on bytes later.
        • the threshold should not be an integer but a float -> for small crawls we could have less than one page per second but still want to use the threshold to prevent things from getting worse.

        Out of curiosity, why did you put hasMore() as a separate method?

        Thanks

        Ju

        Markus Jelsma added a comment -

        Assigned to Julien for review. Cheers!

        Julien Nioche added a comment -

        Markus - please assign this issue to me: that will serve as a reminder that I need to review it.

        Thanks

        Markus Jelsma added a comment -

        The impact of this patch is too great to be committed without review, but I'd like to get it in some day.

        Markus Jelsma added a comment -

        Another patch. It cleans the queue the same way as the time bomb and reports in a similar fashion if it kicks in. Moved the cleaning code to a new method that's shared by the time bomb and this one.

        It has two configuration options:

        • fetcher.throughput.threshold to enable/disable the check and set the minimum #pages/second
        • fetcher.throughput.threshold.retries to set the number of times the throughput is allowed to drop below the threshold, to prevent a few accidental pauses from immediately killing the queue

        It's been tested on a production cluster and seems to work nicely: no more long, dreadful delays when finalizing a fetch.

        Please comment on usefulness and implementation.
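        For illustration only, the two directives might be set in nutch-site.xml like this. The values are made up, and the first property name follows this comment as written; it was later renamed per the review, so check nutch-default.xml in the 1.4 branch for the authoritative names and defaults:

```xml
<!-- Illustrative values only; property names as used in this comment. -->
<property>
  <name>fetcher.throughput.threshold</name>
  <value>1</value>
  <description>Minimum number of pages fetched per second before the
  fetcher gives up; -1 disables the check.</description>
</property>
<property>
  <name>fetcher.throughput.threshold.retries</name>
  <value>5</value>
  <description>Number of times the throughput may drop below the
  threshold before the queues are emptied.</description>
</property>
```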

        Markus Jelsma added a comment -

        New patch that enables the check only when the feeder has finished and allows for a configurable number of times the throughput may drop below the threshold.

        There can be a significant number of exceptions due to the return statement used. Probably clearer to clear the queues first.

        Markus Jelsma added a comment -

        There's a problem with the current patch: it usually reports 0 p/s at the start of the thread. At this stage numThreads downloads are in progress simultaneously. It is also possible to report 0 p/s during the fetch. The patch must be modified so as not to quit on these conditions.

        It needs to:

        • interact with the feeder
        • have an additional threshold for the number of times that 0 p/s is reported

        ...and possibly more.

        Markus Jelsma added a comment -

        Patch for 1.4. It has not been thoroughly tested yet.


  People

  • Assignee: Markus Jelsma
  • Reporter: Markus Jelsma
  • Votes: 0
  • Watchers: 0
