Nutch / NUTCH-1067

Configure minimum throughput for fetcher

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: fetcher
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Large fetches can contain many URLs for the same domain. These can be very slow to crawl because of politeness settings from robots.txt, e.g. 10s per URL. Once all other URLs have been fetched, these queues can stall the entire fetcher: 60 URLs can then take 10 minutes or more. This can usually be dealt with using the time bomb, but a good time bomb value is hard to determine.

      This patch adds a fetcher.throughput.threshold setting: the minimum number of pages per second the fetcher must sustain before it gives up. The check does not use the global average (total pages / running time). Instead, it records the actual number of pages processed in the previous second and compares that value with the configured threshold.
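
      The per-second check described above could look roughly like the following. This is a minimal sketch, not the code from the attached patches: the class name ThroughputMonitor, its methods, and the convention that a negative threshold disables the check are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper for illustration; not the class used in the patch. */
final class ThroughputMonitor {
    // Minimum pages per second, as read from fetcher.throughput.threshold.
    // Assumption of this sketch: a negative value disables the check.
    private final double threshold;
    // Total pages fetched so far; fetcher threads increment it concurrently.
    private final AtomicLong pages = new AtomicLong();
    // Value of the counter at the previous one-second check.
    private long pagesAtLastCheck = 0;

    ThroughputMonitor(double threshold) {
        this.threshold = threshold;
    }

    /** Called by a fetcher thread each time a page has been processed. */
    void pageFetched() {
        pages.incrementAndGet();
    }

    /**
     * Meant to be called once per second from a single reporting thread.
     * Returns true when the pages processed since the previous call fall
     * below the threshold, i.e. when the fetcher should give up.
     */
    boolean belowThreshold() {
        long total = pages.get();
        long lastSecond = total - pagesAtLastCheck;
        pagesAtLastCheck = total;
        return threshold >= 0 && lastSecond < threshold;
    }

    public static void main(String[] args) {
        ThroughputMonitor monitor = new ThroughputMonitor(5.0);
        int[] pagesPerSecond = {12, 8, 1}; // simulated fetch counts
        for (int n : pagesPerSecond) {
            for (int i = 0; i < n; i++) {
                monitor.pageFetched();
            }
            System.out.println(n + " pages/s, give up: " + monitor.belowThreshold());
        }
    }
}
```

      Because only the difference since the last check is compared, a fetch that ran fast for an hour and then stalls on a few slow queues trips the threshold right away, which a global average would hide.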

      Besides the check, the fetcher's status is also updated with the actual number of pages per second and bytes per second.
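
      As a rough illustration of such a status update, the per-second figures can be derived from two successive readings of the running totals. The class FetcherStatusLine and the exact wording of the status string are hypothetical; the patch may format its status differently.

```java
import java.util.Locale;

/** Hypothetical formatter for illustration; the real status text may differ. */
final class FetcherStatusLine {
    /**
     * Builds a status string from counter readings taken one second apart.
     * The "before" values are the totals at the previous reading.
     */
    static String format(long pagesBefore, long pagesNow,
                         long bytesBefore, long bytesNow) {
        long pagesPerSec = pagesNow - pagesBefore;
        long bytesPerSec = bytesNow - bytesBefore;
        return String.format(Locale.ROOT, "%d pages/s, %d bytes/s",
                pagesPerSec, bytesPerSec);
    }

    public static void main(String[] args) {
        // Totals one second ago: 100 pages, 30000 bytes.
        // Totals now: 120 pages, 50000 bytes.
        System.out.println(format(100, 120, 30000, 50000));
    }
}
```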

      1. NUTCH-1045-1.4-v2.patch
        144 kB
        Markus Jelsma
      2. NUTCH-1067-1.4-1.patch
        4 kB
        Markus Jelsma
      3. NUTCH-1067-1.4-2.patch
        5 kB
        Markus Jelsma
      4. NUTCH-1067-1.4-3.patch
        7 kB
        Markus Jelsma
      5. NUTCH-1067-1.4-4.patch
        7 kB
        Markus Jelsma

        Activity

        Markus Jelsma created issue -
        Markus Jelsma made changes -
        Field Original Value New Value
        Attachment NUTCH-1067-1.4-1.patch [ 12487441 ]
        Markus Jelsma made changes -
        Component/s fetcher [ 11591 ]
        Component/s generator [ 12311358 ]
        Markus Jelsma made changes -
        Attachment NUTCH-1067-1.4-2.patch [ 12488891 ]
        Markus Jelsma made changes -
        Attachment NUTCH-1067-1.4-3.patch [ 12489859 ]
        Markus Jelsma made changes -
        Assignee Markus Jelsma [ markus17 ] Julien Nioche [ jnioche ]
        Markus Jelsma made changes -
        Attachment NUTCH-1067-1.4-4.patch [ 12491202 ]
        Markus Jelsma made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Assignee Julien Nioche [ jnioche ] Markus Jelsma [ markus17 ]
        Fix Version/s 2.0 [ 12314893 ]
        Resolution Fixed [ 1 ]
        Julien Nioche made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Markus Jelsma made changes -
        Attachment NUTCH-1045-1.4-v2.patch [ 12494432 ]
        Markus Jelsma made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Markus Jelsma made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Markus Jelsma
            Reporter:
            Markus Jelsma
          • Votes:
            0
            Watchers:
            0
