  1. Nutch
  2. NUTCH-2036

Adding some continuous crawl goodies to the crawl script

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.10
    • Fix Version/s: 1.11
    • Component/s: bin, tool, util
    • Labels:
    • Patch Info:
      Patch Available
    • Flags:
      Patch

      Description

      Nutch does not support continuous crawling out of the box. It can be approximated using cron, and is sometimes irrelevant given the size of the crawl, but it is a nice feature to have.

      This patch adds a new option to the bin/crawl script (-w|--wait) that sets how long to wait when the generator returns 0, that is, when no URLs are scheduled for fetching.

      The new parameter takes a value in NUMBER[SUFFIX] format. If no suffix is provided, the amount of time is assumed to be in seconds. Valid suffixes are:

      s - seconds
      m - minutes
      h - hours
      d - days

      If a value of -1 is passed to the parameter, or the parameter is not used at all, the default behaviour of exiting the script applies.
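
      As an illustration of the NUMBER[SUFFIX] format described above, here is a minimal POSIX shell sketch that converts such a value to seconds. It is not taken from the attached patches: the function name wait_to_seconds and the choice to print -1 for the "do not wait" case are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper (not from the NUTCH-2036 patch): convert a
# NUMBER[SUFFIX] wait value into seconds and print the result.
# Prints -1 when no waiting is requested (empty value or -1).
# Variables are global in POSIX sh; call it via $(...) to isolate them.
wait_to_seconds() {
  value="$1"

  # Empty or -1: keep the default behaviour (the caller exits, no wait).
  if [ -z "$value" ] || [ "$value" = "-1" ]; then
    echo "-1"
    return 0
  fi

  # Split off a trailing s/m/h/d suffix; with no suffix, assume seconds.
  case "$value" in
    *[smhd]) suffix="${value#"${value%?}"}"; number="${value%?}" ;;
    *)       suffix="s"; number="$value" ;;
  esac

  # The numeric part must be a non-empty run of digits.
  case "$number" in
    ""|*[!0-9]*) echo "invalid wait value: $value" >&2; return 1 ;;
  esac

  # Strip leading zeros so the shell does not read the number as octal.
  while [ "${#number}" -gt 1 ] && [ "${number#0}" != "$number" ]; do
    number="${number#0}"
  done

  case "$suffix" in
    s) echo "$number" ;;
    m) echo $(( number * 60 )) ;;
    h) echo $(( number * 3600 )) ;;
    d) echo $(( number * 86400 )) ;;
  esac
}
```

      A caller could then do something like seconds=$(wait_to_seconds "$WAIT") and, when the generator reports no URLs, either exit (seconds is -1) or sleep "$seconds" before trying again. The variable WAIT here is likewise only illustrative.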

        Attachments

        1. NUTCH-2036-v2.patch
          7 kB
          Sebastian Nagel
        2. NUTCH-2036.patch
          7 kB
          Jorge Luis Betancourt Gonzalez

            People

            • Assignee:
              Unassigned
              Reporter:
              jorgelbg Jorge Luis Betancourt Gonzalez
            • Votes:
              0
              Watchers:
              6