  1. Nutch
  2. NUTCH-2036

Adding some continuous crawl goodies to the crawl script

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.10
    • Fix Version/s: 1.11
    • Component/s: bin, tool, util
    • Patch Available
    • Patch

    Description

      Nutch does not support continuous crawling out of the box. It is doable with cron, and is sometimes irrelevant given the size of the crawl, but it is still a nice feature to have.

      This patch adds a new option to the bin/crawl script (-w|--wait) that sets how long to wait when the generator returns 0, that is, when no URLs are scheduled for fetching.

      The new parameter takes the format NUMBER[SUFFIX]. If no suffix is provided, the value is assumed to be in seconds. Valid suffixes are:

      s - seconds
      m - minutes
      h - hours
      d - days
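
The NUMBER[SUFFIX] format above can be illustrated with a small shell sketch. This is not the code from the attached patches: the function name `wait_to_seconds` is hypothetical, and the sketch assumes only non-negative integers with an optional s, m, h or d suffix, rejecting everything else.

```shell
# Hypothetical helper (not taken from the patch): convert NUMBER[SUFFIX]
# into a number of seconds. A missing suffix is treated as seconds.
wait_to_seconds() {
  local value number suffix mult
  value="$1"

  # Split off a trailing s/m/h/d suffix if there is one.
  case "$value" in
    *[smhd]) suffix="${value#"${value%?}"}"; number="${value%?}" ;;
    *)       suffix="s"; number="$value" ;;
  esac

  # Accept only a non-empty string of digits as the number part.
  case "$number" in
    ""|*[!0-9]*) echo "invalid wait value: $value" >&2; return 1 ;;
  esac

  # Strip leading zeros so the shell does not read the number as octal.
  while [ "${number#0}" != "$number" ] && [ "${#number}" -gt 1 ]; do
    number="${number#0}"
  done

  case "$suffix" in
    s) mult=1 ;;
    m) mult=60 ;;
    h) mult=3600 ;;
    d) mult=86400 ;;
  esac

  echo $(( number * mult ))
}

wait_to_seconds 5m   # prints 300
```

Normalizing to seconds up front means the rest of a script only has to deal with a single unit.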

      If -1 is passed to the parameter, or the parameter is not used at all, the default behaviour of exiting the script applies.
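
The wait-or-exit decision described above might look roughly like the following. This is a sketch under assumptions, not the patch itself: `handle_empty_generate` is a hypothetical name, and it assumes the wait time has already been converted to whole seconds, with -1 standing for "no wait configured".

```shell
# Hypothetical sketch (not taken from the patch): decide what to do when
# the generator has no URLs scheduled for fetching.
# $1: wait time in seconds; -1 (the default here) means "exit instead".
handle_empty_generate() {
  local wait_seconds
  wait_seconds="${1:--1}"

  if [ "$wait_seconds" -lt 0 ]; then
    # Default behaviour: tell the caller to stop the crawl loop.
    echo "exit"
  else
    # Pause, then tell the caller to try generating again.
    sleep "$wait_seconds"
    echo "retry"
  fi
}
```

A caller could branch on the printed word, for example breaking out of its loop on `exit` and continuing with the next iteration on `retry`.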

      Attachments

        1. NUTCH-2036-v2.patch
          7 kB
          Sebastian Nagel
        2. NUTCH-2036.patch
          7 kB
          Jorge Luis Betancourt Gonzalez

          People

            Assignee: Unassigned
            Reporter: Jorge Luis Betancourt Gonzalez (jorgelbg)
            Votes: 0
            Watchers: 6