NUTCH-601

Recrawling on existing crawl directory using force option

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

      Description

      Added a '-force' option to the 'bin/nutch crawl' command line. With this option, the command can be run again on an existing crawl directory, so one can crawl and then recrawl in the following manner:

      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
      

      This option can be used for the first crawl too:

      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
      

      If one tries to crawl without the -force option when the crawl directory already exists, the command fails, and the error message includes a hint to add the option:

      # bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
      Exception in thread "main" java.lang.RuntimeException: crawl already
      exists. Add -force option to recrawl.
             at org.apache.nutch.crawl.Crawl.main(Crawl.java:89)
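
The behaviour described above can be sketched roughly as follows. This is a minimal illustration only, not the code from the attached patches: the class name ForceCheckSketch and the helper methods checkCrawlDir and hasForce are hypothetical, and the sketch checks a local filesystem path with java.nio, whereas the actual Crawl class may check the directory differently.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of the -force check; not the actual Nutch patch.
class ForceCheckSketch {

    /** Returns true if "-force" appears anywhere among the arguments. */
    static boolean hasForce(String[] args) {
        for (String arg : args) {
            if ("-force".equals(arg)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Refuses to continue when the crawl directory already exists and
     * -force was not given; otherwise returns normally.
     */
    static void checkCrawlDir(Path dir, boolean force) {
        if (Files.exists(dir) && !force) {
            throw new RuntimeException(
                dir + " already exists. Add -force option to recrawl.");
        }
    }

    public static void main(String[] args) {
        // Assumed default directory name, overridden by "-dir <path>".
        Path dir = Paths.get("crawl");
        for (int i = 0; i < args.length - 1; i++) {
            if ("-dir".equals(args[i])) {
                dir = Paths.get(args[i + 1]);
            }
        }
        checkCrawlDir(dir, hasForce(args));
        System.out.println("Proceeding with crawl in " + dir);
    }
}
```

With this arrangement a first crawl succeeds with or without -force, because the directory does not exist yet, while a later run on the same directory only proceeds when -force is present.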
      

        Attachments

        1. NUTCH-601v0.1.patch
          3 kB
          Susam Pal
        2. NUTCH-601v0.2.patch
          2 kB
          Susam Pal
        3. NUTCH-601v0.3.patch
          2 kB
          Susam Pal
        4. NUTCH-601v1.0.patch
          1 kB
          Susam Pal

          Activity

            People

            • Assignee: Andrzej Bialecki (ab)
            • Reporter: Susam Pal (susam)
            • Votes: 2
            • Watchers: 1
