Nutch / NUTCH-601

Recrawling on existing crawl directory using force option

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

    Description

      This patch adds a '-force' option to the 'bin/nutch crawl' command. With this option, an existing crawl directory can be recrawled. The first command below creates the crawl directory and the second recrawls into it:

      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
      

      The option can also be given on the first crawl, when the crawl directory does not exist yet:

      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
      

      If the crawl directory already exists and the -force option is omitted, the command fails with an error message that includes a hint to add -force:

      # bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
      Exception in thread "main" java.lang.RuntimeException: crawl already exists. Add -force option to recrawl.
             at org.apache.nutch.crawl.Crawl.main(Crawl.java:89)
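
      The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from the attached patch: the class name ForceCheckSketch, its helper methods, and the fallback directory name are hypothetical, and it uses java.nio.file on the local filesystem where Nutch itself may go through a different filesystem API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of the -force check; not the actual Nutch Crawl class.
public class ForceCheckSketch {

    /** Returns true if "-force" appears anywhere in the argument list. */
    static boolean hasForce(String[] args) {
        for (String arg : args) {
            if ("-force".equals(arg)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the value following "-dir", or a fallback name (an assumption). */
    static String crawlDir(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-dir".equals(args[i])) {
                return args[i + 1];
            }
        }
        return "crawl"; // hypothetical fallback, chosen only for this sketch
    }

    /** Refuses to reuse an existing crawl directory unless force is set. */
    static void checkCrawlDir(Path dir, boolean force) {
        if (Files.exists(dir) && !force) {
            throw new RuntimeException(
                dir + " already exists. Add -force option to recrawl.");
        }
    }

    public static void main(String[] args) {
        String dir = crawlDir(args);
        checkCrawlDir(Paths.get(dir), hasForce(args));
        System.out.println("Crawl may proceed in: " + dir);
    }
}
```

      Under these assumptions, running the sketch with "urls -dir crawl" against an existing "crawl" directory raises the RuntimeException, while adding "-force" (or pointing -dir at a directory that does not exist yet) lets it continue.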
      

      Attachments

        1. NUTCH-601v1.0.patch
          1 kB
          Susam Pal
        2. NUTCH-601v0.3.patch
          2 kB
          Susam Pal
        3. NUTCH-601v0.2.patch
          2 kB
          Susam Pal
        4. NUTCH-601v0.1.patch
          3 kB
          Susam Pal

          People

            Assignee: Andrzej Bialecki (ab)
            Reporter: Susam Pal (susam)
            Votes: 2
            Watchers: 1
