NUTCH-601: Recrawling on existing crawl directory using force option

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

      Description

      Added a '-force' option to the 'bin/nutch crawl' command line. With this option, one can crawl and recrawl in the following manner:

      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
      

      This option can be used for the first crawl too:

      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
      

      If one tries to crawl without the -force option when the crawl directory already exists, the command fails with an error message that points to the fix:

      # bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
      Exception in thread "main" java.lang.RuntimeException: crawl already
      exists. Add -force option to recrawl.
             at org.apache.nutch.crawl.Crawl.main(Crawl.java:89)
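
      For reference, a minimal sketch of the kind of check this error implies in Crawl.java; the variable names and surrounding code here are assumptions for illustration, not the committed patch:

      // Hypothetical sketch: fail early when the output directory
      // already exists and -force was not given.
      Path dir = new Path("crawl");              // value of the -dir argument
      FileSystem fs = FileSystem.get(conf);      // conf: the job configuration
      boolean force = false;                     // true when -force is passed
      if (fs.exists(dir) && !force) {
        throw new RuntimeException(dir + " already exists. Add -force option to recrawl.");
      }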
      
      Attachments

      1. NUTCH-601v0.1.patch (3 kB, Susam Pal)
      2. NUTCH-601v0.2.patch (2 kB, Susam Pal)
      3. NUTCH-601v0.3.patch (2 kB, Susam Pal)
      4. NUTCH-601v1.0.patch (1 kB, Susam Pal)

        Activity

        Susam Pal added a comment -

        Patch attached.

        Andrzej Bialecki added a comment -

        Thank you for creating this issue. I think the old behavior should perhaps be removed altogether: it was based on the assumption that users do not want to recrawl the same pages by default. Our experience as a community shows that this is usually not the case, and the old behavior is confusing. So why not remove this artificial limitation altogether? Users who do want to keep each cycle in a separate directory can do so by specifying different output directories.

        Susam Pal added a comment -

        Attached a revised patch (NUTCH-601v0.2.patch), which removes the old behaviour completely as per Andrzej's comment. Since the new behaviour is now the only one, the -force option is no longer needed to switch between behaviours, so it has been removed.

        Andrzej Bialecki added a comment -

        I think the section that handles the presence of an old merged index is not needed - we want to re-create it anyway, and if something bad happens it's better not to leave the old index with the new dbs/segments. So I think it's best to remove the merged index the same way as you remove the partial indexes.

        Susam Pal added a comment -

        The 'if (newIndex != index)' condition is just a check of whether this is a new crawl directory being constructed or a recrawl on a previous crawl directory.

        If it is a new crawl directory, a few lines above this check there is another condition, 'if (!fs.exists(index))', which sets newIndex = index = '/index'. So, if newIndex and index are the same, we know that it is a new crawl directory and we need not delete the old 'index', because it would not be present.

        If it is a recrawl over a previous crawl directory, newIndex = '/new_index' and index = '/index'. Since they are different, this indicates a recrawl, and thus after '/new_index' is created, it quickly replaces the old '/index'.

        This seems fine to me.

        If you want to avoid the possibility of having new segments with an old index if something bad happens, then I should delete both 'index' and 'indexes' even before the generate call. But I didn't want to delete the old 'index' so early; I was trying to minimize the time for which the 'index' directory is unavailable. This would be helpful in case someone is running a recrawl on the same 'crawl' directory which the web GUI is using to serve search results.

        Please let me know what you feel about this.

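        A rough sketch of the control flow described above, with the paths and variable names taken from the comment; this is an illustration under those assumptions, not the patch itself:

        // Illustrative sketch of the flow described in the comment above.
        Path index = new Path(dir, "index");
        Path newIndex = index;                   // new crawl: merge in place
        if (fs.exists(index)) {
          // Recrawl: build the merged index in a temporary location first.
          newIndex = new Path(dir, "new_index");
        }
        // ... generate/fetch/index cycles, then merge into newIndex ...
        if (newIndex != index) {
          // Recrawl case: swap in the fresh index, keeping the window in
          // which no 'index' directory exists as short as possible.
          fs.delete(index);
          fs.rename(newIndex, index);
        }
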
        Susam Pal added a comment -

        Attached a revised patch (NUTCH-601v0.3.patch) that makes the code simpler and easier to read.

        Susam Pal added a comment -

        Attached another patch (NUTCH-601v1.0.patch) that always deletes the old merged index, as per Andrzej's suggestion.

        The v0.4 patch would leave the old merged index with the new segments in case something went wrong during the generation of the new index. Whether the index merger fails or succeeds, we would always have an 'index' directory, so after the completion of a recrawl a user might have to verify whether the 'index' directory holds the new merged index or the old one. This may be confusing.

        However, one advantage is that one can run a recrawl on the same crawl directory which the web GUI is using to serve users: that approach minimizes the duration for which the index directory is unavailable.

        The v1.0 patch always deletes the old partial indexes as well as the old merged index. Therefore, the old index never remains once index generation has begun. If the index merger fails, there will be no 'index' directory, which is a clear indication of index generation failure. This prevents the confusion discussed above.

        Please review both patches and accept whichever the community feels is better.

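        For contrast, a sketch of the v1.0 ordering described above; again illustrative rather than the patch itself:

        // Illustrative: v1.0 removes both the partial indexes and the old
        // merged index before building the new one, so a failed merge never
        // leaves a stale 'index' directory behind.
        Path indexes = new Path(dir, "indexes");
        Path index = new Path(dir, "index");
        if (fs.exists(indexes)) fs.delete(indexes);
        if (fs.exists(index)) fs.delete(index);
        // ... index, dedup, and merge; if the merger fails, 'index' is
        // absent, which unambiguously signals the failure ...
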
        Erol added a comment -

        Hello,

        I tested this patch and so far it works. I had a few problems, but Susam helped me.

        I have one question/request. As far as I can tell, it checks the existing crawl folder and recrawls all the sites again, but is it possible to filter them out, so that only the sites we set are crawled?

        Otherwise, I think it is a very useful patch.

        Susam Pal added a comment -

        It continues the recrawl using the existing crawl directory, generating new segments from the already existing crawl/crawldb directory. You can think of a recrawl as a crawl resumed after a break (taken for generating and merging indexes). In other words, if you did two crawls with 'depth' set to 5, you have effectively done a crawl of depth 10.

        I am not clear about what you mean by filtering. Isn't conf/crawl-urlfilter.txt enough for what you want to filter in the second crawl?

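        Filtering of that kind is normally expressed as regex rules in conf/crawl-urlfilter.txt; a small example, with an illustrative domain:

        # accept URLs within the one site we want to (re)crawl
        +^http://([a-z0-9]*\.)*example.org/
        # reject everything else
        -.
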
        Andrzej Bialecki added a comment -

        Patch v. 1.0 applied to trunk in rev. 637122. Thank you!

        Hudson added a comment -

        Integrated in Nutch-trunk #390 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/390/ )

          People

          • Assignee: Andrzej Bialecki
          • Reporter: Susam Pal
          • Votes: 2
          • Watchers: 1
