NUTCH-415: Generate should mark selected records in crawlDB

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8, 0.8.1, 0.8.2, 0.9.0
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels: None

    Description

      In Nutch 0.7.x, if a user ran "generate" twice without an intervening "updatedb", each fetchlist would be different, because "generate" marked the selected entries as "being fetched" (by moving their fetch time one week forward).

      In Nutch 0.8 and later, crawldb is not modified at all during "generate". This means that two "generate" runs without an intervening "updatedb" create exactly the same fetchlist, which is undesirable.
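
The difference between the two behaviours can be illustrated with a minimal, self-contained sketch. This is not Nutch code: the class `GenerateMarkSketch`, the in-memory map standing in for crawldb, and the choice of "now + one week" as the new fetch time are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the generate step; crawldb is modelled as URL -> fetch time (ms).
public class GenerateMarkSketch {
    static final long ONE_WEEK_MS = 7L * 24 * 60 * 60 * 1000;

    // Selects up to topN URLs whose fetch time is due (<= now).
    // If markSelected is true, each selected entry's fetch time is moved
    // one week past 'now' (an assumption), so a later run skips it.
    static List<String> generate(Map<String, Long> crawlDb, long now,
                                 int topN, boolean markSelected) {
        List<String> fetchlist = new ArrayList<>();
        for (Map.Entry<String, Long> e : crawlDb.entrySet()) {
            if (fetchlist.size() >= topN) {
                break;
            }
            if (e.getValue() <= now) {
                fetchlist.add(e.getKey());
                if (markSelected) {
                    e.setValue(now + ONE_WEEK_MS);
                }
            }
        }
        return fetchlist;
    }

    // Builds a small crawldb in which every URL is already due.
    static Map<String, Long> sampleDb() {
        Map<String, Long> db = new TreeMap<>();
        for (String url : new String[] {"a", "b", "c", "d"}) {
            db.put(url, 0L);
        }
        return db;
    }

    public static void main(String[] args) {
        long now = 1000L;

        // 0.8-style: crawldb untouched, so both runs select the same entries.
        Map<String, Long> unmarked = sampleDb();
        System.out.println(generate(unmarked, now, 2, false)); // [a, b]
        System.out.println(generate(unmarked, now, 2, false)); // [a, b]

        // 0.7-style: selected entries are marked, so the second run moves on.
        Map<String, Long> marked = sampleDb();
        System.out.println(generate(marked, now, 2, true)); // [a, b]
        System.out.println(generate(marked, now, 2, true)); // [c, d]
    }
}
```

With marking enabled, the second call skips the entries already handed out, which is the 0.7.x behaviour described above.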

      I propose to re-implement this feature, using the same mechanism. The CrawlDB update would be performed simultaneously with the first mapred job in Generator, and the modified crawldb content would be produced together with an (unsorted) fetchlist in Selector, using a custom OutputFormat (patches to follow). Additionally, to ensure that the correct version of the modified crawldb is installed, I propose to add a locking mechanism that prevents two processes that modify crawldb from running simultaneously.
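
One way such a lock could work is a marker file created atomically inside the crawldb directory. The sketch below is only an illustration under stated assumptions: it uses the local filesystem through `java.nio.file` rather than Hadoop's filesystem API, and the class name `CrawlDbLockSketch` and the lock file name `.locked` are hypothetical, not taken from the patch.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical lock-file sketch; not the mechanism from the actual patch.
public class CrawlDbLockSketch {
    // Assumed lock file name, chosen for illustration only.
    static final String LOCK_NAME = ".locked";

    // Tries to take the lock. Files.createFile fails if the file already
    // exists, so only one caller can succeed until unlock() is called.
    static boolean tryLock(Path crawlDbDir) throws IOException {
        try {
            Files.createFile(crawlDbDir.resolve(LOCK_NAME));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    // Releases the lock; does nothing if the lock file is absent.
    static void unlock(Path crawlDbDir) throws IOException {
        Files.deleteIfExists(crawlDbDir.resolve(LOCK_NAME));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("crawldb");
        System.out.println(tryLock(dir)); // true: first writer gets the lock
        System.out.println(tryLock(dir)); // false: second writer is refused
        unlock(dir);
        System.out.println(tryLock(dir)); // true: lock is free again
        unlock(dir);
        Files.delete(dir);
    }
}
```

A process that modifies crawldb would call `tryLock` before starting and `unlock` in a `finally` block once the new crawldb is installed; a process that is refused would abort instead of writing.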

          People

            Assignee: Andrzej Bialecki
            Reporter: Andrzej Bialecki