Nutch / NUTCH-2976

SitemapProcessor: verify sitemap values added from sitemap to CrawlDB (priority, modification time and change frequency)

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.19
    • Fix Version/s: 1.21
    • Component/s: sitemap
    • Labels: None

    Description

      SitemapProcessor writes values from the sitemap into the CrawlDB without any verification or plausibility check:

      • priority - used as CrawlDatum score
      • modification time
      • change frequency - used as fetch interval

      Since these values in the sitemap cannot be trusted, the processor should make sure that they are in acceptable ranges:

      • priority > 0.0 (with a score of 0.0, a URL would never be fetched)
      • modification time: not in the future
      • change frequency / fetch interval within [db.fetch.schedule.adaptive.min_interval, db.fetch.schedule.max_interval]
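
The three checks listed above could be sketched roughly as follows. This is a minimal illustration, not the actual SitemapProcessor code: the class `SitemapValueSanitizer` and its method names are hypothetical, the fallback choices (replace an invalid priority with a default score, clamp a future modification time to the current time, clamp the interval into the configured range) are assumptions, and the bounds passed in `main` are placeholder values standing in for whatever the configuration supplies.

```java
import java.time.Instant;

/** Hypothetical helper (not part of Nutch): bounds-checks values read from a sitemap. */
final class SitemapValueSanitizer {

  private SitemapValueSanitizer() {}

  /** Returns the priority if it is a finite value greater than 0.0, else the fallback score. */
  static double sanitizePriority(double priority, double fallbackScore) {
    if (Double.isNaN(priority) || Double.isInfinite(priority) || priority <= 0.0) {
      return fallbackScore;
    }
    return priority;
  }

  /** Clamps a modification time that lies in the future to "now" (an assumed policy). */
  static long sanitizeModifiedTime(long modifiedMs, long nowMs) {
    return Math.min(modifiedMs, nowMs);
  }

  /** Clamps a fetch interval (seconds) into the closed range [minSec, maxSec]. */
  static int clampFetchInterval(int intervalSec, int minSec, int maxSec) {
    if (minSec > maxSec) {
      throw new IllegalArgumentException("minSec must not exceed maxSec");
    }
    return Math.max(minSec, Math.min(maxSec, intervalSec));
  }

  public static void main(String[] args) {
    long now = Instant.now().toEpochMilli();
    int minSec = 60;        // placeholder lower bound for the fetch interval
    int maxSec = 7_776_000; // placeholder upper bound (90 days)

    System.out.println(sanitizePriority(0.0, 1.0));
    System.out.println(sanitizeModifiedTime(now + 86_400_000L, now) == now);
    System.out.println(clampFetchInterval(10, minSec, maxSec));
  }
}
```

Whether an out-of-range value should be clamped, replaced by a default, or dropped entirely is a design decision the issue leaves open; the sketch only shows where each check would sit.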

      See also NUTCH-2975

          People

            Assignee: Unassigned
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 1
