  Nutch / NUTCH-158

Process sitemap data in text, RSS, or XML format, as well as OAI-PMH

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels: None

    Description

      Add support to the fetcher to look for sitemap files, download them, and process them into the webdb.

      Perhaps create a robots.txt directive that establishes a standard way to publish sitemaps in RSS, XML, or text (one URL per line) format, and have the fetcher process it.
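
      A minimal sketch of what such a directive could look like to a parser, written in Python for brevity rather than as fetcher code. It assumes the directive is a `Sitemap:` line in robots.txt (the spelling later used by the sitemaps.org protocol) and that a text sitemap holds one absolute http(s) URL per line. The function names are hypothetical and are not part of Nutch.

```python
from urllib.parse import urlsplit


def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Collect the targets of `Sitemap:` lines (assumed directive name)."""
    found = []
    for raw in robots_txt.splitlines():
        # Drop trailing robots.txt comments, then split "name: value".
        line = raw.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def parse_text_sitemap(body: str) -> list[str]:
    """Return the http(s) URLs from a one-URL-per-line text sitemap."""
    urls = []
    for raw in body.splitlines():
        candidate = raw.strip()
        try:
            parts = urlsplit(candidate)
        except ValueError:
            continue  # malformed line, skip it
        # Keep only absolute web URLs; blank and junk lines are ignored.
        if parts.scheme in ("http", "https") and parts.netloc:
            urls.append(candidate)
    return urls
```

      RSS and XML sitemaps would need their own parsers; only the text format is sketched here.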

      I would love to see someone stomp out proprietary sitemap features and stop things from being as Google-specific as they are today.

      Perhaps even add a "pre-crawler" that scours sites for these sitemaps and injects their URLs into the webdb to help build your link map, so that you could even just index the topN.
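
      The "pre-crawler" idea could be sketched roughly as follows, in Python rather than as Nutch code. This is an illustration under assumptions: sitemaps are advertised through a `Sitemap:` line in robots.txt, only the one-URL-per-line text format is handled, and `fetch` is a hypothetical caller-supplied function that returns a page body or None on failure. The result is a de-duplicated URL list that a separate step could inject into the webdb.

```python
from typing import Callable, Iterable, Optional

# Hypothetical fetch hook: returns the response body, or None on failure.
Fetch = Callable[[str], Optional[str]]


def discover_seed_urls(hosts: Iterable[str], fetch: Fetch) -> list[str]:
    """Walk each host's robots.txt, follow its sitemaps, collect unique URLs."""
    seen = set()
    seeds = []
    for host in hosts:
        robots = fetch(f"http://{host}/robots.txt")
        if robots is None:
            continue  # host has no robots.txt or could not be reached
        for line in robots.splitlines():
            name, sep, value = line.partition(":")
            if not (sep and name.strip().lower() == "sitemap"):
                continue
            body = fetch(value.strip())
            if body is None:
                continue  # advertised sitemap is missing
            for url in (u.strip() for u in body.splitlines()):
                # Keep absolute web URLs, preserving first-seen order.
                if url.startswith(("http://", "https://")) and url not in seen:
                    seen.add(url)
                    seeds.append(url)
    return seeds
```

      Passing the fetch function in keeps the sketch independent of any particular HTTP client and makes it easy to try against canned pages.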

          People

            Assignee: Unassigned
            Reporter: byron miller (byronm)
            Votes: 1
            Watchers: 1
