[NUTCH-158] Process Sitemap data in text, RSS or XML format as well as OAI-PMH - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Won't Fix
Affects Version/s: 0.8
Fix Version/s: None
Component/s: fetcher
Labels: None

Description

Add support to the fetcher for discovering sitemap files, downloading them, and processing their URLs into the webdb.

Perhaps define a robots.txt directive that points to a site's sitemap, with the sitemap itself in a standard RSS, XML, or plain-text format (one URL per line), and have the fetcher process that.
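A minimal sketch of the idea above, not Nutch code: it assumes the directive is spelled `Sitemap:` (the spelling the sitemaps.org protocol later adopted) and that a text sitemap lists one absolute http(s) URL per line. The function names `sitemap_urls_from_robots` and `urls_from_text_sitemap` are hypothetical.

```python
from urllib.parse import urlparse


def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Return the target of every 'Sitemap:' line in a robots.txt body."""
    found = []
    for raw in robots_txt.splitlines():
        line = raw.strip()
        # Skip blank lines and full-line comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only; the value is a URL and has colons too.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def urls_from_text_sitemap(body: str) -> list[str]:
    """Parse a plain-text sitemap, keeping only absolute http(s) URLs."""
    urls = []
    for raw in body.splitlines():
        candidate = raw.strip()
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls
```

Fetching the files and writing the resulting URLs into the webdb are left out on purpose, since they depend on fetcher internals this issue does not describe.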

I would love to see someone stomp out proprietary sitemap features and stop things from being as Google-specific as they are today.

RSS/Atom format (standard)
XML meta description
OAI-PMH meta description (http://www.openarchives.org/OAI/openarchivesprotocol.html)
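For the RSS/Atom and XML cases listed above, URL extraction could look roughly like the sketch below. The issue does not define the XML layout, so the `<url><loc>` element names are an assumption borrowed from the sitemaps.org schema; `urls_from_xml_feed` is a hypothetical helper, and OAI-PMH responses are not handled.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def urls_from_xml_feed(xml_text: str) -> list[str]:
    """Collect page URLs from an RSS, Atom, or <urlset>-style XML document."""
    root = ET.fromstring(xml_text)
    urls = []
    for entry in root.iter():
        # Only look inside per-page containers, so the channel-level
        # <link> of an RSS feed is not mistaken for a page URL.
        if local_name(entry.tag) not in ("item", "entry", "url"):
            continue
        for child in entry:
            if local_name(child.tag) not in ("link", "loc"):
                continue
            # Atom carries the URL in the href attribute; RSS <link> and
            # sitemap <loc> carry it as element text.
            value = child.get("href") or (child.text or "").strip()
            if value:
                urls.append(value)
    return urls
```

Matching on local names rather than full namespaces keeps the sketch short, at the cost of accepting look-alike elements from unrelated vocabularies.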

Perhaps even a "pre-crawler" that scours sites for these files and injects their URLs into the webdb, helping to build the link map so that you could index just the topN.

People

Assignee: Unassigned

Reporter: byron miller

Votes: 1

Watchers: 1

Dates

Created: 30/Dec/05 04:58

Updated: 01/Apr/11 14:56

Resolved: 01/Apr/11 14:56