Details
- New Feature
- Status: Closed
- Major
- Resolution: Fixed
- nutchgora
- None
- None
- Patch Available
Description
Having a separate GORA table for storing information about hosts (and domains?) would be very useful for:
- customising the fetching behaviour on a per-host basis, e.g. number of threads, min time between threads, etc.
- storing stats
- keeping metadata and possibly propagating it to the webpages
- keeping a copy of the robots.txt and possibly using it later to filter the webtable
- storing sitemap files and updating the webtable accordingly
I'll try to come up with a GORA schema for such a host table, but any comments are of course already welcome.
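To make the list above more concrete, here is a rough sketch of the kind of per-host record such a table might hold. It is only an illustration: the class name `HostRecord`, every field and method name, and the default values are assumptions made for this example, not an existing Nutch or GORA schema, and a real GORA-backed version would presumably be generated from a schema definition rather than written by hand.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-host record. All names and defaults here are illustrative
// assumptions and are not taken from any Nutch or GORA schema.
class HostRecord {
    private final String host;                 // row key, e.g. "example.org"
    private int fetchThreads = 1;              // per-host fetch setting (assumed default)
    private long minDelayMs = 1000L;           // per-host delay setting (assumed default)
    private final Map<String, Long> stats = new HashMap<>();      // host-level counters
    private final Map<String, String> metadata = new HashMap<>(); // free-form metadata
    private String robotsTxt;                  // cached robots.txt content, null if none
    private final List<String> sitemaps = new ArrayList<>();      // known sitemap URLs

    HostRecord(String host) {
        if (host == null || host.isEmpty()) {
            throw new IllegalArgumentException("host must not be empty");
        }
        this.host = host;
    }

    String getHost() { return host; }

    int getFetchThreads() { return fetchThreads; }
    void setFetchThreads(int fetchThreads) { this.fetchThreads = fetchThreads; }

    long getMinDelayMs() { return minDelayMs; }
    void setMinDelayMs(long minDelayMs) { this.minDelayMs = minDelayMs; }

    // Adds delta to the named counter, creating the counter if it is missing.
    void incrementStat(String name, long delta) { stats.merge(name, delta, Long::sum); }

    // Returns the counter value, or 0 if the counter was never set.
    long getStat(String name) { return stats.getOrDefault(name, 0L); }

    void putMetadata(String key, String value) { metadata.put(key, value); }
    String getMetadata(String key) { return metadata.get(key); }

    String getRobotsTxt() { return robotsTxt; }
    void setRobotsTxt(String robotsTxt) { this.robotsTxt = robotsTxt; }

    void addSitemap(String url) { sitemaps.add(url); }
    List<String> getSitemaps() { return new ArrayList<>(sitemaps); }

    public static void main(String[] args) {
        HostRecord record = new HostRecord("example.org");
        record.setFetchThreads(2);
        record.incrementStat("fetched", 10);
        record.setRobotsTxt("User-agent: *\nDisallow: /private/");
        record.addSitemap("http://example.org/sitemap.xml");
        System.out.println(record.getHost() + " threads=" + record.getFetchThreads()
                + " fetched=" + record.getStat("fetched"));
    }
}
```

Keying the record by host name keeps the per-host settings, stats, robots.txt copy and sitemaps in one row, which is the main point of having a table separate from the webtable.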
Attachments
Issue Links
- depends upon: GORA-105 DataStoreFactory does not properly support multiple stores (Closed)
- relates to: NUTCH-628 Host database to keep track of host-level information (Closed)
- supercedes: NUTCH-1290 crawlId not supported by all Tools (Closed)