Details
Type: Improvement
Status: Closed
Priority: Trivial
Resolution: Won't Fix
0.8
None
None
All Nutch versions
Description
Hi,
I have written a new plugin, urlfilter-db, based on the URLFilter interface.
Its purpose is to filter by domain: I would like to crawl the whole web but fetch only from certain domains.
The plugin uses a caching system (SwarmCache, which is easier to deploy than JCS) with a database on the back end.
for each url
    call filter(url)
end for

filter(url)
    get the domain name from the url
    look up the domain in the cache
    if it is in the cache, return the url
    if it is not in the cache, try the database
    if it is in the database, cache it and return the url
    otherwise return null
end filter
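
The lookup above could be sketched in Java roughly as follows. This is only an illustration under several assumptions, not the plugin's actual code: the class name DomainUrlFilter is made up, a LinkedHashMap in access order stands in for SwarmCache, the database is reduced to a Predicate<String> that answers "is this domain listed?", and the "domain" is taken to be the URL's host. The filter method follows the URLFilter convention of returning the URL when it is accepted and null when it is rejected.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

/** Sketch of a domain-based URL filter: check the cache first, then a backing store. */
public class DomainUrlFilter {
    // Small LRU cache of accepted domains (a stand-in for SwarmCache).
    private final Map<String, Boolean> cache;
    // Hypothetical stand-in for the database lookup (e.g. a JDBC query).
    private final Predicate<String> domainStore;
    // Counts backing-store lookups so the effect of the cache is visible.
    private int storeLookups = 0;

    public DomainUrlFilter(final int cacheSize, Predicate<String> domainStore) {
        this.domainStore = domainStore;
        // accessOrder = true makes the eldest entry the least recently used one.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the URL if its domain is allowed, otherwise null. */
    public String filter(String urlString) {
        String domain = hostOf(urlString);
        if (domain == null) {
            return null; // unparsable URL or no host: reject
        }
        if (cache.containsKey(domain)) {
            return urlString; // cache hit: accept without touching the store
        }
        storeLookups++;
        if (domainStore.test(domain)) {
            cache.put(domain, Boolean.TRUE); // remember the accepted domain
            return urlString;
        }
        return null; // not in cache, not in store: reject
    }

    /** Extracts the lower-cased host, or null if the URL cannot be parsed. */
    static String hostOf(String urlString) {
        try {
            String host = new URI(urlString).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public int getStoreLookups() {
        return storeLookups;
    }
}
```

As in the pseudocode, only accepted domains are cached, so a rejected domain goes back to the store on every call. Caching rejections as well would reduce database load, at the cost of delaying the effect of newly added domains.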
The plugin reads the cache size, JDBC driver, connection string, table name and domain field from nutch-site.xml.
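
To make the five settings concrete, here is a small Java sketch of how they might be collected and turned into a lookup query. Everything specific in it is an assumption: the property keys (urlfilter.db.*), the default cache size, and the class name DbFilterSettings are invented for illustration, and java.util.Properties stands in for whatever configuration object Nutch hands to the plugin.

```java
import java.util.Properties;

/** Sketch: the five plugin settings, read from hypothetical property keys. */
public final class DbFilterSettings {
    final int cacheSize;
    final String jdbcDriver;
    final String connectionString;
    final String table;
    final String domainField;

    private DbFilterSettings(int cacheSize, String jdbcDriver,
                             String connectionString, String table, String domainField) {
        this.cacheSize = cacheSize;
        this.jdbcDriver = jdbcDriver;
        this.connectionString = connectionString;
        this.table = table;
        this.domainField = domainField;
    }

    /** Builds the settings; the key names are illustrative, not the plugin's real ones. */
    static DbFilterSettings from(Properties p) {
        return new DbFilterSettings(
            Integer.parseInt(p.getProperty("urlfilter.db.cache.size", "1000")),
            required(p, "urlfilter.db.jdbc.driver"),
            required(p, "urlfilter.db.connection"),
            required(p, "urlfilter.db.table"),
            required(p, "urlfilter.db.domain.field"));
    }

    private static String required(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.isEmpty()) {
            throw new IllegalArgumentException("Missing setting: " + key);
        }
        return value;
    }

    /**
     * SQL for the domain lookup. Table and column names cannot be bound as
     * parameters, so they are concatenated; the domain value itself is a
     * bound parameter ('?') for use with a PreparedStatement.
     */
    String lookupSql() {
        return "SELECT 1 FROM " + table + " WHERE " + domainField + " = ?";
    }
}
```

Because the table and field names come from configuration rather than from crawled data, concatenating them is tolerable here, but they should still be treated as trusted input only.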