This functionality is already available in Nutch-0.8
Sorry for the short comment.
Actually, the meta tags functionality is already available in the 0.8 version, along with a CrawlDatum object. You can build the required functionality just by developing plugins for parsing, indexing and querying. HTH.

Hi Stefan, indeed 0.8 is not release 1.0 yet, but it is stable and we are using it in production. As a whole Nutch is great and does the job right. There is a lot of tweaking to it, but once you get the whole thing configured to your liking there is not much to change afterwards. In terms of plugin development, I do not think Java is that far from PHP, so I do not think you would have a hard time there. The plugins are usually pretty small pieces of code, since most of the job is already done by Nutch.

For example, say you want to check a certain rule and, based on this rule, add some information to the index so you can later search your index based on that tag. The way to go about it would be to develop a parse filter plugin. This plugin is called during the parse phase, which usually happens right after fetching unless disabled in conf. Then you would add an index plugin that takes that metadata and stores it in your index as a new field. The last thing to do is write a query plugin that enables you to search the index based on the field you added in your indexing phase. HTH. Gal.

These kinds of questions should be sent to the user list and not Jira.

I'm closing this issue, because this functionality can be achieved by using a combination of CrawlDatum.metaData and url/scoring filters.
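To make the three steps described above more concrete (a parse-phase rule that records a tag, an index step that stores it as a field, and a query step that searches on that field), here is a minimal plain-Java sketch of the data flow. It deliberately does not use Nutch's plugin interfaces; the class name, method names, the "tag" field and the "release notes" rule are all hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: NOT the Nutch plugin API, only the flow of a tag
// through the parse, index and query phases.
public class TagPipelineSketch {

    // Parse phase: check a rule against the page text and record a tag as metadata.
    public static Map<String, String> parsePhase(String url, String text) {
        Map<String, String> meta = new HashMap<>();
        if (text.toLowerCase().contains("release notes")) { // hypothetical rule
            meta.put("tag", "release");
        }
        return meta;
    }

    // Index phase: copy the metadata into a named field of the index document.
    public static Map<String, String> indexPhase(String url, Map<String, String> meta) {
        Map<String, String> doc = new HashMap<>();
        doc.put("url", url);
        if (meta.containsKey("tag")) {
            doc.put("tag", meta.get("tag"));
        }
        return doc;
    }

    // Query phase: return the URLs of documents whose field equals the given value.
    public static List<String> queryPhase(List<Map<String, String>> index,
                                          String field, String value) {
        List<String> hits = new ArrayList<>();
        for (Map<String, String> doc : index) {
            if (value.equals(doc.get(field))) {
                hits.add(doc.get("url"));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[][] pages = {
            {"http://example.com/a", "Release notes for version 0.8"},
            {"http://example.com/b", "An unrelated page"}
        };
        List<Map<String, String>> index = new ArrayList<>();
        for (String[] page : pages) {
            Map<String, String> meta = parsePhase(page[0], page[1]);
            index.add(indexPhase(page[0], meta));
        }
        System.out.println(queryPhase(index, "tag", "release"));
    }
}
```

In a real setup each of the three methods would live in its own plugin, and the metadata would travel with the page between phases rather than being passed by hand.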
Does somebody have an existing demo plugin for that, one that would read URL prefixes from a file and add certain tags when a match is found? I don't yet fully get how to do it "the elegant way".
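Not an existing plugin, but the core of such a rule (URL prefixes mapped to tags) could look roughly like the sketch below. It is plain Java with no Nutch dependencies. The rule format (one `prefix tag1,tag2` pair per line, `#` for comments) and the class name `UrlPrefixTagger` are assumptions made up for this example; the lines could come from a file via `java.nio.file.Files.readAllLines`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps URL prefixes to tags. The rule format is an
// assumption for illustration, not something defined by Nutch.
public class UrlPrefixTagger {

    // Prefix -> tags, kept in the order the rules were read.
    private final Map<String, List<String>> rules = new LinkedHashMap<>();

    // Each line is expected to look like: "<url-prefix> <tag1>,<tag2>,..."
    public UrlPrefixTagger(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = line.split("\\s+", 2);
            if (parts.length < 2) {
                continue; // skip lines without any tags
            }
            List<String> tags = new ArrayList<>();
            for (String t : parts[1].split(",")) {
                String tag = t.trim();
                if (!tag.isEmpty()) {
                    tags.add(tag);
                }
            }
            rules.put(parts[0], tags);
        }
    }

    // Collect the tags of every rule whose prefix matches the URL, without duplicates.
    public List<String> tagsFor(String url) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> rule : rules.entrySet()) {
            if (url.startsWith(rule.getKey())) {
                for (String tag : rule.getValue()) {
                    if (!out.contains(tag)) {
                        out.add(tag);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "# prefix            tags",
            "http://example.com/docs/  docs,manual",
            "http://example.com/       example");
        UrlPrefixTagger tagger = new UrlPrefixTagger(lines);
        System.out.println(tagger.tagsFor("http://example.com/docs/index.html"));
        System.out.println(tagger.tagsFor("http://other.org/"));
    }
}
```

A parse filter could call `tagsFor` with the page URL and store the result as metadata, which an index plugin would then turn into a field as described earlier in this thread.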
NUTCH-173 because I imagine that per-host settings might allow adding metadata tags as described here, as well as performing various other host-, site- or URL-specific tasks.