Description
This plugin enhances the NUTCH-655 patch in two ways:
1. Meta tags supplied with your crawl URLs at injection time are propagated to the outlinks of those URLs.
2. When you index your URLs, the meta tags you specified with them are indexed alongside those URLs and can be queried directly, provided the plugin is activated as described below.
Per NUTCH-655, the flat file of URLs you inject should be tab-delimited, in the form:
www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
For example (fields are separated by tab characters):
http://slashdot.org/ corp_owner=Geeknet will_it_blend=indubitably
http://engadget.com/ corp_owner=Weblogs genre=geeksquad_thriller
To activate this plugin, you must modify two properties in your nutch-site.xml:
1. plugin.includes
add: urlmeta
to: <value>...</value>
e.g.: <value>urlmeta|parse-tika|scoring-opic|...</value>
2. urlmeta.tags
Insert a comma-delimited list of the meta tags to track. Using the above example:
<value>corp_owner, will_it_blend, genre</value>
Note that you do not need to include every tag with every URL. However, each tag must be listed in urlmeta.tags if you want it to be propagated and later indexed.
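Putting the two properties together, the relevant part of the configuration file might look like the sketch below. It assumes the usual Nutch layout, where each setting is a <property> element with <name> and <value> children inside the file's <configuration> root. The "..." is kept from the example above and stands for whatever other plugins your existing plugin.includes value already lists; it is not a literal value.

```xml
<!-- Fragment of nutch-site.xml; place inside the existing <configuration> element. -->
<property>
  <name>plugin.includes</name>
  <!-- "..." is a placeholder for the plugins already present in your value. -->
  <value>urlmeta|parse-tika|scoring-opic|...</value>
</property>
<property>
  <name>urlmeta.tags</name>
  <!-- Tag names taken from the slashdot/engadget example seed lines. -->
  <value>corp_owner, will_it_blend, genre</value>
</property>
```

With this in place, a seed line such as the slashdot.org example carries corp_owner and will_it_blend, while genre is simply absent for that URL.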