
Key: NUTCH-271
Type: New Feature
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Unassigned
Reporter: Stefan Neufeind
Votes: 0
Watchers: 1
Nutch

Meta-data per URL/site/section

Created: 19/May/06 04:12 AM   Updated: 19/Jul/06 06:53 PM
Component/s: None
Affects Version/s: 0.7.2
Fix Version/s: None

Time Tracking:
Not Specified


Resolution Date: 19/Jul/06 06:21 PM


Description
We need to index sites and attach additional meta-data tags to them. As far as I know this is not yet possible, or is there a "workaround" I don't see? What I have in mind is assigning meta-tags per start URL, indexing only content below that URL, and being able to restrict searches by those meta-tags. E.g.:

http://www.example1.com/something1/ -> meta-tag "companybranch1"
http://www.example2.com/something2/ -> meta-tag "companybranch2"
http://www.example3.com/something3/ -> meta-tag "companybranch1"
http://www.example4.com/something4/ -> meta-tag "companybranch3"

It should then be possible to search for everything in companybranch1, or across branches 1 and 3, or similar.



Stefan Neufeind added a comment - 19/May/06 04:14 AM
I guess this issue might be related to NUTCH-173, because I imagine that per-host settings might allow adding meta-data tags as described here, as well as performing various other host-, site- or URL-specific tasks.

Gal Nitzan added a comment - 19/May/06 05:06 AM
This functionality is already available in Nutch 0.8.

Gal Nitzan added a comment - 19/May/06 05:14 AM
Sorry for the short comment.

Actually, the meta-tag functionality is already available in version 0.8, along with a CrawlDatum object.

You can build the required functionality just by developing plugins for parsing, indexing and querying.

HTH.


Gal Nitzan added a comment - 19/May/06 09:35 PM

Hi Stefan,

Indeed, 0.8 is not a 1.0 release yet, but it is stable and we are using it in production.

As a whole, Nutch is great and does the job right. There is a lot of tweaking to do, but once you get the whole thing configured to your liking there is not much to change afterwards.

In terms of plugin development, I do not think Java is that far from PHP, so I do not think you would have a hard time there. The plugins are usually pretty small pieces of code, since most of the work is already done by Nutch.

For example, say you want to check a certain rule and, based on that rule, add some information to the index so that you can later search your index by that tag.

The way to go about it would be to develop a parse filter plugin. This plugin is called during the parse phase, which usually happens right after fetching unless disabled in the configuration.
The plugin has one interface, filter, which for every page fetched gets the URL, the content and a parse object that contains a meta-data object. There you can put an implementation that adds a meta-data tag when the URL of the fetched page matches some criteria.
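The rule check described here can be sketched in plain Java, independent of Nutch. This is only an illustration of the logic that would sit inside such a filter: the class name `PrefixTagRule`, the metadata key `branch` and the use of a plain `Map` in place of the parse meta-data object are all assumptions made for the example, not Nutch API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: holds URL-prefix -> tag rules
// and applies them to a per-page metadata map.
class PrefixTagRule {
    // LinkedHashMap keeps insertion order, so rules are checked in the order added.
    private final Map<String, String> tagByPrefix = new LinkedHashMap<>();

    void addRule(String urlPrefix, String tag) {
        tagByPrefix.put(urlPrefix, tag);
    }

    // Returns the tag of the first rule whose prefix matches the URL, or null if none does.
    String tagFor(String url) {
        for (Map.Entry<String, String> rule : tagByPrefix.entrySet()) {
            if (url.startsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        return null;
    }

    // Stand-in for the body of a parse filter: pageMetadata plays the role
    // of the meta-data object attached to the parse result.
    void apply(String url, Map<String, String> pageMetadata) {
        String tag = tagFor(url);
        if (tag != null) {
            pageMetadata.put("branch", tag); // "branch" is an assumed key name
        }
    }
}
```

A real plugin would call something like `apply` from its filter method, passing the fetched page's URL and the meta-data object it receives.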

Then you would add an index plugin that takes that meta-data and stores it in your index as a new field.

The last thing to do is to write a query plugin that enables you to search the index by the field you added in the indexing phase.
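To make the index and query steps concrete, here is a toy in-memory model in plain Java. It does not use the Nutch or Lucene APIs: `TaggedIndexSketch`, the field name `branch` and the map-based "documents" are assumptions for illustration only. It shows the tag being copied from the parse meta-data into a document field, and a search restricted to one or more tag values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for an index; not Nutch or Lucene code.
class TaggedIndexSketch {
    // Each "document" is a map of field name -> field value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    // Indexing step: copy the tag from the parse metadata into a document field.
    void index(String url, Map<String, String> parseMetadata) {
        Map<String, String> doc = new HashMap<>();
        doc.put("url", url);
        String tag = parseMetadata.get("branch"); // assumed metadata key
        if (tag != null) {
            doc.put("branch", tag);
        }
        docs.add(doc);
    }

    // Query step: return the URLs of documents whose "branch" field
    // equals any of the requested values. Untagged documents never match.
    List<String> urlsInBranches(String... branches) {
        Set<String> wanted = new HashSet<>(Arrays.asList(branches));
        List<String> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (wanted.contains(doc.get("branch"))) {
                hits.add(doc.get("url"));
            }
        }
        return hits;
    }
}
```

With documents tagged this way, a search across branches 1 and 3 would correspond to `urlsInBranches("companybranch1", "companybranch3")`.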

HTH.

Gal.

These kinds of questions should be sent to the user list and not to Jira.


Andrzej Bialecki added a comment - 19/Jul/06 06:21 PM
I'm closing this issue, because this functionality can be achieved by using a combination of CrawlDatum.metaData and url/scoring filters.

Stefan Neufeind added a comment - 19/Jul/06 06:53 PM
Does somebody have an existing demo plugin for that, one that would read URL prefixes from a file and add certain tags when matches are found? I don't yet fully get how to do it "the elegant way".
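No such demo plugin is attached to this issue. As a rough sketch of just the lookup part, the following plain Java class reads prefix/tag pairs and returns the tag of the longest matching prefix. The file format (one `prefix tag` pair per line, `#` for comments), the class name `UrlTagTable` and the longest-prefix policy are all assumptions for the example, and none of it is Nutch API. Wiring it into a plugin is left open.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical lookup table: URL prefix -> tag, loaded from text lines.
class UrlTagTable {
    // Each rule is a two-element array: {prefix, tag}.
    private final List<String[]> rules = new ArrayList<>();

    // Parses lines of the assumed form "<url-prefix> <tag>".
    // Blank lines, lines starting with '#', and lines without a tag are skipped.
    static UrlTagTable fromLines(List<String> lines) {
        UrlTagTable table = new UrlTagTable();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length == 2) {
                table.rules.add(parts);
            }
        }
        return table;
    }

    // Convenience loader that reads the rules from a UTF-8 text file.
    static UrlTagTable fromFile(Path file) throws IOException {
        return fromLines(Files.readAllLines(file));
    }

    // Returns the tag of the longest prefix matching the URL, or null if none matches.
    String lookup(String url) {
        String bestPrefix = "";
        String bestTag = null;
        for (String[] rule : rules) {
            if (url.startsWith(rule[0]) && rule[0].length() > bestPrefix.length()) {
                bestPrefix = rule[0];
                bestTag = rule[1];
            }
        }
        return bestTag;
    }
}
```

Choosing the longest matching prefix lets a more specific section of a site override the tag given to the site as a whole. A linear scan is fine for a handful of rules; a larger rule set would want a trie or a sorted structure.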