Details
Description
An HTML parse filter that filters outlinks in two stages.
First, classify the parse text and decide whether the parent page is relevant. If it is relevant, do not filter the outlinks. If it is irrelevant, go through each outlink and check whether the URL contains any of the important words from a list. If it does, let the outlink pass.
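The two stages described above can be sketched as follows. This is only an illustrative sketch, not the plugin's actual code: the class name `TwoStageOutlinkFilter`, the `isRelevant` predicate (a stand-in for the trained Naive Bayes classifier) and the case-insensitive substring match are all assumptions made for the example, and it does not use the Nutch plugin API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

// Hypothetical sketch of the two-stage outlink filter; not the Nutch plugin itself.
class TwoStageOutlinkFilter {
    // Stand-in for the trained classifier: true if the page text is relevant (assumption).
    private final Predicate<String> isRelevant;
    // The list of important words checked against each outlink URL.
    private final List<String> importantWords;

    TwoStageOutlinkFilter(Predicate<String> isRelevant, List<String> importantWords) {
        this.isRelevant = isRelevant;
        this.importantWords = importantWords;
    }

    // Returns the outlinks that survive filtering for a page with the given parse text.
    List<String> filter(String parseText, List<String> outlinks) {
        // Stage 1: a relevant parent page keeps all of its outlinks.
        if (isRelevant.test(parseText)) {
            return new ArrayList<>(outlinks);
        }
        // Stage 2: for an irrelevant page, keep only URLs containing an important word.
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (containsImportantWord(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    // Case-insensitive substring match (the matching rule is an assumption).
    private boolean containsImportantWord(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String word : importantWords) {
            if (lower.contains(word.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

Passing the classifier in as a predicate keeps the sketch independent of any particular model, so a real classifier could be substituted without changing the filtering logic.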
Issue Links
- is related to
  - NUTCH-2056 Move the Mahout and Lucene dependencies to the plugin from the main ivy.xml for the Naive Bayes Parse Filter (NUTCH-2038): Open
  - NUTCH-2057 Put all the files produced during training of the model for Naive Bayes classifier, in the Naive Bayes Parse Filter (NUTCH-2038), in a single folder: Open