  1. Nutch
  2. NUTCH-2038

Naive Bayes classifier based html Parse filter (for filtering outlinks)

    Details

      Description

      An HTML parse filter that filters a page's outlinks in two stages.
      First, classify the parse text to decide whether the parent page is relevant. If it is relevant, keep all of its outlinks. If it is irrelevant, go through each outlink and check whether its URL contains any of the important words from a list. If it does, let that outlink pass; otherwise, filter it out.
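
The two stages described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the plugin's actual code: the class name `TwoStageOutlinkFilter`, its methods, the tokenization, and the Laplace-smoothed multinomial Naive Bayes model are all assumptions made for the example, and it deliberately avoids Nutch's own parse filter API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from Nutch or from the issue.
class TwoStageOutlinkFilter {
    private final Map<String, Integer> relevantCounts = new HashMap<>();
    private final Map<String, Integer> irrelevantCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private final Set<String> importantWords = new HashSet<>();
    private int relevantTokens, irrelevantTokens, relevantDocs, irrelevantDocs;

    TwoStageOutlinkFilter(Set<String> importantWords) {
        for (String word : importantWords) {
            this.importantWords.add(word.toLowerCase(Locale.ROOT));
        }
    }

    // Assumed tokenization: lowercase, then split on anything outside a-z and 0-9.
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Adds one labeled training document to the word counts of its class.
    void train(String text, boolean relevant) {
        Map<String, Integer> counts = relevant ? relevantCounts : irrelevantCounts;
        List<String> tokens = tokenize(text);
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            vocabulary.add(t);
        }
        if (relevant) {
            relevantDocs++;
            relevantTokens += tokens.size();
        } else {
            irrelevantDocs++;
            irrelevantTokens += tokens.size();
        }
    }

    // Log prior plus log likelihoods, both with add-one (Laplace) smoothing.
    private double logScore(List<String> tokens, Map<String, Integer> counts,
                            int classTokens, int classDocs) {
        int totalDocs = relevantDocs + irrelevantDocs;
        double score = Math.log((classDocs + 1.0) / (totalDocs + 2.0));
        for (String t : tokens) {
            if (!vocabulary.contains(t)) continue; // skip words never seen in training
            score += Math.log((counts.getOrDefault(t, 0) + 1.0)
                    / (classTokens + vocabulary.size()));
        }
        return score;
    }

    // Stage 1: classify the parse text. Ties count as relevant, so links are kept.
    boolean isRelevant(String parseText) {
        List<String> tokens = tokenize(parseText);
        return logScore(tokens, relevantCounts, relevantTokens, relevantDocs)
                >= logScore(tokens, irrelevantCounts, irrelevantTokens, irrelevantDocs);
    }

    // Stage 2: for an irrelevant page, keep only URLs containing an important word.
    List<String> filterOutlinks(String parseText, List<String> outlinks) {
        if (isRelevant(parseText)) {
            return new ArrayList<>(outlinks);
        }
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String lower = url.toLowerCase(Locale.ROOT);
            for (String word : importantWords) {
                if (lower.contains(word)) {
                    kept.add(url);
                    break;
                }
            }
        }
        return kept;
    }
}
```

In this sketch a caller would train the classifier on a few labeled texts with `train`, then pass each page's parse text and outlink URLs to `filterOutlinks`. Treating ties as relevant is a design choice of the example: it errs on the side of keeping links when the classifier has no evidence either way.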


              People

              • Assignee:
                chrismattmann Chris A. Mattmann
                Reporter:
                asitang Asitang Mishra
