Nutch / NUTCH-2038

Naive Bayes classifier-based HTML parse filter (for filtering outlinks)


Details

    Description

      An HTML parse filter that filters outlinks in two stages.
      First, classify the parse text and decide whether the parent page is relevant. If it is relevant, keep all of its outlinks. If it is irrelevant, go through each outlink and check whether its URL contains any of the important words from a list. If it does, let the outlink pass; otherwise filter it out.
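
      The two stages described above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code attached to this issue: the class name TwoStageOutlinkFilter, its methods, the whitespace/punctuation tokenizer, and the add-one smoothing are all assumptions made for the example, and it deliberately does not use Nutch's plugin interfaces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: a tiny Naive Bayes relevance check plus a URL word-list fallback. */
public class TwoStageOutlinkFilter {
    // Per-class word counts for a small multinomial Naive Bayes model.
    private final Map<String, Integer> relevantCounts = new HashMap<>();
    private final Map<String, Integer> irrelevantCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int relevantTokens = 0, irrelevantTokens = 0;
    private int relevantDocs = 0, irrelevantDocs = 0;
    private final List<String> importantWords = new ArrayList<>();

    public TwoStageOutlinkFilter(List<String> importantWords) {
        for (String w : importantWords) {
            this.importantWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    // Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Adds one labeled training document to the model. */
    public void train(String text, boolean relevant) {
        Map<String, Integer> counts = relevant ? relevantCounts : irrelevantCounts;
        if (relevant) relevantDocs++; else irrelevantDocs++;
        for (String t : tokenize(text)) {
            counts.merge(t, 1, Integer::sum);
            vocabulary.add(t);
            if (relevant) relevantTokens++; else irrelevantTokens++;
        }
    }

    // Log prior plus summed log likelihoods, with add-one smoothing.
    // The extra +1 in the denominator reserves mass for unseen words.
    private double logScore(List<String> tokens, Map<String, Integer> counts,
                            int classTokens, int classDocs) {
        double score = Math.log((classDocs + 1.0) / (relevantDocs + irrelevantDocs + 2.0));
        int v = vocabulary.size();
        for (String t : tokens) {
            score += Math.log((counts.getOrDefault(t, 0) + 1.0) / (classTokens + v + 1.0));
        }
        return score;
    }

    /** Stage 1: classify the parent page from its parse text. */
    public boolean isRelevant(String parseText) {
        List<String> tokens = tokenize(parseText);
        return logScore(tokens, relevantCounts, relevantTokens, relevantDocs)
            >= logScore(tokens, irrelevantCounts, irrelevantTokens, irrelevantDocs);
    }

    /** Stage 2: keep every outlink of a relevant page; otherwise keep only
     *  outlinks whose URL contains one of the important words. */
    public List<String> filterOutlinks(String parseText, List<String> outlinks) {
        if (isRelevant(parseText)) {
            return new ArrayList<>(outlinks);
        }
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String lower = url.toLowerCase(Locale.ROOT);
            for (String w : importantWords) {
                if (lower.contains(w)) {
                    kept.add(url);
                    break;
                }
            }
        }
        return kept;
    }
}
```

      In this sketch the word list acts only as a fallback: it is consulted for pages the classifier rejects, so an off-topic page can still contribute links that look on-topic from their URLs alone.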

            People

              chrismattmann Chris A. Mattmann
              asitang Asitang Mishra
              Votes: 0
              Watchers: 8
