  1. Nutch
  2. NUTCH-585

[PARSE-HTML plugin] Block certain parts of HTML code from being indexed

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      All operating systems

    • Patch Info:
      Patch Available

      Description

      We use Nutch to index our own web sites. We would like to exclude certain parts of our pages from the index, because we know they are not relevant (for instance, there are several links to change the background color) and they generate spurious matches.

      We have modified the plugin so that it ignores HTML code between certain HTML comments, like
      <!-- START-IGNORE -->
      ... ignored part ...
      <!-- STOP-IGNORE -->

      We feel this might be useful to someone else, perhaps with the comment strings factored out as configuration properties (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
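
      As a rough illustration of the idea (not the code in the attached patches), the sketch below removes everything between a start and a stop marker from a raw HTML string before it would be handed to a parser. The class name IgnoreRegionFilter and the method stripIgnored are hypothetical, the markers are passed in as plain arguments rather than read from parser.html.ignore.start and parser.html.ignore.stop, and dropping the rest of the page after an unterminated start marker is an assumption, not behavior described in this issue.

```java
// Hypothetical helper, not part of Nutch: strips marked regions from raw HTML.
final class IgnoreRegionFilter {

    // Marker strings taken from the example in the description.
    static final String DEFAULT_START = "<!-- START-IGNORE -->";
    static final String DEFAULT_STOP = "<!-- STOP-IGNORE -->";

    /**
     * Returns html with every region from a start marker through the next
     * stop marker removed (markers included).
     * Assumption: if a start marker has no matching stop marker, the rest
     * of the document is dropped.
     */
    static String stripIgnored(String html, String start, String stop) {
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;
        while (true) {
            int s = html.indexOf(start, pos);
            if (s < 0) {
                // No further ignored region: keep the remainder as-is.
                out.append(html, pos, html.length());
                break;
            }
            // Keep the text that precedes the start marker.
            out.append(html, pos, s);
            int e = html.indexOf(stop, s + start.length());
            if (e < 0) {
                // Unterminated region: drop everything after the start marker.
                break;
            }
            // Resume scanning right after the stop marker.
            pos = e + stop.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<p>Article text</p>"
                + DEFAULT_START
                + "<a href=\"?bg=blue\">Blue background</a>"
                + DEFAULT_STOP
                + "<p>More text</p>";
        System.out.println(stripIgnored(page, DEFAULT_START, DEFAULT_STOP));
    }
}
```

      Working on the raw string keeps the sketch short. A filter inside the parse-html plugin would more likely act on the parsed DOM, which is a different design from the one shown here.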

      We are almost ready to contribute our code snippet. We look forward to any expression of interest, or to an explanation of why what we are doing is plain wrong!

        Attachments

        1. blacklist_whitelist_plugin.patch
          21 kB
          Elisabeth Adler
        2. nutch-585-excludeNodes.patch
          6 kB
          Rui Araújo
        3. nutch-585-jostens-excludeDIVs.patch
          4 kB
          N. Hira

          Activity

            People

            • Assignee:
              markus17 Markus Jelsma
              Reporter:
              spino.spinelli Andrea Spinelli
            • Votes:
              9
              Watchers:
              13

              Dates

              • Created:
                Updated: