Nutch / NUTCH-585

[PARSE-HTML plugin] Block certain parts of HTML code from being indexed

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.11
    • Component/s: None
    • Labels:
      None
    • Environment:

      All operating systems

    • Patch Info:
      Patch Available

      Description

      We are using Nutch to index our own web sites. We would prefer not to index certain parts of our pages because we know they are not relevant (for instance, there are several links to change the background color) and they generate spurious matches.

      We have modified the plugin so that it ignores HTML code between certain HTML comments, like
      <!-- START-IGNORE -->
      ... ignored part ...
      <!-- STOP-IGNORE -->

      We feel this might be useful to someone else, perhaps after factoring out the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
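
A minimal sketch of the comment-delimited approach, for illustration only. It is not the code from any of the attached patches and does not use the Nutch plugin API. The class name IgnoreBlockStripper is hypothetical, and the marker strings are passed in as plain constructor arguments where a real plugin would presumably read them from the proposed parser.html.ignore.start and parser.html.ignore.stop properties. It assumes the raw HTML is available as a string before parsing.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical helper: removes everything between a pair of marker comments,
 * e.g. <!-- START-IGNORE --> ... <!-- STOP-IGNORE -->, markers included.
 */
public final class IgnoreBlockStripper {

    private final Pattern block;

    public IgnoreBlockStripper(String startMarker, String stopMarker) {
        // Quote the marker text so regex metacharacters in it are taken literally;
        // allow optional whitespace between the comment delimiters and the marker.
        String start = "<!--\\s*" + Pattern.quote(startMarker) + "\\s*-->";
        String stop = "<!--\\s*" + Pattern.quote(stopMarker) + "\\s*-->";
        // Reluctant match with DOTALL so a block may span several lines and
        // each start marker pairs with the nearest following stop marker.
        this.block = Pattern.compile(start + ".*?" + stop, Pattern.DOTALL);
    }

    /** Returns the HTML with every complete ignore block removed. */
    public String strip(String html) {
        return block.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        IgnoreBlockStripper stripper =
                new IgnoreBlockStripper("START-IGNORE", "STOP-IGNORE");
        String html = "<p>keep</p>"
                + "<!-- START-IGNORE --><a href=\"#\">change background</a><!-- STOP-IGNORE -->"
                + "<p>also keep</p>";
        System.out.println(stripper.strip(html));
    }
}
```

In this sketch a start marker without a matching stop marker is left untouched, so a forgotten STOP-IGNORE does not silently drop the rest of the page. Whether that is the right behavior for a real parser plugin is a design choice.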

      We are almost ready to contribute our code snippet. We look forward to any expression of interest, or to an explanation of why what we are doing is plain wrong!

      1. nutch-585-jostens-excludeDIVs.patch
        4 kB
        N. Hira
      2. nutch-585-excludeNodes.patch
        6 kB
        Rui Araújo
      3. blacklist_whitelist_plugin.patch
        21 kB
        Elisabeth Adler

        Activity

        Andrea Spinelli created issue -
        N. Hira made changes -
        Field Original Value New Value
        Attachment nutch-585-jostens-excludeDIVs.patch [ 12467198 ]
        Rui Araújo made changes -
        Attachment nutch-585-excludeNodes.patch [ 12494949 ]
        Markus Jelsma made changes -
        Assignee Markus Jelsma [ markus17 ]
        Fix Version/s 1.4 [ 12316519 ]
        Patch Info [Patch Available]
        Elisabeth Adler made changes -
        Attachment blacklist_whitelist_plugin.patch [ 12495393 ]
        Julien Nioche made changes -
        Fix Version/s 1.5 [ 12318246 ]
        Fix Version/s 1.4 [ 12316519 ]
        Markus Jelsma made changes -
        Fix Version/s 1.6 [ 12319941 ]
        Fix Version/s 1.5 [ 12318246 ]
        Roberto Gardenier made changes -
        Comment [ I have compiled Nutch 1.5.1 with the provided plugin and used the configuration as described above, all without success.
        Could anyone assist me with troubleshooting?

        Nutch crawls and Solr indexes successfully, but the content field still includes content that is supposed to be blacklisted.

        Steps:
        1. Patched Nutch 1.5.1 with the above blacklist_whitelist_plugin.patch.
        2. Enabled the plugin by adding index-blacklist-whitelist to plugin.includes in nutch-default.xml.
        3. Added the new field strippedContent to schema.xml (both Nutch and Solr):
           <!-- fields for the blacklist/whitelist plugin -->
           <field name="strippedContent" type="text" stored="true" indexed="true"/>
        4. Configured parser.html.blacklist to blacklist "div.kruimelspoor" in nutch-default.xml.

        I pointed Nutch at my site and started the crawl. I don't get warnings, errors, or any other showstoppers; the crawl goes well and the index is filled, but it still contains everything inside div.kruimelspoor.
        ]
        Lewis John McGibbney made changes -
        Fix Version/s 1.7 [ 12323281 ]
        Fix Version/s 1.6 [ 12319941 ]
        Lewis John McGibbney made changes -
        Fix Version/s 1.8 [ 12324326 ]
        Fix Version/s 1.7 [ 12323281 ]
        Lewis John McGibbney made changes -
        Fix Version/s 1.9 [ 12324611 ]
        Fix Version/s 1.8 [ 12324326 ]
        Julien Nioche made changes -
        Fix Version/s 1.10 [ 12327187 ]
        Fix Version/s 1.9 [ 12324611 ]
        Lewis John McGibbney made changes -
        Fix Version/s 1.11 [ 12329358 ]
        Fix Version/s 1.10 [ 12327187 ]

          People

          • Assignee:
            Markus Jelsma
            Reporter:
            Andrea Spinelli
          • Votes:
            7
            Watchers:
            10
