Nutch / NUTCH-585

[PARSE-HTML plugin] Block certain parts of HTML code from being indexed

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.10
    • Component/s: None
    • Labels:
      None
    • Environment:

      All operating systems

    • Patch Info:
      Patch Available

      Description

      We are using nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and generate spurious matches.

      We have modified the plugin so that it ignores HTML code between certain HTML comments, like
      <!-- START-IGNORE -->
      ... ignored part ...
      <!-- STOP-IGNORE -->

      We feel this might be useful to someone else, maybe factorizing the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
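
A minimal sketch of the marker idea, assuming the two marker strings are read from configuration and that the page is available as a plain string. The class name is hypothetical; this is neither the reporter's patch nor existing Nutch code, and a real parser plugin would more likely work on the parsed DOM.

```java
import java.util.regex.Pattern;

/** Sketch only: not the reporter's patch and not part of Nutch. */
final class IgnoreMarkerStripper {
    private final Pattern ignored;

    /** start/stop are the literal marker strings, e.g. the values of the proposed properties. */
    IgnoreMarkerStripper(String start, String stop) {
        // Pattern.quote treats the markers as literals, DOTALL lets '.' span line breaks,
        // and the reluctant '.*?' stops at the first stop marker so blocks stay separate.
        this.ignored = Pattern.compile(
            Pattern.quote(start) + ".*?" + Pattern.quote(stop), Pattern.DOTALL);
    }

    /** Returns html with every start..stop block (markers included) removed. */
    String strip(String html) {
        return ignored.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        IgnoreMarkerStripper stripper =
            new IgnoreMarkerStripper("<!-- START-IGNORE -->", "<!-- STOP-IGNORE -->");
        String page = "<p>keep</p><!-- START-IGNORE --><a href=\"#\">red</a>"
            + "<!-- STOP-IGNORE --><p>also keep</p>";
        System.out.println(stripper.strip(page)); // prints <p>keep</p><p>also keep</p>
    }
}
```

An unmatched start marker is left untouched in this sketch; whether that case should drop the rest of the page is a design decision the description does not settle.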

      We are almost ready to contribute our code snippet. We look forward to any expression of interest, or to an explanation of why what we are doing is plain wrong!

      1. blacklist_whitelist_plugin.patch
        21 kB
        Elisabeth Adler
      2. nutch-585-excludeNodes.patch
        6 kB
        Rui Araújo
      3. nutch-585-jostens-excludeDIVs.patch
        4 kB
        N. Hira

        Activity

        Otis Gospodnetic added a comment -

        A more general solution is needed. This solution should not rely on a priori marked-up content as in your example, but should automatically recognize things like footers, sidebars, repeating navigation and other elements.

        I am sure there are PhD theses out there on this topic...

        Andrea Spinelli added a comment -

        I absolutely agree that a more general solution is needed; however, I think that some current Nutch users might benefit from a quick fix.

        If there is no opposition, I could submit a patch (less than 20 lines).

        On the other hand, does anybody think that blocking selected portions of text could pose serious architectural or stability risks?

        About the more general solution, do you think there is a viable path from here to there?

        – andrea

        Matt Kangas added a comment -

        Simplest path forward... that I can think of:

        1) Add a new indexing plugin extension-point for filtering page content.
        2) Put your "a priori marked-up content" exclusion logic into a plugin.
        3) Someone else figures out a more general-purpose solution later, and swaps out your plugin at that time.

        Ergo, you generalize the interface, and lazy-load the more general implementation.
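
The shape of such an extension point can be sketched as follows. The interface and class names are hypothetical and are not an existing Nutch API; the sketch only illustrates how a marker-based filter and a later, more general one could sit behind the same interface.

```java
import java.util.List;

/** Hypothetical extension point: name and signature are illustrative, not a Nutch API. */
interface ContentFilter {
    /** Returns the text that should be indexed for the given page. */
    String filter(String url, String text);
}

/** Runs the configured filters in order, so implementations can be swapped freely. */
final class ContentFilterChain {
    private final List<ContentFilter> filters;

    ContentFilterChain(List<ContentFilter> filters) {
        this.filters = List.copyOf(filters);
    }

    String apply(String url, String text) {
        String current = text;
        for (ContentFilter f : filters) {
            current = f.filter(url, current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the marker-based plugin: drops a fixed navigation phrase.
        ContentFilter dropNav = (url, text) -> text.replace("Home | About | Contact", "");
        ContentFilter trim = (url, text) -> text.trim();
        ContentFilterChain chain = new ContentFilterChain(List.of(dropNav, trim));
        System.out.println(chain.apply("http://example.com/",
            "Home | About | Contact  Real content")); // prints Real content
    }
}
```

Replacing `dropNav` with a smarter implementation later would not require any change to the chain or its callers.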

        Siddharth Jha added a comment -

        Hello All

        We are implementing a search engine based on Lucene/Nutch and we are also facing problems with caching. Would it be possible for you to help me out on this issue and provide me a code snippet?

        cwinay@yahoo.com added a comment -

        Hi,
        Is it possible for you to share the code with me?
        I seem to have found a use for the facility you wish to add to Nutch.
        I'm using a content management system called Infoglue to create my website.
        The pages I create for my site have a fixed template containing header, footer and a menu system.
        I would like Nutch to index the template content only for the home page, and to index just the relevant (non-template) content on the inner pages.

        So please share your idea and/or code.
        Details of the implementation are appreciated.
        So far I have just been a naive Nutch user.

        Thanks a lot.
        Winz

        Quoted from:
        http://www.nabble.com/-jira--Created%3A-%28NUTCH-585%29--PARSE-HTML-plugin--Block-certain-parts-of-HTML-code-from-being-indexed-tp14023775p14023775.html

        Andrea Spinelli added a comment -

        Yes, I'd be glad about that.

        There are some caveats, though:

        1. I worked on a very old version of Nutch (0.7.2).
        2. I have to dig through my sources to find our patch, because it was written a
        long time ago.

        We have a week of demos, so I will write back at the end of the
        working week.

        Hi
        Andrea


        Andrea Spinelli - team QUALITY
        email: andrea.spinelli@imteam.it
        phone: +39-035-636029
        fax: +39-035-638129
        surface-mail: Via Sigismondi 40, 24018 Villa d'Alme', BG

        This message is confidential; pursuant to D.P.R. 44/314159/2718 of
        01/04/2009 you may not publish or forward it, and you will never
        again be able to use any of the Italian words it contains.
        If you receive it by mistake, spray it with pepper spray,
        delete it, format your hard disk and then send a postcard
        to the address above to let us know.

        David Stuart added a comment -

        Hi Andrea,

        I hope your week of demos went well. I too would be interested in this code, as I would like to look at extending it to be slightly more generic, allowing for regular expression matches or an XPath-like model (the plan is still forming). From the web crawler's point of view it would be hard to get right, but we have about 26 sites that are well known to us that we wish to crawl, and they have common blocks that we wish to remove, which a configurable version of your code may achieve.

        Looking forward to seeing your patch.

        Regards,

        David Stuart

        Rich Goguen added a comment -

        Hi Andrea,

        I would also be interested in the code.

        Thank you.

        Rich

        N. Hira added a comment -

        We use Solr/Nutch on our corporate web site and are very happy with the results. Thank you. We have struggled with something similar to NUTCH-585 for a few months now.

        Although it is different from the original intent, here's a quick/short patch that might help get this feature going again.

        Intended use:

        • Let's assume you're crawling a set of internal web sites and would like to exclude certain HTML fragments (from indexing) like the navigation and other common content.
        • If these fragments are contained in DIVs with IDs like "menuNav", "footerNav", etc., then you can now add a new property to nutch-site.xml to exclude these DIVs.
        • If you don't set this property, the normal behavior remains (backward compatible)
          <property>
            <name>parser.html.divIDsToExclude</name>
            <value>account_menu_container,footer_menu_container,legal,main_menu_container</value>
            <description>
            A comma-delimited list of DIV IDs whose content will not be indexed.  Use this to tell
            the HTML parser to ignore, for example, site navigation text.
            Note that DIVs with these IDs, and their children, will be silently ignored by the parser
            so verify the indexed content with Luke to confirm results.
            </description>
          </property>
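
How a parser might honor such a property can be sketched as a DOM walk that skips matching DIVs. This is an illustration under assumptions, not the code in the attached patch: the class name is hypothetical, and the demo parses well-formed XHTML with the JDK's XML parser, whereas Nutch builds its DOM with an HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Sketch only: illustrates the idea, not the code in the attached patch. */
final class DivIdExcluder {
    private final Set<String> excludedIds = new HashSet<>();

    /** csv is the comma-delimited property value, e.g. "legal,main_menu_container". */
    DivIdExcluder(String csv) {
        for (String id : csv.split(",")) {
            if (!id.trim().isEmpty()) excludedIds.add(id.trim());
        }
    }

    /** Appends the text under node to out, skipping excluded DIVs and their children. */
    void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getTagName())
                    && excludedIds.contains(el.getAttribute("id"))) {
                return; // the whole subtree is silently dropped
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) out.append(' ');
                out.append(text);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }

    /** Demo helper: parses well-formed XHTML and returns the root element. */
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse demo document", e);
        }
    }

    public static void main(String[] args) {
        DivIdExcluder excluder = new DivIdExcluder("legal,main_menu_container");
        StringBuilder text = new StringBuilder();
        excluder.collectText(parse("<html><body><div id=\"main_menu_container\">Home</div>"
            + "<div id=\"content\">Hello <i>world</i></div></body></html>"), text);
        System.out.println(text); // prints Hello world
    }
}
```

Because the excluded subtree is never visited, links inside it are dropped along with the text in this sketch.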
          

        Inclusion/growth:

        • This code was written against nutch 1.2 and is backward compatible in that the new behavior is only present if configured.
        • In future, it might be good to have different "strategy patterns" for how exclusions are determined; some might need algorithmic detection (whole web crawls), others might prefer jquery-selectors for HTML fragments, etc.

        Best regards,

        -h

        Hira, N.R. (Jostens, Inc.)

        Wim Mostrey added a comment -

        The patch provided by N. Hira works as advertised on Nutch 1.2.

        Markus Jelsma added a comment -

        Thanks for mentioning Wim. This patch can be useful for a quick solution. Perhaps it can be incorporated in a Nutch release.

        Rui Araújo added a comment -

        I can also confirm that the patch works on Nutch 1.3.

        However, it didn't work for my use case, as I need to filter a diverse set of tags
        based on different attributes. Besides, I needed the links from the filtered area
        to be extracted, which did not happen.

        So I altered Hira's patch and I am publishing my work here.

        This is the new changed property.

         
        <property>
          <name>parser.html.NodesToExclude</name>
          <value>table;summary;header|div;id;navigation</value>
          <description>
          A list of nodes whose content will not be indexed separated by "|".  Use this to tell
          the HTML parser to ignore, for example, site navigation text.
          Each node has three elements: the first one is the tag name, the second one the
          attribute name, the third one the value of the attribute.
          Note that nodes with these attributes, and their children, will be silently ignored by the parser
          so verify the indexed content with Luke to confirm results.
          </description>
        </property>
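
One way to read this property format is sketched below. The class and method names are hypothetical and the attached patch may parse the value differently; the sketch only shows the `|`-separated list of `tag;attribute;value` triples described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: one way to read the property format; the attached patch may differ. */
final class NodeExclusionRules {

    /** One tag;attribute;value triple from the property value. */
    static final class Rule {
        final String tag;
        final String attribute;
        final String value;

        Rule(String tag, String attribute, String value) {
            this.tag = tag;
            this.attribute = attribute;
            this.value = value;
        }

        /** Tag and attribute names compare case-insensitively, values exactly. */
        boolean matches(String tagName, String attrName, String attrValue) {
            return tag.equalsIgnoreCase(tagName)
                && attribute.equalsIgnoreCase(attrName)
                && value.equals(attrValue);
        }
    }

    /** Splits "table;summary;header|div;id;navigation" into rules. */
    static List<Rule> parse(String property) {
        List<Rule> rules = new ArrayList<>();
        for (String entry : property.split("\\|")) {
            String[] parts = entry.trim().split(";");
            if (parts.length != 3) continue; // skip malformed entries instead of failing
            rules.add(new Rule(parts[0].trim(), parts[1].trim(), parts[2].trim()));
        }
        return rules;
    }

    /** True if any rule matches the given tag/attribute/value combination. */
    static boolean isExcluded(List<Rule> rules, String tag, String attr, String value) {
        for (Rule r : rules) {
            if (r.matches(tag, attr, value)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse("table;summary;header|div;id;navigation");
        System.out.println(isExcluded(rules, "div", "id", "navigation")); // prints true
        System.out.println(isExcluded(rules, "div", "id", "content"));    // prints false
    }
}
```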
        

        I really think this should be present in Nutch. I am available to improve the patch until it is ready for inclusion. Also I am looking for comments on how I implemented my improvements.

        Thanks,
        Rui

        Rui Araújo added a comment -

        Exclude Nodes Patch.

        Markus Jelsma added a comment -

        Marked for 1.4. Thanks!

        Rui Araújo added a comment -

        Cool!

        Anyway, as I said before, my patch extracts the links from the filtered area, while Hira's patch filters before any extraction is done.

        Do you think that this behavior should be configurable?

        Elisabeth Adler added a comment -

        Based on the suggestions/code above, I have created a plugin to blacklist or whitelist html elements (blacklist_whitelist_plugin.patch). This was based on the need for not indexing header/footer/navigation, so the user gets really only relevant results, e.g. even if the term shows up in the navigation.

        The elements to be parsed (or not) can be defined by using CSS-like selectors. A new field called "strippedContent" is available in the index which can be used for searching. Links are still crawled and parsed from the "content" field, allowing all pages to be parsed. The full documentation is in the README.txt in the patch.

        Julien Nioche added a comment -

        Marking for 1.5. Needs reviewing and won't make it into 1.4

        Abhay Dabholkar added a comment -

        I have been using this plugin for some time, and recently I wanted to use it to extract text based on a URL pattern.
        Example:
        for http://x.y.z/?id=12 white-list will only look for div id=12
        for http://x.y.z/?id=13 white-list will only look for div id=13

        ref: http://lucene.472066.n3.nabble.com/index-blacklist-whitelist-pluign-for-multiple-set-of-urls-td3711697.html
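
The URL-dependent whitelist in the example could be derived along these lines. This is a sketch under assumptions: the class and method names are hypothetical, the plugin is not known to offer this feature, and only the `id` query parameter from the example is handled.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: derives the DIV id to whitelist from the page URL. */
final class UrlScopedWhitelist {
    // Matches an "id" query parameter, whether it is first (?id=) or later (&id=).
    private static final Pattern ID_PARAM = Pattern.compile("[?&]id=([^&#]+)");

    /** Returns the DIV id to keep for this URL, or empty if the URL has no id parameter. */
    static Optional<String> divIdFor(String url) {
        Matcher m = ID_PARAM.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(divIdFor("http://x.y.z/?id=12")); // prints Optional[12]
        System.out.println(divIdFor("http://x.y.z/"));       // prints Optional.empty
    }
}
```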

        Lewis John McGibbney added a comment -

        I like this contribution, Elisabeth. Is there any way it could be updated to trunk with the following suggestions?
        1) Please rename the package names to org.apache.nutch.blah.blah
        2) In your ivy.xml please change the ivy-configuration.xml to

          <configurations>
              <include file="../../..//ivy/ivy-configurations.xml"/>
          </configurations>
        

        This is Eclipse-specific.
        3) Would it be possible to change the CHANGES.txt to package.html and store it in the lowest-level folder within the java directory?
        4) It would really put the cherry on top if we could get a test case scenario; this would be a big +1.
        5) I think the name is maybe a bit long... but I am fine keeping it if you think it is appropriate, as it is your patch after all.

        Thank you for the contribution.

        Markus Jelsma added a comment -

        20120304-push-1.6

        Roberto Gardenier added a comment -

        Hello all,

        I've stumbled upon this ticket in my research to achieve the stated situation: block certain HTML parts from being indexed.
        I understand that this plugin/patch achieves the desired situation, but I cannot seem to understand the following:

        • Will this feature be implemented in Nutch 1.5 (according to Julien Nioche - 28/Sep/11 11:24) or will this be implemented in 1.6 (if this is what Markus Jelsma means with his comment on 03/Apr/12 12:08)?
        • I found out that yesterday there was a vote concerning Nutch 1.5 rc1 (http://lucene.472066.n3.nabble.com/VOTE-Apache-Nutch-1-5-release-rc-1-td3913604.html). Is this a reliable source? If so, what are the prospects for releasing this version?
        • The reason I want to know is that I want to use the given plugin, but I can also wait if the Nutch 1.5 release date isn't that far away.

        It would be great if someone could advise me.
        Many thanks in advance.

        With kind regards,
        Roberto Gardenier

        Markus Jelsma added a comment -

        This issue is not going to be part of Nutch 1.5 which is likely to be released very soon. However, you can download the patch and see if it works for you, the plugin builds fine for 1.4, 1.5 and the to-be 1.6-SNAPSHOT.

        Roberto Gardenier added a comment -

        Thank you for your quick reply! Much appreciated!

        We are using Nutch 1.4, so I will use the nutch-585-excludeNodes.patch for blocking certain HTML blocks. I assume that using the start and stop tags provided in the description is all we need to get things working? So we don't have to edit any config files?

        Kind regards,
        Roberto Gardenier

        Markus Jelsma added a comment - - edited

        You should take the latest patch: blacklist_whitelist_plugin.patch. It contains example config etc. Please let us know if you get it to work. Also check Rui's comment. It does not work with start/stop-tags anymore.

        https://issues.apache.org/jira/browse/NUTCH-585?focusedCommentId=13107294&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13107294

        Iwan Luijks added a comment -

        I can confirm the plugin provided in blacklist_whitelist_plugin.patch also works for Nutch 1.5.1 without extra configuration.

        Roberto Gardenier added a comment -

        Will this patch be implemented in Nutch at all? I've seen this patch / feature request being marked from 1.4 up till 1.7 now.
        Even though the patch works with Nutch 1.5 up till 1.5.1 I wonder if this will become part of Nutch at any time, Markus Jelsma?

        Bojan Tomic added a comment - - edited

        I adapted Elisabeth Adler's plugin for use with Nutch 2.1 and added two small features:

        • the ability to protect certain URLs from filtering
        • the ability to configure the field where the filtered content is stored (overwriting the text field by default)

        I didn't immediately realize the common practice is creating a patch, so I put my stuff on GitHub: https://github.com/veggen/nutch-element-selector
        but if anyone cares about including this, I will gladly make a patch as well (and change package names, rename the plugin to its original name etc).

        kiran added a comment -

        Hi Tomic,

        If you are using SVN, please see here for instructions (https://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_Three:_Using_the_JIRA_and_Developing).

        If not, this will be useful (http://docs.moodle.org/dev/How_to_create_a_patch) for general purposes.

        Thanks for your contribution.

        Iwan Luijks added a comment -

        Hi Bojan Tomic,

        Did you succeed in making a patch for this issue for Nutch 2.*? It would be nice if this could be included in that version as well.


          People

          • Assignee:
            Markus Jelsma
            Reporter:
            Andrea Spinelli
          • Votes:
            7
            Watchers:
            10
