Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4, nutchgora
    • Fix Version/s: 1.5
    • Component/s: parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Parse-metatags plugin

      The parse-metatags plugin consists of an HTMLParserFilter that takes a list of metatag names as its parameter, with '*' as the default value. The names are separated by ';'.

      To extract the values of the description and keywords metatags, specify the following in nutch-site.xml:

      <property>
        <name>metatags.names</name>
        <value>description;keywords</value>
      </property>
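
The effect of the metatags.names setting can be illustrated with a short Python sketch. It is not the plugin's Java code: the class and function names below are hypothetical, and lower-casing the tag names is an assumption made for the example. It only shows the idea of splitting the configured value on ';' and keeping the matching meta tags, with '*' keeping all of them.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> values from an HTML page."""

    def __init__(self, wanted):
        super().__init__()
        # wanted: set of lower-cased names, or {"*"} to keep every meta tag
        self.wanted = wanted
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name is None or content is None:
            return
        name = name.lower()  # assumption: names are compared case-insensitively
        if "*" in self.wanted or name in self.wanted:
            self.found.setdefault(name, []).append(content)


def parse_names(value="*"):
    """Split a metatags.names-style value on ';' ('*' is the default)."""
    return {n.strip().lower() for n in value.split(";") if n.strip()}


def extract_metatags(html, names_value="*"):
    """Return {name: [content, ...]} for the configured meta tag names."""
    collector = MetaTagCollector(parse_names(names_value))
    collector.feed(html)
    return collector.found
```

Calling `extract_metatags(page, "description;keywords")` keeps only those two tags, while the default value keeps every named meta tag on the page.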
      

      The MetatagIndexer uses the output of the parsing above to create two fields 'keywords' and 'description'. Note that keywords is multivalued.
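
As a rough illustration of the indexing step described above, the following Python sketch builds the two fields from parsed metatag values. The function name is hypothetical, and splitting the keywords on commas is an assumption; the description does not state which separator the MetatagIndexer uses.

```python
def build_index_fields(parse_meta, keyword_separator=","):
    """Turn parsed metatag values into index fields.

    'description' stays a single value; 'keywords' becomes a list of
    values, mirroring a multivalued field. The separator is an assumption.
    """
    fields = {}
    description = parse_meta.get("description")
    if description:
        fields["description"] = description.strip()
    keywords = parse_meta.get("keywords")
    if keywords:
        # Drop empty entries such as the one produced by "a, , b"
        values = [k.strip() for k in keywords.split(keyword_separator) if k.strip()]
        if values:
            fields["keywords"] = values
    return fields
```

A page without either metatag simply yields no extra fields.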

      The query-basic plugin is used to include these fields in the search, e.g. in nutch-site.xml:

      <property>
        <name>query.basic.description.boost</name>
        <value>2.0</value>
      </property>
      
      <property>
        <name>query.basic.keywords.boost</name>
        <value>2.0</value>
      </property>
      

      This code was developed by DigitalPebble Ltd and offered to the community by ANT.com.

      1. NUTCH-809-trunk.patch
        15 kB
        Julien Nioche
      2. metatags-plugin+tutorial.zip
        29 kB
        Elisabeth Adler
      3. NUTCH-809_metatags_1.3.patch
        14 kB
        Elisabeth Adler
      4. NUTCH-809.patch
        20 kB
        Julien Nioche

          Activity

          Julien Nioche added a comment -

          Modified version of the plugin which is compatible with parse-tika

          Markus Jelsma added a comment -

          Why don't we include this plugin?

          Julien Nioche added a comment -

          It's been a long time and I'd forgotten about this one.

          Obviously we don't need the QueryFilter anymore. I am not entirely happy with the indexing part of it though, as we handle only 2 values (description and keywords) whereas the parsing step is open to any values specified by the users.

          We also have the urlmeta plugin, which tracks metadata specified in the seed lists and indexes it. The name of this plugin should be improved, BTW.

          (thinking aloud) Why don't we have a generic indexing implementation which could index any metadata specified by the user, be it from the crawldb or the parse metadata? The parse-metatags plugin would then only deal with the parsing step and leave the indexing to this indexer, which could also be used by the existing urlmeta (which would then only help with the transfer of the metadata from a root page to its outlinks).

          We can also leave things as they are and just rename urlmeta to something like seed-metadata-propagation (or anything better) and keep the possibility of doing specific things in the indexing part of the metadata, for instance splitting the keywords into multiple fields.
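
The generic indexer floated in this comment could look roughly like the Python sketch below. It is only an illustration of the idea, not code from any patch: the function name, its parameters, and the dict-based document are all hypothetical.

```python
def index_metadata(doc, crawldb_meta, parse_meta, crawldb_keys=(), parse_keys=()):
    """Copy user-selected metadata into the document being indexed.

    crawldb_keys and parse_keys stand in for two user-configured lists:
    which keys to take from the crawldb metadata and which from the
    parse metadata. Keys that are absent from a source are skipped.
    """
    for source, keys in ((crawldb_meta, crawldb_keys), (parse_meta, parse_keys)):
        for key in keys:
            value = source.get(key)
            if value is not None:
                # Every field is kept as a list so repeated keys accumulate
                doc.setdefault(key, []).append(value)
    return doc
```

With such an indexer the parsing plugin only has to populate the parse metadata, and field-specific handling (such as splitting keywords) would live in a separate, optional step.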

          Markus Jelsma added a comment -

          Hm, sounds a bit like the index-extra plugin. I haven't used either (nor this one), so I can't really tell which we'd dump and which we'd include.

          Julien Nioche added a comment -

          I did not know about index-extra at all, have linked it to this issue.

          Elisabeth Adler added a comment -

          I updated the plugin to work under Nutch 1.3 (see attached patch NUTCH-809_metatags_1.3.patch). Documentation of usage is in the readme.txt of the plugin (it's called index-metatags).

          Lewis John McGibbney added a comment -

          This is great Elisabeth, thank you. Marked for possible inclusion in 1.4 (and of course nutchgora :0))

          Julien Nioche added a comment -

          -1 for committing patch 809_metatags as-is. The package names should be org.apache and the patch should modify schema.xml.

          Ideally we'd merge it with NUTCH-422 + NUTCH-1005 but also url-meta, instead of having various plugins doing very similar things. However, this could be done in 1.5 and we could try and include this in 1.4.

          Thanks

          Chris A. Mattmann added a comment -
          • push
          Dean Del Ponte added a comment -

          Is this available as a packaged plugin? If so, how does one go about installing it? Thanks!

          Elisabeth Adler added a comment -

          I attached 'metatags-plugin+tutorial.zip' which contains the bundled plugin and some documentation on how to install and use the metatags plugin.

          Guys, I wasn't sure if the documentation of the plugin should be included in the wiki since it's not actually part of Nutch yet, so please let me know if I should put it up on the wiki (and where).

          Lewis John McGibbney added a comment -

          Hi Elisabeth, although I haven't had time to look through your zip yet, a big thank you must be aimed your way. If you have time and are willing, please create a new page on the Nutch wiki under plugin central [1]. As you can see, this issue is closely linked to some others of a similar nature, so it may or may not change in the future. However, community-driven documentation is exactly what we are after and it is greatly welcomed.

          Please contact me off list or at dev@ with your wiki username and I will add you to the wiki contributors page.

          Thank you

          [1] http://wiki.apache.org/nutch/PluginCentral

          Elisabeth Adler added a comment -

          Documentation available on: http://wiki.apache.org/nutch/IndexMetatags

          Rajasekar Karthik added a comment -

          Hi Elisabeth - Metatags plugin is great.
          But, it does not work with Nutch 1.4 (followed the documentation)
          Does it only work with Nutch 1.3?
          Thanks!

          Elisabeth Adler added a comment -

          I haven't tested the plugin in 1.4 myself, but I think a few guys on the mailing list already used it with 1.4.

          Julien Nioche added a comment -

          Patch for NUTCH-809 against trunk. Delegates the indexing to index-metatags.

          Julien Nioche added a comment -

          Trunk : Committed revision 1303371.

          Not activated by default. See nutch-default.xml for details.

          TODO: update the wiki, port to the gora branch, add fields to Solr and activate it by default (any volunteers?)

          Hudson added a comment -

          Integrated in nutch-trunk-maven #206 (See https://builds.apache.org/job/nutch-trunk-maven/206/)
          NUTCH-809 Parse-metatags plugin (jnioche) (Revision 1303371)

          Result = SUCCESS
          jnioche :
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/conf/nutch-default.xml
          • /nutch/trunk/src/plugin/build.xml
          • /nutch/trunk/src/plugin/parse-metatags
          • /nutch/trunk/src/plugin/parse-metatags/README.txt
          • /nutch/trunk/src/plugin/parse-metatags/build.xml
          • /nutch/trunk/src/plugin/parse-metatags/ivy.xml
          • /nutch/trunk/src/plugin/parse-metatags/plugin.xml
          • /nutch/trunk/src/plugin/parse-metatags/sample
          • /nutch/trunk/src/plugin/parse-metatags/sample/testMetatags.html
          • /nutch/trunk/src/plugin/parse-metatags/src
          • /nutch/trunk/src/plugin/parse-metatags/src/java
          • /nutch/trunk/src/plugin/parse-metatags/src/java/org
          • /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache
          • /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch
          • /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse
          • /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/MetaTagsParser.java
          • /nutch/trunk/src/plugin/parse-metatags/src/test
          • /nutch/trunk/src/plugin/parse-metatags/src/test/org
          • /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache
          • /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch
          • /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse
          • /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html
          • /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html/TestMetatagParser.java
          Lewis John McGibbney added a comment -

          Hi Julien,

          Can you confirm what you would like to see added to the wiki? I will try my best to get this added. Are you referring to [0]? Also, I thought the best thing to do regarding porting to Nutchgora is just to add it to the ever growing NUTCH-1104 list, so I have done so. If and when this is required over there, someone can duly oblige.
          Regarding adding fields to Solr, I assume you mean schema and solr-mapping.xml?
          Finally, can you expand on 'activate by default'? What exactly is it that is not activated by default? I read your README.txt but I can't see any mention of it in there.
          Thanks

          Oh and great patch, this is one which as we know is very much appreciated by everyone.
          [0] http://wiki.apache.org/nutch/IndexStructure

          Hudson added a comment -

          Integrated in Nutch-trunk #1794 (See https://builds.apache.org/job/Nutch-trunk/1794/)
          NUTCH-809 Parse-metatags plugin (jnioche) (Revision 1303371)

          Result = SUCCESS
          jnioche : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1303371
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/conf/nutch-default.xml
          • /nutch/trunk/src/plugin/build.xml
          • /nutch/trunk/src/plugin/parse-metatags
          • /nutch/trunk/src/plugin/parse-metatags/README.txt
          • /nutch/trunk/src/plugin/parse-metatags/build.xml
          • /nutch/trunk/src/plugin/parse-metatags/ivy.xml
          • /nutch/trunk/src/plugin/parse-metatags/plugin.xml
          • /nutch/trunk/src/plugin/parse-metatags/sample
          • /nutch/trunk/src/plugin/parse-metatags/sample/testMetatags.html
          • /nutch/trunk/src/plugin/parse-metatags/src
          • /nutch/trunk/src/plugin/parse-metatags/src/java
          • /nutch/trunk/src/plugin/parse-metatags/src/java/org
          • /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache
          • /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch
          • /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse
          • /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/MetaTagsParser.java
          • /nutch/trunk/src/plugin/parse-metatags/src/test
          • /nutch/trunk/src/plugin/parse-metatags/src/test/org
          • /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache
          • /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch
          • /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse
          • /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html
          • /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html/TestMetatagParser.java
          Julien Nioche added a comment -

          Hi Lewis

          Can you confirm what you would like to see added to the wiki? I will try my best to get this added. Are you referring to [0]?

          Nope. I meant replacing the wiki page written by Elisabeth with instructions on what to do to get the metatags parsed and indexed. What I committed relies on another plugin for indexing metadata, whereas the old one had its own indexer etc.

          Also, I thought the best thing to do regarding porting to Nutchgora is just to add it to the ever growing NUTCH-1104 list, so I have done so. If and when this is required over there, someone can duly oblige.

          Good thinking.

          Regarding adding fields to Solr, I assume you mean schema and solr-mapping.xml?

          Yes, this will be needed if we want this to be on by default, which I think is a good idea.

          Finally, can you expand on 'activate by default'? What exactly is it that is not activated by default? I read your README.txt but I can't see any mention of it in there.

          Plugins have to be listed in plugin.includes in order to be used. Thinking about it, it would be good to declare a dependency on index-metatags so that the latter is activated automatically (assuming plugin.auto-activation = true).

          Thanks

          Julien
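
To make the plugin.includes point above concrete, here is a small Python sketch of how a regular-expression include list selects plugins. It assumes the expression must match the whole plugin id, and the include values shown are made-up examples rather than the defaults shipped in nutch-default.xml.

```python
import re


def active_plugins(plugin_includes, available):
    """Return the plugin ids selected by a plugin.includes-style regex.

    Assumption: the expression has to match the entire plugin id.
    """
    pattern = re.compile(plugin_includes)
    return [plugin for plugin in available if pattern.fullmatch(plugin)]


# Hypothetical plugin ids and include expressions, for illustration only
AVAILABLE = [
    "protocol-http", "parse-html", "parse-tika",
    "parse-metatags", "index-basic", "index-metatags",
]
WITHOUT_METATAGS = "protocol-http|parse-html|index-basic"
WITH_METATAGS = "protocol-http|parse-(html|metatags)|index-(basic|metatags)"
```

With the first expression parse-metatags is never loaded, which is what "not activated by default" means here; the second expression selects both the parsing and the indexing plugin.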

          Markus Jelsma added a comment -

          20120304-push-1.6

          Julien Nioche added a comment -

          Updated instructions on WIKI http://wiki.apache.org/nutch/IndexMetatags

          Kristof added a comment -

          Hello,

          I have been working with the plugin for some time and it really works for everything (approx. 100 metadata fields) I need to extract from a set of webpages. I am mapping these fields to Solr and only have a problem with fields which I want to convert to a format other than string. I have several date fields formatted as yyyy-mm-dd, and no matter what I try I cannot get them to end up as Solr date fields, as this requires the data in the format yyyy-mm-ddThh:mm:ssZ. Simply declaring the field as date in the schema results in an error. I have no control over the format in which the dates are stored in the webpages and nothing I tried in Solr works, so my only remaining guess is that I need to look into changing the format within Nutch. Any hint on how to do that?

          Thanks!

          Kristof

          Julien Nioche added a comment -

          Kristof, please use the mailing list instead. Thanks

          Kristof added a comment -

          Julien, maybe another way round will work. I have been trying to find the source file for MetaTagsIndexer.class to adjust it, but I can only seem to find MetaTagsParser.java. It would be great if anyone could send me the MetaTagsIndexer.java file. Thanks! If that does not work, I will try the mailing list.

          Kristof added a comment -

          I finally decompiled the binary of MetaTagsIndexer.class and extended it so that I can now configure in nutch-site.xml which metatags I want to convert to date format. It works fine when mapping them to Solr date fields. As I am new to JIRA, I do not know how to share this. Let me know if it would be helpful and, if so, how I can share it here.
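
The conversion described here can be sketched in a few lines of Python. This is not Kristof's code, which was never attached to this issue: the function names and the list of date fields are hypothetical, and treating a bare date as midnight UTC is an assumption.

```python
from datetime import datetime, timezone


def to_solr_date(value, input_format="%Y-%m-%d"):
    """Convert a yyyy-mm-dd string to the yyyy-mm-ddThh:mm:ssZ form.

    Assumption: a date without a time is taken as midnight UTC.
    """
    parsed = datetime.strptime(value.strip(), input_format)
    return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def convert_date_fields(fields, date_fields):
    """Return a copy of fields with the configured date fields converted."""
    converted = dict(fields)
    for name in date_fields:
        if name in converted:
            converted[name] = to_solr_date(converted[name])
    return converted
```

Here `date_fields` plays the role of the configurable list of metatags mentioned in the comment; values that do not parse raise a ValueError rather than being indexed in the wrong format.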

          Markus Jelsma added a comment -

          Please open a new issue and describe your improvement. You can then attach your patch to the newly created ticket.

          thanks


            People

            • Assignee:
              Julien Nioche
              Reporter:
              Julien Nioche
            • Votes:
              2
              Watchers:
              3
