Nutch / NUTCH-422

index-extra plugin creates additional fields in the index, based on configurable logic

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.8.1
    • Fix Version/s: 1.5
    • Component/s: indexer
    • Labels:
      None
    • Environment:

      All environments

      Description

      Extract from the Readme file:

      A. Introduction

      The index-extra plugin allows you to configure additional fields that you wish to be added to the index, based on one of the following sources:

      • The parsed text
      • Meta data fields
      • Previously created document-to-be-indexed fields
      • Plain constant string
      • Java expression combining one or more of the above, and resolving to a string

      A regex can also be applied to any of the above, allowing fields to be created based on patterns extracted from the source.
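
As a rough illustration of the idea described in the introduction (not the plugin's actual code or configuration format), the following Java sketch builds extra fields from parsed text, metadata, a constant, and an expression, with a regex applied to one source. The class name, method name, field names, and sample values are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the index-extra plugin.
class ExtraFieldSketch {

    /**
     * Returns the first capture group of regex in source, or null if there is
     * no match. The regex is assumed to contain at least one capture group.
     */
    static String extract(String source, String regex) {
        Matcher m = Pattern.compile(regex).matcher(source);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample inputs standing in for the parsed text and the metadata of a page.
        String parsedText = "Order ref: AB-1234 shipped on Monday";
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("author", "Jane Doe");

        Map<String, String> extraFields = new LinkedHashMap<>();
        // Source: parsed text, narrowed down by a regex.
        extraFields.put("orderRef", extract(parsedText, "ref: ([A-Z]+-\\d+)"));
        // Source: a metadata field.
        extraFields.put("author", metadata.get("author"));
        // Source: a plain constant string.
        extraFields.put("origin", "intranet");
        // Source: an expression combining previously created fields.
        extraFields.put("label", extraFields.get("author") + " / " + extraFields.get("orderRef"));

        System.out.println(extraFields);
    }
}
```

In the plugin itself these choices are made in index-extra-conf.xml rather than in code; the sketch only shows how the listed source types relate to one another.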

      B. Installation

      1) Binaries only:
      • Copy the 'index-extra' folder within index-extra-v1.0-bin-java1.5.zip to NUTCHDIR/build
      • Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure it
      • Enable the plugin by updating the nutch-site.xml file

      2) Source code: Always refer to the Nutch wiki for detailed instructions on building Nutch. In short:
      • Copy the 'index-extra' folder within index-extra-v1.0-source.zip to NUTCHDIR/src/plugin
      • Update the build.xml in NUTCHDIR/src/plugin to include the plugin
      • Update the NUTCHDIR/default.properties file to include the plugin
      • Run ant to build
      • Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure it
      • Enable the plugin by updating the nutch-site.xml file

      C. Known Issues

      1) For this plugin to work correctly on any document field, the other index filters must run before it,
      so that all basic document fields have already been generated. To do this, configure the indexingfilter.order
      property. (Please see patch NUTCH-421 to enable the indexingfilter.order property. If this patch is not applied,
      the plugin will still work, but will not be able to use document fields created by other index filter plugins.)

      2) At this stage, field boost cannot be used, as Nutch scoring overrides the field boost with its own
      document-level boost calculation. This occurs at the end of org.apache.nutch.indexer.Indexer's reduce method.
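
Known issue 1 comes down to filter ordering. The following is a minimal sketch of how a whitespace-separated order property could be applied to a set of filter class names so that index-extra runs last. The FilterOrderSketch class is hypothetical, and both the property format and the handling of unlisted filters (appended here in their original order) are assumptions rather than something taken from NUTCH-421.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of ordered filter selection; not Nutch code.
class FilterOrderSketch {

    /**
     * Names listed in orderProperty come first, in the listed order.
     * Names not listed are appended afterwards in their original order
     * (an assumption made for this sketch).
     */
    static List<String> order(List<String> available, String orderProperty) {
        List<String> result = new ArrayList<>();
        if (orderProperty != null && !orderProperty.trim().isEmpty()) {
            for (String name : orderProperty.trim().split("\\s+")) {
                if (available.contains(name) && !result.contains(name)) {
                    result.add(name);
                }
            }
        }
        for (String name : available) {
            if (!result.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> available = Arrays.asList(
            "org.apache.nutch.indexer.extra.ExtraIndexingFilter",
            "org.apache.nutch.indexer.basic.BasicIndexingFilter",
            "org.apache.nutch.indexer.more.MoreIndexingFilter");
        // Example value for the order property: basic fields first, index-extra last.
        String orderProperty =
            "org.apache.nutch.indexer.basic.BasicIndexingFilter "
            + "org.apache.nutch.indexer.more.MoreIndexingFilter "
            + "org.apache.nutch.indexer.extra.ExtraIndexingFilter";
        for (String name : order(available, orderProperty)) {
            System.out.println(name);
        }
    }
}
```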

      1. index-extra-v1.0-source.zip
        319 kB
        Alan Tanaman
      2. index-extra-v1.0-bin-java1.5.zip
        22 kB
        Alan Tanaman
      3. ExtraIndexingFilter.java
        11 kB
        garpinc

          Activity

          nutch.newbie added a comment -

          I have got it to work, but it took me a while to index fields properly. It's a rather complex plugin and definitely requires more documentation and examples from a newbie's perspective. I can see my indexed field using Luke. However, I don't have the necessary query plugin to do a search, e.g. find 'xyz' in the field 'author', metadata etc. Any plans for a query-extra plugin, where you define query items via query-extra-conf.xml or something similar?

          Also, the boost feature is important. Do you have any patch to solve known issue 2?

          Good work getting a complex plugin to work not so complexly :-0)

          Alan Tanaman added a comment -

          Many thanks for your feedback.

          Do you have any specifics in mind regarding examples? I will try and include any additional ones that we implement. I know there are a lot of options, but it is a little hard to see what is unclear from my end – as I am so involved in the development, another point-of-view on this is welcome.

          Regarding query-extra, we are not currently using the Nutch bean, so the need has not arisen for us at this point in time, but I can see how that would be useful. I guess you could adapt one of the existing query-xxxx plugins fairly easily by having them read the xml configuration file to see what fields are potentially available in the index.

          As for the boost, I included that as it seems like a useful thing to be able to control the boost of a single field, although we don't need that at this very moment. The line of code in the org.apache.nutch.indexer.Indexer's reduce method could be overridden, but I'm not yet sure how that would affect the overall scoring (scoring is one of my really weak points). Perhaps one of the scoring experts could give some guidance on this?

          Sami Siren added a comment -

          Is there a reason for the two jakarta-regexp jars (v1.2 and v1.3) in the source package?

          Sami Siren added a comment -

          A couple more points:
          • Source files use tabs for indentation
          • File headers are not consistent and should be updated
          • The module bundles jdom, which is already part of Nutch; it should use the existing one instead
          • No JUnit tests; not strictly a requirement, but having some is a big plus!

          Alan Tanaman added a comment -

          Sami,

          About your questions - thank you for looking at this plugin. I will be
          seeing to all of them and will respond over the next week, as I currently have
          a couple of stressed clients...

          Best regards,
          Alan

          Nathan ter Bogt added a comment -

          Has anyone got the binary version of this module to work? I get to the indexing part and get the following error:

          Exception in thread "main" java.io.IOException: Job failed!
          at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
          at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
          at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)

          And this is what I get in my hadoop log:

          2007-03-07 15:26:33,272 INFO indexer.Indexer - Optimizing index.
          2007-03-07 15:26:33,275 WARN mapred.LocalJobRunner - job_qq3l2z
          java.lang.NoClassDefFoundError: org/jdom/JDOMException
          at org.apache.nutch.indexer.extra.ExtraIndexingFilter.filter(ExtraIndexingFilter.java:68)
          at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:72)
          at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:235)
          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
          at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)

          Any help would be greatly appreciated. Lastly, I'm all for the query-extra plugin also.

          Nathan ter Bogt added a comment -

          Sorry all,

          I managed to get this working. Just had some issues with the jdom library (or lack thereof).
          I must have just misread the error earlier.

          Fantastic plugin idea too, thanks!
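
A NoClassDefFoundError such as the one for org/jdom/JDOMException above usually means a required jar is not visible to the class loader that loads the plugin. The following Java sketch is one way to check whether a class can be loaded; the ClasspathCheck class is hypothetical, and because Nutch loads plugins through their own class loaders, a result obtained outside the plugin may differ from what the plugin itself sees.

```java
// Hypothetical diagnostic helper; not part of Nutch or the index-extra plugin.
class ClasspathCheck {

    /** Returns true if the named class can be loaded by this class's class loader. */
    static boolean isAvailable(String className) {
        try {
            // 'false' avoids running static initializers; we only test visibility.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.jdom.JDOMException";
        System.out.println(name
            + (isAvailable(name) ? " is on the classpath" : " is NOT on the classpath"));
    }
}
```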

          Peter Boot added a comment -

          I am getting errors when trying to compile this plugin with the trunk.
          Has anyone managed to update it?
          Is there a better way to get Nutch to create termVectors?

          [echo] Compiling plugin: index-extra
          [javac] Compiling 3 source files to /opt/nutch-trunk/build/index-extra/classes
          [javac] /opt/nutch-trunk/src/plugin/index-extra/src/java/org/apache/nutch/indexer/extra/ExtraIndexingFilter.java:61:
          org.apache.nutch.indexer.extra.ExtraIndexingFilter is not abstract and
          does not override abstract method
          filter(org.apache.lucene.document.Document,org.apache.nutch.parse.Parse,org.apache.hadoop.io.Text,org.apache.nutch.crawl.CrawlDatum,org.apache.nutch.crawl.Inlinks)
          in org.apache.nutch.indexer.IndexingFilter
          [javac] public class ExtraIndexingFilter implements IndexingFilter {
          [javac] ^
          [javac] Note: /opt/nutch-trunk/src/plugin/index-extra/src/java/org/apache/nutch/indexer/extra/ExtraIndexingFilter.java
          uses or overrides a deprecated API.

          Alex McLintock added a comment -

          May I ask if this code still works with Nutch 1.0?

          Thanks

          Morille Jerome added a comment -

          No, it doesn't work with Nutch 1.0.
          It still uses the Lucene Document and not the NutchDocument from the new APIs.
          That is easy to correct.

          If you want to use it, take care with this code. A quick read shows:

          • An InputStream is opened and never closed
          • Exceptions are caught and ignored

          The idea is good.
          The plugins in the Nutch distribution don't make it easy to customize index data.

          There is something to be done here!
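
The two code-quality points above (a stream that is never closed, exceptions that are dropped) are commonly addressed along the following lines. This is a generic Java sketch, not a patch to ExtraIndexingFilter; the ConfigReadSketch class and readAll method are hypothetical. It uses try-with-resources, which needs Java 7 or later, so a Java 1.5 build like the attached binaries would need try/finally instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical example of reading a configuration stream safely.
class ConfigReadSketch {

    /**
     * Reads the whole stream as UTF-8 text. The stream is always closed,
     * and a read failure is rethrown with its cause instead of being ignored.
     */
    static String readAll(InputStream in) {
        try (InputStream stream = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = stream.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read configuration", e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
            "<config/>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in));
    }
}
```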

          garpinc added a comment -

          I don't see the meta tags in the Parse object. What might I be doing wrong?

          I've attached code for Nutch 1.0.

          Dietrich Schmidt added a comment -

          This is not compatible with Nutch 2. I made a quick attempt to refactor, but it is too complex without
          a good understanding of the Nutch architecture. Has anyone else tried their luck?

          Julien Nioche added a comment -

          Functionality implemented in: https://issues.apache.org/jira/browse/NUTCH-1264 and https://issues.apache.org/jira/browse/NUTCH-940
          Manuel Antonio Novoa added a comment -

          Can I use this plugin to index the attributes of HTML img tags?

          For example: <img alt="this is the text I want to index" src="this is another text that I want to index">
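
For the img example, a regex of the kind the plugin applies to its sources could look roughly like the Java sketch below. The ImgAttributeSketch class is hypothetical, and it assumes the raw HTML is available as a string; the Readme lists parsed text, metadata, and document fields as sources, so whether the plugin ever sees the original tags is not established here. A regex is also not a full HTML parser and can be fooled by unusual markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the index-extra plugin.
class ImgAttributeSketch {

    /**
     * Returns the double-quoted value of the named attribute in the first
     * <img> tag that has it, or null if no such tag is found.
     */
    static String imgAttribute(String html, String attribute) {
        Pattern p = Pattern.compile(
            "<img\\b[^>]*?\\b" + Pattern.quote(attribute) + "\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<p><img alt=\"A red bicycle\" src=\"/images/bike.png\"></p>";
        System.out.println("alt = " + imgAttribute(html, "alt"));
        System.out.println("src = " + imgAttribute(html, "src"));
    }
}
```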


            People

            • Assignee:
              Sami Siren
              Reporter:
              Alan Tanaman
            • Votes:
              4
              Watchers:
              9
