  1. Nutch
  2. NUTCH-260

Three new plugins that parse, index and query meta tags defined in the configuration

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.7.2
    • Fix Version/s: None
    • Component/s: indexer
    • Labels: None
    • Environment: Built and tested on Linux so far.

    Description

      These plugins allow you to define meta tags in your nutch-site file that you want to include in parsing, indexing and searching. The query plugin must replace query-basic. The format for adding query terms to nutch-site.xml is:

      <property>
      <name>meta.names</name>
      <value>keywords,recommended</value>
      <description>This is a comma separated list of meta tag names that will
      be parsed, indexed and searched against when parse-meta, index-meta and
      query-meta are used.</description>
      </property>

      <property>
      <name>meta.boosts</name>
      <value>1.0,5.0</value>
      <description>Comma separated list of boost values when searching using
      query-meta. The order of the values should match the order of meta.names.
      </description>
      </property>
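
Because meta.boosts is matched to meta.names purely by position, the two lists can be paired up as in the following minimal sketch. It is written in modern Java for brevity and is not taken from the attached plugins; the class name MetaBoostConfig and its parse method are hypothetical, and the two property values are passed in as plain strings rather than read through Nutch's configuration classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not from the attached tarball): pairs meta.names with meta.boosts. */
final class MetaBoostConfig {

    /**
     * @param names  raw value of meta.names, e.g. "keywords,recommended"
     * @param boosts raw value of meta.boosts, e.g. "1.0,5.0"
     * @return meta tag name mapped to its boost, in configuration order
     */
    static Map<String, Float> parse(String names, String boosts) {
        String[] nameList = names.trim().split("\\s*,\\s*");
        String[] boostList = boosts.trim().split("\\s*,\\s*");

        // The boosts are positional, so a length mismatch is a configuration error.
        if (nameList.length != boostList.length) {
            throw new IllegalArgumentException(
                "meta.names has " + nameList.length + " entries but meta.boosts has "
                + boostList.length);
        }

        Map<String, Float> result = new LinkedHashMap<>();
        for (int i = 0; i < nameList.length; i++) {
            result.put(nameList[i], Float.parseFloat(boostList[i]));
        }
        return result;
    }
}
```

With the values shown above, parse("keywords,recommended", "1.0,5.0") maps "keywords" to 1.0 and "recommended" to 5.0.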

      Meta tags found are assumed to have either a single value or a comma separated list of values. The values found are added to the index as Lucene keywords. For example, <meta name="keywords" content="First Thing, Second Thing"> would result in two keyword fields named "keywords": the first would contain "First Thing" and the second would contain "Second Thing".
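
The splitting rule described above could look roughly like the sketch below. It only illustrates the comma handling and is not the attached parse-meta code; the class name MetaValueSplitter is hypothetical, and trimming whitespace and skipping empty entries are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns one meta tag's content into individual keyword values. */
final class MetaValueSplitter {

    /** Splits on commas, trims each piece, and drops empty pieces (both assumptions). */
    static List<String> split(String content) {
        List<String> values = new ArrayList<>();
        for (String part : content.split(",")) {
            String value = part.trim();
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }
}
```

Each returned value would then be indexed as its own untokenized keyword field under the meta tag's name, so split("First Thing, Second Thing") yields the two values "First Thing" and "Second Thing".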

      I had to replace the query-basic plugin in order to allow matches in the meta fields to return hits even if there were no matches in any of the default fields. The query-basic plugin only returns hits when every search term is found in at least one default field. I needed hits returned if matches were found in at least one field for every term, and/or the entire search phrase appeared in a meta index field.
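
One possible reading of that matching rule is sketched below as a plain boolean check, without any Lucene or Nutch query classes. It is not the attached query-meta implementation. The class MetaMatchSketch is hypothetical, and several points are assumptions: default-field text is modelled as a set of lowercased tokens, meta fields are modelled as whole untokenized values, individual terms are checked only against the default fields, and the phrase comparison ignores case.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical model of the hit rule; one reading of the description, not the plugin code. */
final class MetaMatchSketch {

    /**
     * @param queryTerms         lowercased query terms, in order
     * @param defaultFieldTokens lowercased tokens from all default fields of one document
     * @param metaValues         untokenized keyword values from that document's meta fields
     */
    static boolean isHit(List<String> queryTerms,
                         Set<String> defaultFieldTokens,
                         List<String> metaValues) {
        if (queryTerms.isEmpty()) {
            return false;
        }

        // Behaviour attributed to query-basic: every term appears in a default field.
        boolean allTermsInDefaults = defaultFieldTokens.containsAll(queryTerms);

        // Added behaviour: the whole search phrase equals one meta keyword value.
        String phrase = String.join(" ", queryTerms);
        boolean phraseInMeta = false;
        for (String value : metaValues) {
            if (value.equalsIgnoreCase(phrase)) {
                phraseInMeta = true;
                break;
            }
        }

        return allTermsInDefaults || phraseInMeta;
    }
}
```

Under these assumptions a document whose only match is the meta value "Next Big Thing" is returned for the query "next big thing", even though none of its default fields contain those terms.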

      One known bug is that common terms are not getting stripped out of the fields' values before they get indexed, so "The Next Big Thing" could not be matched because the query engine will strip out "the" from all queries. I intend to fix this by stripping out common terms from meta fields before indexing them.
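
The intended fix might be sketched as follows. This is an illustration only: the class MetaStopWordStripper is hypothetical, and the short COMMON_TERMS list is a placeholder assumption, not the list of common terms that the Nutch query engine actually removes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: removes common terms from a meta value before it is indexed. */
final class MetaStopWordStripper {

    // Placeholder list (assumption); the real list should mirror what the query engine strips.
    static final Set<String> COMMON_TERMS =
        new HashSet<>(Arrays.asList("the", "a", "an", "of", "and"));

    /** Drops common terms and rejoins the remaining words with single spaces. */
    static String strip(String value) {
        StringBuilder out = new StringBuilder();
        for (String word : value.trim().split("\\s+")) {
            if (word.isEmpty() || COMMON_TERMS.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word);
        }
        return out.toString();
    }
}
```

With this placeholder list, strip("The Next Big Thing") returns "Next Big Thing", which is the form a query would take after "the" has been removed from it.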

      Another issue is that searching for "Next Big Thing" would not match meta index values for "Next", "Big" or "Thing". You can consider that a bug or a feature depending on how you look at it.

      These plugins were written for and only work on the 0.7.2 branch.

      I'm going to attach a tarball of the source of these three plugins after I create the issue. To use the plugins, you'll need to untar them in your src/plugins directory and add them to the ant build.xml directive (and of course add them in your nutch-site.xml file). If these end up getting added to the project, I'll write up documentation on the wiki.

      Attachments

        1. nutch_customizations.tar
          40 kB
          Jake Vanderdray

          People

            Assignee: Unassigned
            Reporter: Jake Vanderdray (clumpidy)