Nutch / NUTCH-1140

index-more plugin, resetTitle method creates multiple values in the Title field


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.3
    • Fix Version/s: None
    • Component/s: indexer
    • Labels: None
    • Patch Info: Patch Available

    Description

      According to the comments in MoreIndexingFilter.java, the index-more plugin is meant to reset the Title field of a document if it contains a Content-Disposition header. The current behavior is to add a Title value whether or not one already exists, so a document can end up with more than one title. This can cause problems later in the Solr indexing process, where a single-valued title field then receives multiple values. Based on a thread on the Nutch user list, some users appear to be working around this by marking the title field as multi-valued in the schema:
      http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8

      The following patch removes the title field before adding a new one, which has resolved the issue for me:

      --- MoreIndexingFilter.old 2011-09-30 11:44:35.000000000 +0000
      +++ MoreIndexingFilter.java 2011-09-30 09:58:48.000000000 +0000
      @@ -276,6 +276,7 @@
             for (int i=0; i<patterns.length; i++) {
               if (matcher.contains(contentDisposition,patterns[i])) {
                 result = matcher.getMatch();
      +          doc.removeField("title");
                 doc.add("title", result.group(1));
                 break;
               }
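
The effect of the one-line change can be illustrated outside Nutch with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: `SimpleDoc` is a hypothetical stand-in for a multi-valued index document (it is not Nutch's NutchDocument), `FILENAME` is an illustrative pattern rather than one of the plugin's real patterns, and `java.util.regex` is used in place of the matcher seen in the patch. The `removeFirst` flag only exists to compare the behavior before and after the fix.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResetTitleSketch {

    /** Hypothetical stand-in for a multi-valued index document (not Nutch's class). */
    static class SimpleDoc {
        private final Map<String, List<String>> fields = new HashMap<>();

        /** Appends a value; an existing value for the same field is kept. */
        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        /** Drops every value stored under the given field name. */
        void removeField(String name) {
            fields.remove(name);
        }

        List<String> values(String name) {
            return fields.getOrDefault(name, new ArrayList<>());
        }
    }

    // Illustrative pattern only; the plugin's actual patterns are not shown in this issue.
    static final Pattern FILENAME = Pattern.compile("\\bfilename=['\"]?([^'\";]+)");

    /**
     * Sets the title from a Content-Disposition header value.
     * removeFirst=false mimics the reported behavior, true mimics the patched behavior.
     */
    static SimpleDoc resetTitle(SimpleDoc doc, String contentDisposition, boolean removeFirst) {
        if (contentDisposition == null) {
            return doc;
        }
        Matcher m = FILENAME.matcher(contentDisposition);
        if (m.find()) {
            if (removeFirst) {
                doc.removeField("title"); // the step the patch adds
            }
            doc.add("title", m.group(1));
        }
        return doc;
    }

    public static void main(String[] args) {
        String header = "attachment; filename=\"report.pdf\"";

        SimpleDoc before = new SimpleDoc();
        before.add("title", "Page title");
        resetTitle(before, header, false);
        System.out.println(before.values("title")); // prints [Page title, report.pdf]

        SimpleDoc after = new SimpleDoc();
        after.add("title", "Page title");
        resetTitle(after, header, true);
        System.out.println(after.values("title")); // prints [report.pdf]
    }
}
```

Without the removal step the document carries two title values, which is the condition a single-valued title field in the index schema would reject.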

      Attachments

        1. NUTCH-1140-trunk-v2.patch
          2 kB
          Sebastian Nagel
        2. 0001-NUTCH-1140-trunk.patch
          1 kB
          kaveh minooie
        3. 0001-NUTCH-1140-2.x.patch
          1 kB
          kaveh minooie
        4. MoreIndexingFilter.093011.patch
          0.4 kB
          Joe Liedtke

            People

              Assignee: Unassigned
              Reporter: Joe Liedtke (joe.liedtke)
              Votes: 1
              Watchers: 4
