Nutch / NUTCH-1140

index-more plugin, resetTitle method creates multiple values in the Title field

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.3
    • Fix Version/s: None
    • Component/s: indexer
    • Labels: None
    • Patch Info: Patch Available

      Description

      According to the comments in MoreIndexingFilter.java, the index-more plugin is meant to reset the title field of a document when the response contains a Content-Disposition header. Instead, resetTitle adds a title value whether or not one already exists, so the document can end up with more than one title. This can cause problems later in the Solr indexing process, and a thread on the Nutch user list suggests that some users are working around it by marking title as multi-valued in the schema:
      http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8

      The following patch removes the title field before adding a new one, which has resolved the issue for me:

      --- MoreIndexingFilter.old 2011-09-30 11:44:35.000000000 +0000
      +++ MoreIndexingFilter.java 2011-09-30 09:58:48.000000000 +0000
      @@ -276,6 +276,7 @@
           for (int i=0; i<patterns.length; i++) {
             if (matcher.contains(contentDisposition,patterns[i])) {
               result = matcher.getMatch();
      +        doc.removeField("title");
               doc.add("title", result.group(1));
               break;
             }
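
      The remove-then-add behavior of the patch can be sketched in isolation. This is only an illustration, not the plugin's code: SimpleDoc is a hypothetical stand-in for the Nutch document class, the FILENAME pattern is an assumed simplification of the plugin's patterns, and java.util.regex replaces the matcher used in MoreIndexingFilter.java.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only. SimpleDoc is a hypothetical stand-in for the
// Nutch document class; it is not part of Nutch.
class TitleResetSketch {

  // Minimal multi-valued document: each field name maps to a list of values.
  static class SimpleDoc {
    private final Map<String, List<String>> fields = new HashMap<>();

    // Appends a value; calling this twice for one name yields two values.
    void add(String name, String value) {
      fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Drops every value stored under the given name.
    void removeField(String name) {
      fields.remove(name);
    }

    List<String> getValues(String name) {
      return fields.getOrDefault(name, new ArrayList<>());
    }
  }

  // Assumed pattern: captures the filename parameter of a Content-Disposition
  // header. The real plugin uses its own list of patterns.
  private static final Pattern FILENAME =
      Pattern.compile("\\bfilename=['\"]?([^'\";]+)", Pattern.CASE_INSENSITIVE);

  // Replaces the title with the filename from the header, if one is present.
  static void resetTitle(SimpleDoc doc, String contentDisposition) {
    if (contentDisposition == null) {
      return; // no header: leave the existing title untouched
    }
    Matcher m = FILENAME.matcher(contentDisposition);
    if (m.find()) {
      doc.removeField("title");            // the step added by the patch
      doc.add("title", m.group(1).trim()); // title now has exactly one value
    }
  }
}
```

      Without the removeField call, a document that already has a title would hold two values after resetTitle runs, which is the behavior reported here.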

        Attachments

        1. 0001-NUTCH-1140-2.x.patch (1 kB, kaveh minooie)
        2. 0001-NUTCH-1140-trunk.patch (1 kB, kaveh minooie)
        3. MoreIndexingFilter.093011.patch (0.4 kB, Joe Liedtke)
        4. NUTCH-1140-trunk-v2.patch (2 kB, Sebastian Nagel)

              People

              • Assignee: Unassigned
              • Reporter: Joe Liedtke (joe.liedtke)
              • Votes: 1
              • Watchers: 4
