Nutch / NUTCH-1140

index-more plugin, resetTitle method creates multiple values in the Title field

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.3
    • Fix Version/s: 1.9
    • Component/s: indexer
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      From the comments in MoreIndexingFilter.java, the index-more plugin is meant to reset the Title field of a document if it contains a Content-Disposition header. The current behavior is to add a Title regardless of whether one already exists, which can cause problems later in the Solr indexing process. Based on a thread on the Nutch user list, this appears to be leading some users to mark the title as multi-valued in the schema:
      http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8

      The following patch removes the title field before adding a new one, which has resolved the issue for me:

      --- MoreIndexingFilter.old 2011-09-30 11:44:35.000000000 +0000
      +++ MoreIndexingFilter.java 2011-09-30 09:58:48.000000000 +0000
      @@ -276,6 +276,7 @@
           for (int i=0; i<patterns.length; i++) {
             if (matcher.contains(contentDisposition,patterns[i])) {
               result = matcher.getMatch();
      +        doc.removeField("title");
               doc.add("title", result.group(1));
               break;
             }
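
      To make the difference between adding and resetting concrete, here is a minimal, self-contained sketch. It does not use Nutch itself: SketchDoc is a hypothetical stand-in for the document class, and the FILENAME pattern and the java.util.regex matching are assumptions for illustration only (the plugin has its own patterns and matcher). The only part taken from the patch is the order of calls, removeField("title") followed by add("title", ...).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a multi-valued index document; not the Nutch class.
class SketchDoc {
    private final Map<String, List<String>> fields = new HashMap<>();

    // Appends a value, so repeated calls make the field multi-valued.
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Drops every value stored under the field name.
    void removeField(String name) {
        fields.remove(name);
    }

    // Returns the stored values, or an empty list if the field is absent.
    List<String> values(String name) {
        return fields.getOrDefault(name, new ArrayList<>());
    }
}

class ResetTitleSketch {
    // Illustrative pattern only (assumption); the plugin's real patterns may differ.
    private static final Pattern FILENAME =
        Pattern.compile("\\bfilename=['\"]?([^'\";]+)");

    // Replaces the title with the Content-Disposition filename, if one is found.
    static void resetTitle(SketchDoc doc, String contentDisposition) {
        if (contentDisposition == null) {
            return;
        }
        Matcher m = FILENAME.matcher(contentDisposition);
        if (m.find()) {
            doc.removeField("title");      // the one-line change from the patch
            doc.add("title", m.group(1));  // now the only title value
        }
    }

    public static void main(String[] args) {
        SketchDoc doc = new SketchDoc();
        doc.add("title", "Title from the HTML parser");
        resetTitle(doc, "attachment; filename=\"report.pdf\"");
        System.out.println(doc.values("title")); // prints [report.pdf]
    }
}
```

      Without the removeField call, the sketch document would end up holding both the parser title and the filename, which is the multi-valued situation that a single-valued title field in the schema rejects.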

        Issue Links

          Activity

          Lewis John McGibbney added a comment -

          Do we want to integrate this into trunk and 2.x? If so, I can write the trivial test case.

          Markus Jelsma added a comment -

          20120304-push-1.6

          Lewis John McGibbney added a comment -

          Hi Joe. This one seems to have slipped under the radar somewhat!
          Could you please attach a patch against 1.5 (trunk)?
          Thank you if this is possible.

          Joe Liedtke added a comment -

          Thanks!

          Joe Liedtke added a comment -

          True, however the default schema only allows for one title. It seems like the filter should either make this behavior configurable or reset the title. Additionally, since the method is named resetTitle (and not addTitle, appendTitle, or insertYetAnotherTitle) I can only assume that the intent was to reset the title with a new value rather than append a second value.

          The patch for #1004 should help to mitigate the issue (I haven't had a chance to test it yet, but it makes sense that it could keep this from coming up...), however future plugins could cause this bug to rear its ugly head again. I'd recommend fixing it now to save future headaches. How does that sound?

          Markus Jelsma added a comment -

          Multiple titles are not always bad, but empty titles are. The linked issue already fixes an issue with empty titles. Can you test a 1.4 checkout?

          Joe Liedtke added a comment -

          Proposed patch


            People

            • Assignee:
              Unassigned
            • Reporter:
              Joe Liedtke
            • Votes:
              2
            • Watchers:
              1
