Nutch / NUTCH-1140

index-more plugin, resetTitle method creates multiple values in the Title field

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.3
    • Fix Version/s: 1.11
    • Component/s: indexer
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      From the comments in MoreIndexingFilter.java, the index-more plugin is meant to reset the Title field of a document if it contains a Content-Disposition header. The current behavior is to add a Title regardless of whether one exists or not, which can cause issues down the line with the Solr Indexing process, and based on a thread in the nutch user list it appears that this is causing some users to mark the title as multi-valued in the schema:
      http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8

      The following patch removes the title field before adding a new one, which has resolved the issue for me:

      --- MoreIndexingFilter.old 2011-09-30 11:44:35.000000000 +0000
      +++ MoreIndexingFilter.java 2011-09-30 09:58:48.000000000 +0000
      @@ -276,6 +276,7 @@
         for (int i=0; i<patterns.length; i++) {
           if (matcher.contains(contentDisposition,patterns[i])) {
             result = matcher.getMatch();
      +      doc.removeField("title");
             doc.add("title", result.group(1));
             break;
           }
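      To make the difference between adding and resetting concrete, here is a minimal sketch. It does not use Nutch itself: SimpleDoc is a hypothetical stand-in for a multi-valued index document, and the method names addOnly and removeThenAdd, as well as the sample values, are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued index document (NOT Nutch's NutchDocument).
class TitleResetSketch {

  /** Each field name maps to a list of values, so add() can create duplicates. */
  static class SimpleDoc {
    private final Map<String, List<String>> fields = new HashMap<>();

    void add(String name, String value) {
      fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    void removeField(String name) {
      fields.remove(name);
    }

    List<String> getValues(String name) {
      return fields.getOrDefault(name, new ArrayList<>());
    }
  }

  /** Behavior described in the report: the new title is appended to the old one. */
  static SimpleDoc addOnly(String parsedTitle, String dispositionName) {
    SimpleDoc doc = new SimpleDoc();
    doc.add("title", parsedTitle);
    doc.add("title", dispositionName); // second value in a single-valued field
    return doc;
  }

  /** Behavior of the proposed patch: drop the old title, then add the new one. */
  static SimpleDoc removeThenAdd(String parsedTitle, String dispositionName) {
    SimpleDoc doc = new SimpleDoc();
    doc.add("title", parsedTitle);
    doc.removeField("title");
    doc.add("title", dispositionName);
    return doc;
  }

  public static void main(String[] args) {
    // Sample values are made up for the demonstration.
    System.out.println(addOnly("Care Sheets", "caresheets.pdf").getValues("title"));
    System.out.println(removeThenAdd("Care Sheets", "caresheets.pdf").getValues("title"));
  }
}
```

      With the add-only variant the document ends up with two title values, which is what a schema with a single-valued title field rejects; the remove-then-add variant leaves exactly one.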
      1. 0001-NUTCH-1140-2.x.patch
        1 kB
        kaveh minooie
      2. 0001-NUTCH-1140-trunk.patch
        1 kB
        kaveh minooie
      3. MoreIndexingFilter.093011.patch
        0.4 kB
        Joe Liedtke
      4. NUTCH-1140-trunk-v2.patch
        2 kB
        Sebastian Nagel

        Issue Links

          Activity

          Joe Liedtke added a comment -

          Proposed patch

          Markus Jelsma added a comment -

          Multiple titles are not always bad, but empty titles are. The linked issue already fixes an issue with empty titles. Can you test a 1.4 checkout?

          Joe Liedtke added a comment -

          True, however the default schema only allows for one title. It seems like the filter should either make this behavior configurable or reset the title. Additionally, since the method is named resetTitle (and not addTitle, appendTitle, or insertYetAnotherTitle) I can only assume that the intent was to reset the title with a new value rather than append a second value.

          The patch for #1004 should help to mitigate the issue (I haven't had a chance to test it yet, but it makes sense that it could keep this from coming up...). However, future plugins could cause this bug to rear its ugly head again. I'd recommend fixing it now to save future headaches. How does that sound?

          Joe Liedtke added a comment -

          Thanks!

          Lewis John McGibbney added a comment -

          Hi Joe. This one seems to have slipped under the radar somewhat!
          Can you please attach a patch for 1.5 (trunk)?
          Thank you if this is possible.

          Markus Jelsma added a comment -

          20120304-push-1.6

          Lewis John McGibbney added a comment -

          Do we want to integrate this into trunk and 2.x? If so, I can write the trivial test case.

          kaveh minooie added a comment -

          So this is still an issue. Here is a sample list of URLs in the wild that would trigger this problem:

          http://www.10-s.com/site/tennis-supply/site-map.html
          http://www.bigappleherp.com/site/content/big_apple_cares.html
          http://www.bigappleherp.com/site/content/CareSheets.html
          http://www.bigappleherp.com/site/content/company_information.html
          http://www.bigappleherp.com/site/content/customer_service.html
          http://www.bigappleherp.com/site/content/LiveAnimals.html
          http://www.bigappleherp.com/site/content/testimonials_02.html
          http://www.magellangps.com/lp/truckfamily/screens.html

          Now, based on a bit of reading that I did on Content-Disposition, it is a reasonable alternative way of determining a title, which would mostly be just the file name. But it should NOT override the actual title if one exists, as the information in the title is far more valuable than the file name. Not to mention that the title is the actual title and should not be replaced if some other value exists.
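          The fallback behavior argued for here can be sketched as follows. This is only an illustration under stated assumptions: the class name TitleFallbackSketch, the method chooseTitle, and the simplified filename regular expression are all hypothetical and are not taken from MoreIndexingFilter or from any of the attached patches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: prefer the parsed title, fall back to the
// Content-Disposition file name only when no title is available.
class TitleFallbackSketch {

  // Simplified pattern (an assumption): matches filename=value or filename="value".
  private static final Pattern FILENAME =
      Pattern.compile("\\bfilename\\s*=\\s*\"?([^\";]+)\"?", Pattern.CASE_INSENSITIVE);

  /** Returns the title to index, or null if neither source provides one. */
  static String chooseTitle(String parsedTitle, String contentDisposition) {
    // An existing, non-blank title always wins.
    if (parsedTitle != null && !parsedTitle.trim().isEmpty()) {
      return parsedTitle;
    }
    // Otherwise try the file name from the Content-Disposition header.
    if (contentDisposition != null) {
      Matcher m = FILENAME.matcher(contentDisposition);
      if (m.find()) {
        return m.group(1).trim();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // Header values are made up for the demonstration.
    String header = "attachment; filename=\"caresheets.pdf\"";
    System.out.println(chooseTitle("Care Sheets", header));
    System.out.println(chooseTitle("", header));
  }
}
```

          Either way the document ends up with at most one title value, so the schema does not need a multi-valued title field.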

          kaveh minooie added a comment -

          Sorry, there was a typo in both the patch files

          Lewis John McGibbney added a comment -

          Any issues with committing this fix? I've just run into this issue as well and the most recent patches and comments as suggested by numerous people on this thread solve the issue without hacking the schema in such a way as to have multi-valued titles for a document... which is illogical.

          Sebastian Nagel added a comment -

          +1 (extended patch to include a test)

          Lewis John McGibbney added a comment -

          +1 Seb, can you commit for 1.X and I will port to 2.X?

          Sebastian Nagel added a comment -

          Committed to trunk, r1650181.

          Hudson added a comment -

          SUCCESS: Integrated in Nutch-trunk #2923 (See https://builds.apache.org/job/Nutch-trunk/2923/)
          NUTCH-1140 index-more plugin, resetTitle creates multiple values in title field (snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1650181)

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
          • /nutch/trunk/src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java

            People

            • Assignee: Unassigned
            • Reporter: Joe Liedtke
            • Votes: 1
            • Watchers: 4