NUTCH-1140

index-more plugin, resetTitle method creates multiple values in the Title field

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.3
    • Fix Version/s: 1.11
    • Component/s: indexer
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      From the comments in MoreIndexingFilter.java, the index-more plugin is meant to reset the Title field of a document if it contains a Content-Disposition header. The current behavior is to add a Title regardless of whether one already exists, which can cause issues down the line with the Solr indexing process. Based on a thread on the Nutch user list, this appears to be causing some users to mark the title as multi-valued in the schema:
      http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
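
      For context, the extra title value comes from the file name carried in the Content-Disposition header. A minimal Java sketch of extracting it, assuming a simple quoted or unquoted filename parameter. The class name, method name, and regular expression are illustrative assumptions and are not the patterns MoreIndexingFilter actually uses; the extended filename*= form is not handled.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch. */
final class ContentDispositionSketch {
  // Assumed pattern: filename=value or filename="value", case-insensitive.
  private static final Pattern FILENAME =
      Pattern.compile("\\bfilename\\s*=\\s*\"?([^\";]+)\"?", Pattern.CASE_INSENSITIVE);

  /** Returns the file name from a Content-Disposition value, if one is present. */
  static Optional<String> fileName(String contentDisposition) {
    if (contentDisposition == null) {
      return Optional.empty();
    }
    Matcher m = FILENAME.matcher(contentDisposition);
    if (!m.find()) {
      return Optional.empty();
    }
    String name = m.group(1).trim();
    return name.isEmpty() ? Optional.empty() : Optional.of(name);
  }

  public static void main(String[] args) {
    System.out.println(fileName("attachment; filename=\"site-map.html\""));
    System.out.println(fileName("inline"));
  }
}
```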

      The following patch removes the title field before adding a new one, which has resolved the issue for me:

      --- MoreIndexingFilter.old 2011-09-30 11:44:35.000000000 +0000
      +++ MoreIndexingFilter.java 2011-09-30 09:58:48.000000000 +0000
      @@ -276,6 +276,7 @@
           for (int i=0; i<patterns.length; i++) {
             if (matcher.contains(contentDisposition,patterns[i])) {
               result = matcher.getMatch();
      +        doc.removeField("title");
               doc.add("title", result.group(1));
               break;
             }

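
      The difference between adding and resetting can be sketched in plain Java. This is only an illustration under stated assumptions: the nested Doc class is a hypothetical stand-in for a multi-valued index document, not Nutch's NutchDocument, and the method names addTitle and resetTitle are chosen here for clarity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TitleResetSketch {
  /** Hypothetical multi-valued document; an assumption for this sketch. */
  static final class Doc {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    /** Appends a value and keeps any values already stored for the field. */
    void add(String name, String value) {
      fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Drops every value stored under the field name. */
    void removeField(String name) {
      fields.remove(name);
    }

    List<String> getValues(String name) {
      return fields.getOrDefault(name, Collections.emptyList());
    }
  }

  /** Behaviour reported in this issue: the file name is appended as a second title. */
  static void addTitle(Doc doc, String fileName) {
    doc.add("title", fileName);
  }

  /** Behaviour after the proposed patch: existing titles are removed first. */
  static void resetTitle(Doc doc, String fileName) {
    doc.removeField("title");
    doc.add("title", fileName);
  }

  public static void main(String[] args) {
    Doc unpatched = new Doc();
    unpatched.add("title", "Site Map");
    addTitle(unpatched, "site-map.html");
    System.out.println(unpatched.getValues("title").size()); // two values

    Doc patched = new Doc();
    patched.add("title", "Site Map");
    resetTitle(patched, "site-map.html");
    System.out.println(patched.getValues("title").size()); // one value
  }
}
```

      With a single-valued title field in the schema, only the second variant yields a document the indexer can accept.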
      Attachments

      1. MoreIndexingFilter.093011.patch (0.4 kB, Joe Liedtke)
      2. 0001-NUTCH-1140-2.x.patch (1 kB, kaveh minooie)
      3. 0001-NUTCH-1140-trunk.patch (1 kB, kaveh minooie)
      4. NUTCH-1140-trunk-v2.patch (2 kB, Sebastian Nagel)

          Activity

          Lewis John McGibbney made changes -
          Fix Version/s 1.11 [ 12329358 ]
          Fix Version/s 1.10 [ 12327187 ]
          Hudson added a comment -

          SUCCESS: Integrated in Nutch-trunk #2923 (See https://builds.apache.org/job/Nutch-trunk/2923/)
          NUTCH-1140 index-more plugin, resetTitle creates multiple values in title field (snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1650181)

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
          • /nutch/trunk/src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
          Sebastian Nagel added a comment -

          Committed to trunk, r1650181.

          Lewis John McGibbney added a comment -

          +1 Seb, can you commit for 1.X and I will port to 2.X?

          Sebastian Nagel made changes -
          Attachment NUTCH-1140-trunk-v2.patch [ 12690560 ]
          Sebastian Nagel added a comment -

          +1 (extended patch to include a test)

          Lewis John McGibbney added a comment -

          Any issues with committing this fix? I've just run into this issue as well and the most recent patches and comments as suggested by numerous people on this thread solve the issue without hacking the schema in such a way as to have multi-valued titles for a document... which is illogical.

          kaveh minooie made changes -
          Attachment 0001-NUTCH-1140-2.x.patch [ 12680628 ]
          Attachment 0001-NUTCH-1140-trunk.patch [ 12680629 ]
          kaveh minooie added a comment -

          Sorry, there was a typo in both of the patch files.

          kaveh minooie made changes -
          Attachment 0001-NUTCH-1140-trunk.patch [ 12680230 ]
          kaveh minooie made changes -
          Attachment 0001-NUTCH-1140-2.x.patch [ 12680229 ]
          kaveh minooie made changes -
          Attachment 0001-NUTCH-1140-2.x.patch [ 12680229 ]
          Attachment 0001-NUTCH-1140-trunk.patch [ 12680230 ]
          kaveh minooie added a comment -

          So this is still an issue. Here is a sample list of URLs in the wild that would trigger this problem:

          http://www.10-s.com/site/tennis-supply/site-map.html
          http://www.bigappleherp.com/site/content/big_apple_cares.html
          http://www.bigappleherp.com/site/content/CareSheets.html
          http://www.bigappleherp.com/site/content/company_information.html
          http://www.bigappleherp.com/site/content/customer_service.html
          http://www.bigappleherp.com/site/content/LiveAnimals.html
          http://www.bigappleherp.com/site/content/testimonials_02.html
          http://www.magellangps.com/lp/truckfamily/screens.html

          Now, based on a bit of reading that I did on Content-Disposition, it is a reasonable alternative way of determining a title, which would mostly be just the file name. But it should NOT override the actual title if one exists, as the information in the title is far more valuable than the file name. Not to mention that the title is the actual title and should not be replaced if some other value exists.
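
          The fallback behaviour argued for in this comment could look roughly like the following Java sketch. It is an illustration only: the document is modelled as a plain map of field name to values, which is an assumption rather than Nutch's document class, and applyFallbackTitle is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TitleFallbackSketch {
  /**
   * Uses the Content-Disposition file name as the title only when the
   * document has no non-blank title yet; an existing title is left alone.
   */
  static void applyFallbackTitle(Map<String, List<String>> doc, String fileName) {
    if (fileName == null || fileName.trim().isEmpty()) {
      return; // nothing usable to fall back to
    }
    List<String> titles = doc.get("title");
    boolean hasTitle = titles != null
        && titles.stream().anyMatch(t -> t != null && !t.trim().isEmpty());
    if (hasTitle) {
      return; // keep the real title
    }
    List<String> replacement = new ArrayList<>();
    replacement.add(fileName.trim());
    doc.put("title", replacement); // replaces missing or blank titles with one value
  }

  public static void main(String[] args) {
    Map<String, List<String>> titled = new HashMap<>();
    titled.put("title", new ArrayList<>(List.of("Care Sheets")));
    applyFallbackTitle(titled, "CareSheets.html");
    System.out.println(titled.get("title"));

    Map<String, List<String>> untitled = new HashMap<>();
    applyFallbackTitle(untitled, "CareSheets.html");
    System.out.println(untitled.get("title"));
  }
}
```

          Either way the document ends up with exactly one title value, so the schema does not need a multi-valued title field.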

          Julien Nioche made changes -
          Fix Version/s 1.10 [ 12327187 ]
          Fix Version/s 1.9 [ 12324611 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.9 [ 12324611 ]
          Fix Version/s 1.8 [ 12324326 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.8 [ 12324326 ]
          Fix Version/s 1.7 [ 12323281 ]
          Lewis John McGibbney added a comment -

          Do we want to integrate this into trunk and 2.x? If so, I can write the trivial test case.

          Lewis John McGibbney made changes -
          Fix Version/s 1.7 [ 12323281 ]
          Fix Version/s 1.6 [ 12319941 ]
          Markus Jelsma made changes -
          Fix Version/s 1.6 [ 12319941 ]
          Fix Version/s 1.5 [ 12318246 ]
          Markus Jelsma added a comment -

          20120304-push-1.6

          Lewis John McGibbney added a comment -

          Hi Joe. This one seems to have slipped under the radar somewhat!
          Can you please attach a patch for 1.5 (trunk)?
          Thank you if this is possible.

          Joe Liedtke added a comment -

          Thanks!

          Markus Jelsma made changes -
          Fix Version/s 1.5 [ 12318246 ]
          Joe Liedtke added a comment -

          True, however the default schema only allows for one title. It seems like the filter should either make this behavior configurable or reset the title. Additionally, since the method is named resetTitle (and not addTitle, appendTitle, or insertYetAnotherTitle) I can only assume that the intent was to reset the title with a new value rather than append a second value.

          The patch for #1004 should help to mitigate the issue (I haven't had a chance to test it yet, but it makes sense that it could keep this from coming up...). However, future plugins could cause this bug to rear its ugly head again. I'd recommend fixing it now to save future headaches. How does that sound?

          Markus Jelsma added a comment -

          Multiple titles are not always bad, but empty titles are. The linked issue already fixes an issue with empty titles. Can you test a 1.4 checkout?

          Markus Jelsma made changes -
          Link This issue duplicates NUTCH-1004 [ NUTCH-1004 ]
          Joe Liedtke made changes -
          Description From the comments in MoreIndexingFilter.java, the index-more plugin is meant to reset the Title field of a document if it contains a Content-Disposition header. The current behavior is to add a Title regardless of whether one exists or not, which can cause issues down the line with the Solr Indexing process [and based on messages in the nutch user list it appears that this is causing some users to mark the title as multi-valued in the schema -- http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8].

          The following patch removes the title field before adding a new one, which has resolved the issue for me:


          --- MoreIndexingFilter.old 2011-09-30 11:44:35.000000000 +0000
          +++ MoreIndexingFilter.java 2011-09-30 09:58:48.000000000 +0000
          @@ -276,6 +276,7 @@
               for (int i=0; i<patterns.length; i++) {
                 if (matcher.contains(contentDisposition,patterns[i])) {
                   result = matcher.getMatch();
          +        doc.removeField("title");
                   doc.add("title", result.group(1));
                   break;
                 }


          Joe Liedtke made changes -
          Patch Info Patch Available [ 10042 ]
          Joe Liedtke made changes -
          Field Original Value New Value
          Attachment MoreIndexingFilter.093011.patch [ 12497199 ]
          Joe Liedtke added a comment -

          Proposed patch

          Joe Liedtke created issue -

            People

            • Assignee: Unassigned
            • Reporter: Joe Liedtke
            • Votes: 1
            • Watchers: 4
