Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
1.3
None
None
Patch Available
Description
According to the comments in MoreIndexingFilter.java, the index-more plugin is meant to reset the title field of a document when the document has a Content-Disposition header. The current behavior is to add a title whether or not one already exists. This can cause problems later in the Solr indexing process, and a thread on the Nutch user list suggests that some users have worked around it by marking the title field as multi-valued in the schema:
http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
The following patch removes the title field before adding a new one, which has resolved the issue for me:
--- MoreIndexingFilter.old 2011-09-30 11:44:35.000000000 +0000
+++ MoreIndexingFilter.java 2011-09-30 09:58:48.000000000 +0000
@@ -276,6 +276,7 @@
for (int i=0; i<patterns.length; i++) {
if (matcher.contains(contentDisposition,patterns[i]))
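The hunk as captured here is cut off before the added line. As a rough illustration of the remove-then-add idea described above, here is a minimal sketch. It uses a hypothetical stand-in class (SketchDocument) and a hypothetical helper (TitleReset.resetTitle), not Nutch's actual document API or the code from the patch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index document; this is NOT Nutch's real class.
class SketchDocument {
    private final Map<String, List<String>> fields = new HashMap<>();

    // Appends a value, so calling this twice makes the field multi-valued.
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Drops every value stored under the given field name.
    void removeField(String name) {
        fields.remove(name);
    }

    // Returns the stored values, or an empty list if the field is absent.
    List<String> getValues(String name) {
        return fields.getOrDefault(name, new ArrayList<>());
    }
}

class TitleReset {
    // Replaces the title rather than appending a second value to it.
    static void resetTitle(SketchDocument doc, String newTitle) {
        doc.removeField("title");
        doc.add("title", newTitle);
    }
}
```

With a plain add, a document that already had a title would end up with two title values, which a single-valued schema field would not accept. Removing the field first keeps it at exactly one value.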
Attachments
Issue Links
- duplicates NUTCH-1004 Do not index empty values for title field (Closed)