[NUTCH-2198] Indexing binary content by index-html causes Solr Exception - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Auto Closed
Affects Version/s: 2.3.1
Fix Version/s: 2.5
Component/s: indexer
Labels: None

Description

(reported by kalanya in ~~NUTCH-2168~~)
Indexing raw binary content with the index-html plugin can cause an exception in Solr:

2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)

The index-html plugin treats any raw content as readable text and converts it to a String using the platform-dependent default charset (cf. Scanner API docs):

HtmlIndexingFilter.java

            Scanner scanner = new Scanner(arrayInputStream);
            scanner.useDelimiter("\\Z");//To read all scanner content in one String
            String data = "";
            if (scanner.hasNext()) {
                data = scanner.next();
            }
            doc.add("rawcontent", StringUtil.cleanField(data));
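One possible mitigation, shown only as a sketch and not as a fix adopted by the project, is to decode the bytes with an explicit charset and drop every code point that XML 1.0 does not allow (U+FFFE from the log above is one of them) before adding the field. The class and method names below (RawContentSketch, stripInvalidXmlChars, decodeRaw) are hypothetical, and the choice of UTF-8 is an assumption:

```java
import java.nio.charset.StandardCharsets;

public class RawContentSketch {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that XML 1.0 does not allow. */
    static String stripInvalidXmlChars(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(RawContentSketch::isValidXmlChar)
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    /**
     * Decodes raw bytes with an explicit charset instead of the platform
     * default (UTF-8 is an assumption), then removes invalid characters.
     * Malformed byte sequences become U+FFFD, which XML permits.
     */
    static String decodeRaw(byte[] raw) {
        return stripInvalidXmlChars(new String(raw, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Made-up bytes resembling the start of a JPEG file.
        byte[] jpegLike = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0,
                           0x00, 0x10, 'J', 'F', 'I', 'F'};
        String cleaned = decodeRaw(jpegLike);
        // Reports whether every remaining code point is a legal XML character.
        System.out.println(cleaned.codePoints().allMatch(RawContentSketch::isValidXmlChar));
    }
}
```

This only keeps the indexing request well-formed. It does not make binary content meaningful as text, so skipping non-textual documents entirely may be the better choice.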

The field "rawcontent" is of type "string":

conf/schema.xml

    <!-- fields for index-html plugin
         Note: although raw document content may be binary,
               index-html adds a String to the index field -->
    <field name="rawcontent" type="string" stored="true" indexed="false"/>

People

Assignee: Unassigned

Reporter: Sebastian Nagel

Votes: 0

Watchers: 2

Dates

Created: 09/Jan/16 13:16

Updated: 13/Oct/19 22:36

Resolved: 13/Oct/19 22:36