  1. Nutch
  2. NUTCH-2198

Indexing binary content by index-html causes Solr Exception

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Auto Closed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.5
    • Component/s: indexer
    • Labels: None

    Description

      (reported by kalanya in NUTCH-2168)
      If raw binary content is indexed with the index-html plugin, Solr may reject the document with an exception:

      2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
      2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
      2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
      2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
      2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
      java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
      

      The index-html plugin treats any raw content as readable text and converts it to a String using the platform's default charset (cf. the Scanner API docs):

      HtmlIndexingFilter.java
                  Scanner scanner = new Scanner(arrayInputStream);
                  scanner.useDelimiter("\\Z");//To read all scanner content in one String
                  String data = "";
                  if (scanner.hasNext()) {
                      data = scanner.next();
                  }
                  doc.add("rawcontent", StringUtil.cleanField(data));
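
One way to avoid handing binary data to the indexer is to decode the bytes with an explicit charset and reject input that is not valid in it. The following is a minimal sketch under that assumption, not the fix applied in Nutch; the class `StrictRawContentDecoder` and the method `decodeStrict` are hypothetical names. It catches only malformed byte sequences, so well-formed input that contains Unicode noncharacters still passes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper for illustration; not part of Nutch.
class StrictRawContentDecoder {

    /**
     * Decodes raw bytes with an explicit charset. Returns an empty Optional
     * if the bytes are not a valid sequence in that charset, which is
     * typically the case for binary content such as JPEG images.
     */
    static Optional<String> decodeStrict(byte[] raw, Charset charset) {
        try {
            String text = charset.newDecoder()
                    // Fail instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
            return Optional.of(text);
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>text</body></html>".getBytes(StandardCharsets.UTF_8);
        // First bytes of a typical JPEG file; 0xFF is never valid in UTF-8.
        byte[] jpegHeader = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};

        System.out.println(decodeStrict(html, StandardCharsets.UTF_8).isPresent());       // true
        System.out.println(decodeStrict(jpegHeader, StandardCharsets.UTF_8).isPresent()); // false
    }
}
```

A caller could add the "rawcontent" field only when the Optional is present. Choosing the charset (for example from the detected page encoding) is left open here.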
      

      The field "rawcontent" is of type "string":

      conf/schema.xml
          <!-- fields for index-html plugin
               Note: although raw document content may be binary,
                     index-html adds a String to the index field -->
          <field name="rawcontent" type="string" stored="true" indexed="false"/>
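
The code point named in the exception, U+FFFE, is a Unicode noncharacter and falls outside the character range permitted by XML 1.0. Assuming the documents reach Solr as an XML update request, a string field value could be filtered before it is added. The sketch below illustrates that idea only; `XmlCharFilter` and its methods are hypothetical names and are not taken from Nutch or SolrJ.

```java
// Hypothetical helper for illustration; not part of Nutch or SolrJ.
class XmlCharFilter {

    /** Returns true if the code point matches the XML 1.0 "Char" production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that XML 1.0 does not allow, e.g. U+FFFE and U+0000. */
    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(XmlCharFilter::isXmlChar)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "ab\uFFFEcd\u0000ef";
        System.out.println(stripInvalidXmlChars(dirty)); // abcdef
    }
}
```

Filtering keeps the request well-formed but still stores meaningless text for binary documents, so skipping such documents may be the better choice.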
      

          People

            Assignee: Unassigned
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 2