ManifoldCF / CONNECTORS-675

MCF-ES fails to escape json correctly

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: ManifoldCF 1.2
    • Fix Version/s: ManifoldCF 1.2
    • None

    Description

      When crawling a filesystem into Elasticsearch, the generated JSON contains invalid UTF-8 sequences. This causes Elasticsearch to fail the index operation.

      Stacktrace from elasticsearch:

      [2013-04-19 13:17:38,952][DEBUG][action.index] [Lighting Rod] [eses2][0], node[Ycj8DEZMQFuX7Gn2sSCUXw],
      [P], s[STARTED]: Failed to execute [index 
      {[eses][attachment][file:/C:/indexdir/Lüneburg/somefile],
      source[{"uri" : "C:\\indexdir\\L�neburg\\somefile", 
      "allow_token_document" :
      "__nosecurity__","deny_token_document" : "__nosecurity__","allow_token_share" : "__nosecurity__","deny_token_share" :
      "__nosecurity__","type" : "attachment","_name" : "collection.pickle","file" : "KGRwMQp.....
      
      org.elasticsearch.index.mapper.MapperParsingException: failed to parse [uri]
      at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:395)
      at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:599)
      at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:467)
      at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:506)
      at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:450)
      at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:326)
      at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:203)
      at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
      at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:722)
      Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xfc
      at [Source: [B@56c77e95; line: 1, column: 254]
      at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1378)
      at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:599)
      at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3008)
      at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:3002)
      at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2165)
      at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2092)
      at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:275)
      at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:85)
      at org.elasticsearch.common.xcontent.support.AbstractXContentParser.textOrNull(AbstractXContentParser.java:107)
      at org.elasticsearch.index.mapper.core.StringFieldMapper.parseCreateField(StringFieldMapper.java:286)
      at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:384)
      ... 11 more
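
      The "Invalid UTF-8 start byte 0xfc" in the trace is consistent with 'ü' having been written as a single ISO-8859-1 byte instead of as a UTF-8 sequence. The sketch below only illustrates that difference; it is not taken from the connector, and the class and method names (EncodingSketch, hex) are made up for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not ManifoldCF code.
class EncodingSketch {
    /** Returns the bytes as space-separated lowercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String umlaut = "\u00fc"; // the character 'ü'
        // One byte in ISO-8859-1; 0xfc is not a valid first byte in UTF-8.
        System.out.println(hex(umlaut.getBytes(StandardCharsets.ISO_8859_1))); // fc
        // Two bytes in UTF-8, which is what a UTF-8 JSON parser expects.
        System.out.println(hex(umlaut.getBytes(StandardCharsets.UTF_8)));      // c3 bc
    }
}
```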
      

      In this case the offending character is a German umlaut, 'ü', but since ElasticSearchIndex#jsonStringEscape() does little more than escape backslashes, I assume this affects a wider range of special characters and encodings.
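
      For comparison, here is a minimal sketch of a fuller JSON string escape. It is not the fix that went into ManifoldCF, and JsonEscapeSketch and escape are hypothetical names. It assumes the simplest defensive approach: escape quotes, backslashes and control characters as JSON requires, and additionally write every non-ASCII character as a \u escape, so the output is pure ASCII and decodes the same way whether the bytes are later treated as ISO-8859-1 or UTF-8.

```java
// Hypothetical sketch, not the ManifoldCF implementation.
final class JsonEscapeSketch {
    /** Escapes a value for placement between double quotes in a JSON document. */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20 || c > 0x7e) {
                        // Control and non-ASCII characters become a four-digit
                        // hex escape; surrogate halves are escaped one by one,
                        // which JSON accepts as a pair.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The path from the failing request, with the umlaut intact.
        System.out.println(escape("C:\\indexdir\\L\u00fcneburg\\somefile"));
    }
}
```

      With this approach the umlaut in the path above is emitted as a six-character ASCII escape rather than as a raw byte, so the parser never sees 0xfc. Writing the JSON with an explicit UTF-8 encoding would be an alternative that leaves non-ASCII characters unescaped.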

          People

            Assignee: kwright@metacarta.com Karl Wright
            Reporter: konradkonrad konrad