Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
ManifoldCF 1.2
-
None
-
MCF 1.2-SNAPSHOT running on Win2008R2.
java version "1.7.0_15"
Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
----------------
elasticsearch 0.90.0rc2 on ubuntu 12.10
java version "1.7.0_15"
Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
-----------------
Repository Connection: FileSystem
Output Connection: ElasticSearch
Description
When crawling a file system into Elasticsearch, the generated JSON contains invalid UTF-8 sequences. This causes Elasticsearch to fail the index operation.
Stacktrace from elasticsearch:
[2013-04-19 13:17:38,952][DEBUG][action.index] [Lighting Rod] [eses2][0], node[Ycj8DEZMQFuX7Gn2sSCUXw], [P], s[STARTED]: Failed to execute [index {[eses][attachment][file:/C:/indexdir/Lüneburg/somefile], source[{"uri" : "C:\\indexdir\\L�neburg\\somefile", "allow_token_document" : "__nosecurity__","deny_token_document" : "__nosecurity__","allow_token_share" : "__nosecurity__","deny_token_share" : "__nosecurity__","type" : "attachment","_name" : "collection.pickle","file" : "KGRwMQp.....
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [uri]
    at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:395)
    at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:599)
    at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:467)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:506)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:450)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:326)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:203)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xfc
 at [Source: [B@56c77e95; line: 1, column: 254]
    at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1378)
    at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:599)
    at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3008)
    at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:3002)
    at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2165)
    at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2092)
    at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:275)
    at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:85)
    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.textOrNull(AbstractXContentParser.java:107)
    at org.elasticsearch.index.mapper.core.StringFieldMapper.parseCreateField(StringFieldMapper.java:286)
    at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:384)
    ... 11 more
In this case the offending character is the German umlaut 'ü', but since ElasticSearchIndex#jsonStringEscape() does little more than escape backslashes, I assume the problem affects a much wider range of special characters.
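The byte 0xfc in the parse error is what 'ü' looks like in a single-byte charset such as ISO-8859-1 or windows-1252; in UTF-8 the same character is the two bytes 0xc3 0xbc. One plausible reading is therefore that the request body was encoded with something other than UTF-8. The following is a minimal sketch of the two things a fix would likely need, fuller JSON string escaping and an explicit UTF-8 encoding step. It is not the actual ManifoldCF patch; the class JsonEscapeSketch and its methods escape() and toUtf8() are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ManifoldCF.
final class JsonEscapeSketch {

    /** Escapes a value so it can sit between double quotes in a JSON document. */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be written as a
                        // backslash, the letter u, and four hex digits.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        // Non-ASCII characters such as the umlaut pass through
                        // unchanged; they are valid in JSON as long as the
                        // bytes on the wire are UTF-8.
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Encodes the finished JSON text with an explicit charset, never the platform default. */
    static byte[] toUtf8(String json) {
        return json.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String path = "C:\\indexdir\\L\u00fcneburg\\somefile";
        String json = "{\"uri\" : \"" + escape(path) + "\"}";

        // Assumption for illustration: a single-byte encoding stands in for
        // whatever charset the failing request was actually sent with.
        byte[] singleByte = json.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(json);

        System.out.println(json);
        System.out.println("ISO-8859-1 length: " + singleByte.length);
        System.out.println("UTF-8 length:      " + utf8.length);
    }
}
```

The UTF-8 array comes out one byte longer than the single-byte one, because the umlaut takes two bytes. Passing the charset explicitly to getBytes() matters on a Windows host like the one described above, where the platform default is typically not UTF-8.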