Flume
  1. Flume
  2. FLUME-2089

ElasticsearchSink blocks and raises exceptions when event body has unexpected encoding

    Details

    • Type: Bug Bug
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: v1.4.0, v1.3.1
    • Fix Version/s: None
    • Component/s: Sinks+Sources
    • Labels:
      None
    • Release Note:
      ElasticsearchSink now handles event bodies with unexpected encodings or parse failures by storing them as simple fields

      Description

      Detected by Allan Feid and documented on the user list http://mail-archives.apache.org/mod_mbox/flume-user/201306.mbox/%3CCAN94UWe6UvcOKT1S%2BXANC-sy0qFsxet3RJY9PVkj-eSfO5fk6Q%40mail.gmail.com%3E

      Steps:
      Send an event with the body as follows:
      foo¤data¤1371126476.436¤0.005¤555¤10.1.1.1¤HTTP/1.1¤GET¤http¤vhost¤/path/url¤¤-¤200¤
      referrer.com/search/?query=\x8D\x91\x89\xEF\x8Bc\x8E\x96\x93\xB0¤¤¤-

      Expected Results:
      The event is stored in elasticsearch.

      Actual Results:
      >> 10 Jun 2013 09:52:34,360 ERROR
      >> [SinkRunner-PollingRunner-DefaultSinkProcessor]
      >> (org.apache.flume.SinkRunner$PollingRunner.run:160) - Unable to deliver
      >> event. Exception follows.
      >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.error.YAMLException:
      >> java.io.CharConversionException: Invalid UTF-8 start byte 0xfc (at char
      >> #81, byte #-1)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.reader.StreamReader.update(StreamReader.java:198)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.reader.StreamReader.<init>(StreamReader.java:62)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.YAMLParser.<init>(YAMLParser.java:147)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.YAMLFactory._createParser(YAMLFactory.java:530)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.YAMLFactory.createJsonParser(YAMLFactory.java:420)
      >> at
      >> org.elasticsearch.common.xcontent.yaml.YamlXContent.createParser(YamlXContent.java:83)
      >> at
      >> org.apache.flume.sink.elasticsearch.ContentBuilderUtil.addComplexField(ContentBuilderUtil.java:61)
      >> at
      >> org.apache.flume.sink.elasticsearch.ContentBuilderUtil.appendField(ContentBuilderUtil.java:47)
      >> at
      >> org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer.appendBody(ElasticSearchLogStashEventSerializer.java:87)
      >> at
      >> org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer.getContentBuilder(ElasticSearchLogStashEventSerializer.java:79)
      >> at
      >> org.apache.flume.sink.elasticsearch.ElasticSearchSink.process(ElasticSearchSink.java:178)
      >> at
      >> org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
      >> at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
      >> at java.lang.Thread.run(Thread.java:662)
      >> Caused by: java.io.CharConversionException: Invalid UTF-8 start byte 0xfc
      >> (at char #81, byte #-1)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.UTF8Reader.reportInvalidInitial(UTF8Reader.java:395)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.UTF8Reader.read(UTF8Reader.java:247)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.UTF8Reader.read(UTF8Reader.java:157)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.reader.StreamReader.update(StreamReader.java:182)
      >> ... 13 more

      1. flume-2089.diff
        3 kB
        Edward Sargisson

        Issue Links

          Activity

          Edward Sargisson created issue -
          Hide
          Allan Feid added a comment - - edited
          diff --git a/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml b/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml
          index 1632c23..33c296e 100644
          --- a/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml
          +++ b/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml
          @@ -54,6 +54,7 @@
                   <groupId>org.elasticsearch</groupId>
                   <artifactId>elasticsearch</artifactId>
                   <optional>true</optional>
          +        <version>0.90.1</version>
               </dependency>
          
               <dependency>
          diff --git a/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java b/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java
          index bf7c57c..a00aa18 100644
          --- a/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java
          +++ b/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java
          @@ -23,6 +23,7 @@ import java.io.IOException;
           import java.nio.charset.Charset;
          
           import org.elasticsearch.common.jackson.core.JsonParseException;
          +import org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.error.YAMLException;
           import org.elasticsearch.common.xcontent.XContentBuilder;
           import org.elasticsearch.common.xcontent.XContentFactory;
           import org.elasticsearch.common.xcontent.XContentParser;
          @@ -67,6 +68,9 @@ public class ContentBuilderUtil {
                 // can't be figured out in the body. At this point just push it through
                 // as is, we have already added the field so don't do it again
                 addSimpleField(builder, fieldName, data);
          +    } catch (YAMLException ex) {
          +      // Same as above, except for YAML based parsing problems
          +      addSimpleField(builder, fieldName, data);
               } finally {
                 if (parser != null) {
                   parser.close();
          
          Show
          Allan Feid added a comment - - edited diff --git a/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml b/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml index 1632c23..33c296e 100644 --- a/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml +++ b/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml @@ -54,6 +54,7 @@ <groupId>org.elasticsearch</groupId> <artifactId>elasticsearch</artifactId> <optional>true</optional> + <version>0.90.1</version> </dependency> <dependency> diff --git a/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java b/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java index bf7c57c..a00aa18 100644 --- a/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java +++ b/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java @@ -23,6 +23,7 @@ import java.io.IOException; import java.nio.charset.Charset; import org.elasticsearch.common.jackson.core.JsonParseException; +import org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.error.YAMLException; import org.elasticsearch.common.xcontent.XContentBuilder; import org.elasticsearch.common.xcontent.XContentFactory; import org.elasticsearch.common.xcontent.XContentParser; @@ -67,6 +68,9 @@ public class ContentBuilderUtil { // can't be figured out in the body. At this point just push it through // as is, we have already added the field so don't do it again addSimpleField(builder, fieldName, data); + } catch (YAMLException ex) { + // Same as above, except for YAML based parsing problems + addSimpleField(builder, fieldName, data); } finally { if (parser != null) { parser.close();
          Allan Feid made changes -
          Field Original Value New Value
          Status Open [ 1 ] Patch Available [ 10002 ]
          Release Note catch YAML parsing errors when adding data to elasticsearch
          Affects Version/s v1.4.0 [ 12323372 ]
          Hide
          Edward Sargisson added a comment -

          Hi Allan,
          Thank you for the patch - I'm really glad we've got it.

          A question for you: is this not mostly a character set encoding issue? You would know from your data - which is why I'm asking.
          i.e. ContentBuilderUtil in addSimpleField() uses the platform's default character set - which may not be appropriate for all data.
          I'm wondering if we need a parameter to the ElasticSearchSink to allow this to be set.

          Your thoughts?

          Secondly, a quibble. The es 0.90.1 upgrade is in FLUME-2049.

          Lastly, once you're happy, perhaps you could put your patch into a review board at reviews.apache.org and set the Group to Flume. That will progress it along.

          Thanks again!

          Show
          Edward Sargisson added a comment - Hi Allan, Thank you for the patch - I'm really glad we've got it. A question for you: is this not mostly a character set encoding issue? You would know from your data - which is why I'm asking. i.e. ContentBuilderUtil in addSimpleField() uses the platform's default character set - which may not be appropriate for all data. I'm wondering if we need a parameter to the ElasticSearchSink to allow this to be set. Your thoughts? Secondly, a quibble. The es 0.90.1 upgrade is in FLUME-2049 . Lastly, once you're happy, perhaps you could put your patch into a review board at reviews.apache.org and set the Group to Flume. That will progress it along. Thanks again!
          Hide
          Allan Feid added a comment -

          This probably has something to do with character encoding, but as far as I know all logs are utf-8 encoded. The problem is that there are 100s of log sources which generate logs based on web traffic, among other things. I can't absolutely be sure there isn't some random encoded characters floating around. It is interesting that the xContentType method in ES detects some of my events as YAML though, I can be fairly confident there's no YAML in my log messages.

          I'll look into adding a patch on reviews.apache.org, and be sure to remove the 0.90.1 upgrade.

          Show
          Allan Feid added a comment - This probably has something to do with character encoding, but as far as I know all logs are utf-8 encoded. The problem is that there are 100s of log sources which generate logs based on web traffic, among other things. I can't absolutely be sure there isn't some random encoded characters floating around. It is interesting that the xContentType method in ES detects some of my events as YAML though, I can be fairly confident there's no YAML in my log messages. I'll look into adding a patch on reviews.apache.org, and be sure to remove the 0.90.1 upgrade.
          Hide
          Edward Sargisson added a comment -

          Allan, one more thing. Would you be able to add a unit test for this new functionality?

          Show
          Edward Sargisson added a comment - Allan, one more thing. Would you be able to add a unit test for this new functionality?
          Hide
          Mike Percy added a comment -

          Yes, please attach the patch to the JIRA and add a unit test. I'd like to pull this into Flume 1.4, we will cut an RC very soon. Thanks Allan!

          Show
          Mike Percy added a comment - Yes, please attach the patch to the JIRA and add a unit test. I'd like to pull this into Flume 1.4, we will cut an RC very soon. Thanks Allan!
          Mike Percy made changes -
          Fix Version/s v1.4.0 [ 12323372 ]
          Hide
          Mike Percy added a comment -

          Sorry gents, looks like this one is going to miss the 1.4.0 boat.

          Show
          Mike Percy added a comment - Sorry gents, looks like this one is going to miss the 1.4.0 boat.
          Mike Percy made changes -
          Fix Version/s v1.4.0 [ 12323372 ]
          Edward Sargisson made changes -
          Summary ElasticSearchSink raises YAMLException when event body has unexpected encoding. ElasticsearchSink blocks raises exceptions when event body has unexpected encoding
          Edward Sargisson made changes -
          Description Detected by Allan Feid and documented on the user list http://mail-archives.apache.org/mod_mbox/flume-user/201306.mbox/%3CCAN94UWe6UvcOKT1S%2BXANC-sy0qFsxet3RJY9PVkj-eSfO5fk6Q%40mail.gmail.com%3E

          Steps:
          Send an event with the body as follows:
          foo¤data¤1371126476.436¤0.005¤555¤10.1.1.1¤HTTP/1.1¤GET¤http¤vhost¤/path/url¤¤-¤200¤
          referrer.com/search/?query=\x8D\x91\x89\xEF\x8Bc\x8E\x96\x93\xB0¤-¤-¤-

          Expected Results:
          The event is stored in elasticsearch.

          Actual Results:
          >> 10 Jun 2013 09:52:34,360 ERROR
          >> [SinkRunner-PollingRunner-DefaultSinkProcessor]
          >> (org.apache.flume.SinkRunner$PollingRunner.run:160) - Unable to deliver
          >> event. Exception follows.
          >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.error.YAMLException:
          >> java.io.CharConversionException: Invalid UTF-8 start byte 0xfc (at char
          >> #81, byte #-1)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.reader.StreamReader.update(StreamReader.java:198)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.reader.StreamReader.<init>(StreamReader.java:62)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.YAMLParser.<init>(YAMLParser.java:147)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.YAMLFactory._createParser(YAMLFactory.java:530)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.YAMLFactory.createJsonParser(YAMLFactory.java:420)
          >> at
          >> org.elasticsearch.common.xcontent.yaml.YamlXContent.createParser(YamlXContent.java:83)
          >> at
          >> org.apache.flume.sink.elasticsearch.ContentBuilderUtil.addComplexField(ContentBuilderUtil.java:61)
          >> at
          >> org.apache.flume.sink.elasticsearch.ContentBuilderUtil.appendField(ContentBuilderUtil.java:47)
          >> at
          >> org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer.appendBody(ElasticSearchLogStashEventSerializer.java:87)
          >> at
          >> org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer.getContentBuilder(ElasticSearchLogStashEventSerializer.java:79)
          >> at
          >> org.apache.flume.sink.elasticsearch.ElasticSearchSink.process(ElasticSearchSink.java:178)
          >> at
          >> org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
          >> at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
          >> at java.lang.Thread.run(Thread.java:662)
          >> Caused by: java.io.CharConversionException: Invalid UTF-8 start byte 0xfc
          >> (at char #81, byte #-1)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.UTF8Reader.reportInvalidInitial(UTF8Reader.java:395)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.UTF8Reader.read(UTF8Reader.java:247)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.UTF8Reader.read(UTF8Reader.java:157)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.reader.StreamReader.update(StreamReader.java:182)
          >> ... 13 more

          Detected by Allan Feid and documented on the user list http://mail-archives.apache.org/mod_mbox/flume-user/201306.mbox/%3CCAN94UWe6UvcOKT1S%2BXANC-sy0qFsxet3RJY9PVkj-eSfO5fk6Q%40mail.gmail.com%3E

          Steps:
          Send an event with the body as follows:
          foo¤data¤1371126476.436¤0.005¤555¤10.1.1.1¤HTTP/1.1¤GET¤http¤vhost¤/path/url¤¤-¤200¤
          referrer.com/search/?query=\x8D\x91\x89\xEF\x8Bc\x8E\x96\x93\xB0¤-¤-¤-

          Expected Results:
          The event is stored in elasticsearch.

          Actual Results:
          >> 10 Jun 2013 09:52:34,360 ERROR
          >> [SinkRunner-PollingRunner-DefaultSinkProcessor]
          >> (org.apache.flume.SinkRunner$PollingRunner.run:160) - Unable to deliver
          >> event. Exception follows.
          >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.error.YAMLException:
          >> java.io.CharConversionException: Invalid UTF-8 start byte 0xfc (at char
          >> #81, byte #-1)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.reader.StreamReader.update(StreamReader.java:198)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.reader.StreamReader.<init>(StreamReader.java:62)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.YAMLParser.<init>(YAMLParser.java:147)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.YAMLFactory._createParser(YAMLFactory.java:530)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.YAMLFactory.createJsonParser(YAMLFactory.java:420)
          >> at
          >> org.elasticsearch.common.xcontent.yaml.YamlXContent.createParser(YamlXContent.java:83)
          >> at
          >> org.apache.flume.sink.elasticsearch.ContentBuilderUtil.addComplexField(ContentBuilderUtil.java:61)
          >> at
          >> org.apache.flume.sink.elasticsearch.ContentBuilderUtil.appendField(ContentBuilderUtil.java:47)
          >> at
          >> org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer.appendBody(ElasticSearchLogStashEventSerializer.java:87)
          >> at
          >> org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer.getContentBuilder(ElasticSearchLogStashEventSerializer.java:79)
          >> at
          >> org.apache.flume.sink.elasticsearch.ElasticSearchSink.process(ElasticSearchSink.java:178)
          >> at
          >> org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
          >> at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
          >> at java.lang.Thread.run(Thread.java:662)
          >> Caused by: java.io.CharConversionException: Invalid UTF-8 start byte 0xfc
          >> (at char #81, byte #-1)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.UTF8Reader.reportInvalidInitial(UTF8Reader.java:395)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.UTF8Reader.read(UTF8Reader.java:247)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.UTF8Reader.read(UTF8Reader.java:157)
          >> at
          >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.reader.StreamReader.update(StreamReader.java:182)
          >> ... 13 more
          Hide
          Edward Sargisson added a comment -


          My colleague and I detected a similar failure which did not raise a YAMLException. The ElasticsearchSink is a little too aggressive in trying to assume everything is JSON. My colleague developed a more general patch which, if JSON parsing fails, catches Exception and writes the field as a simple field.

          Show
          Edward Sargisson added a comment - My colleague and I detected a similar failure which did not raise a YAMLException. The ElasticsearchSink is a little too aggressive in trying to assume everything is JSON. My colleague developed a more general patch which, if JSON parsing fails, catches Exception and writes the field as a simple field.
          Edward Sargisson made changes -
          Attachment flume-2089.diff [ 12597576 ]
          Edward Sargisson made changes -
          Summary ElasticsearchSink blocks raises exceptions when event body has unexpected encoding ElasticsearchSink blocks and raises exceptions when event body has unexpected encoding
          Edward Sargisson made changes -
          Remote Link This issue links to "Review Board (Web Link)" [ 12514 ]
          Edward Sargisson made changes -
          Release Note catch YAML parsing errors when adding data to elasticsearch ElasticsearchSink now handles event bodies with unexpected encodings or parse failures by storing them as simple fields
          Ashish Paliwal made changes -
          Assignee Ashish Paliwal [ paliwalashish ]
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Patch Available Patch Available
          3d 22h 59m 1 Allan Feid 18/Jun/13 16:03

            People

            • Assignee:
              Ashish Paliwal
              Reporter:
              Edward Sargisson
            • Votes:
              2 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:

                Development