Flume
  1. Flume
  2. FLUME-2089

ElasticsearchSink blocks and raises exceptions when event body has unexpected encoding

    Details

    • Type: Bug Bug
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: v1.4.0, v1.3.1
    • Fix Version/s: None
    • Component/s: Sinks+Sources
    • Labels:
      None
    • Release Note:
      ElasticsearchSink now handles event bodies with unexpected encodings or parse failures by storing them as simple fields

      Description

      Detected by Allan Feid and documented on the user list http://mail-archives.apache.org/mod_mbox/flume-user/201306.mbox/%3CCAN94UWe6UvcOKT1S%2BXANC-sy0qFsxet3RJY9PVkj-eSfO5fk6Q%40mail.gmail.com%3E

      Steps:
      Send an event with the body as follows:
      foo¤data¤1371126476.436¤0.005¤555¤10.1.1.1¤HTTP/1.1¤GET¤http¤vhost¤/path/url¤¤-¤200¤
      referrer.com/search/?query=\x8D\x91\x89\xEF\x8Bc\x8E\x96\x93\xB0¤¤¤-

      Expected Results:
      The event is stored in elasticsearch.

      Actual Results:
      >> 10 Jun 2013 09:52:34,360 ERROR
      >> [SinkRunner-PollingRunner-DefaultSinkProcessor]
      >> (org.apache.flume.SinkRunner$PollingRunner.run:160) - Unable to deliver
      >> event. Exception follows.
      >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.error.YAMLException:
      >> java.io.CharConversionException: Invalid UTF-8 start byte 0xfc (at char
      >> #81, byte #-1)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.reader.StreamReader.update(StreamReader.java:198)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.reader.StreamReader.<init>(StreamReader.java:62)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.YAMLParser.<init>(YAMLParser.java:147)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.YAMLFactory._createParser(YAMLFactory.java:530)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.YAMLFactory.createJsonParser(YAMLFactory.java:420)
      >> at
      >> org.elasticsearch.common.xcontent.yaml.YamlXContent.createParser(YamlXContent.java:83)
      >> at
      >> org.apache.flume.sink.elasticsearch.ContentBuilderUtil.addComplexField(ContentBuilderUtil.java:61)
      >> at
      >> org.apache.flume.sink.elasticsearch.ContentBuilderUtil.appendField(ContentBuilderUtil.java:47)
      >> at
      >> org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer.appendBody(ElasticSearchLogStashEventSerializer.java:87)
      >> at
      >> org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer.getContentBuilder(ElasticSearchLogStashEventSerializer.java:79)
      >> at
      >> org.apache.flume.sink.elasticsearch.ElasticSearchSink.process(ElasticSearchSink.java:178)
      >> at
      >> org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
      >> at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
      >> at java.lang.Thread.run(Thread.java:662)
      >> Caused by: java.io.CharConversionException: Invalid UTF-8 start byte 0xfc
      >> (at char #81, byte #-1)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.UTF8Reader.reportInvalidInitial(UTF8Reader.java:395)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.UTF8Reader.read(UTF8Reader.java:247)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.UTF8Reader.read(UTF8Reader.java:157)
      >> at
      >> org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.reader.StreamReader.update(StreamReader.java:182)
      >> ... 13 more

      1. flume-2089.diff
        3 kB
        Edward Sargisson

        Issue Links

          Activity

          Hide
          Allan Feid added a comment - - edited
          diff --git a/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml b/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml
          index 1632c23..33c296e 100644
          --- a/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml
          +++ b/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml
          @@ -54,6 +54,7 @@
                   <groupId>org.elasticsearch</groupId>
                   <artifactId>elasticsearch</artifactId>
                   <optional>true</optional>
          +        <version>0.90.1</version>
               </dependency>
          
               <dependency>
          diff --git a/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java b/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java
          index bf7c57c..a00aa18 100644
          --- a/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java
          +++ b/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java
          @@ -23,6 +23,7 @@ import java.io.IOException;
           import java.nio.charset.Charset;
          
           import org.elasticsearch.common.jackson.core.JsonParseException;
          +import org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.error.YAMLException;
           import org.elasticsearch.common.xcontent.XContentBuilder;
           import org.elasticsearch.common.xcontent.XContentFactory;
           import org.elasticsearch.common.xcontent.XContentParser;
          @@ -67,6 +68,9 @@ public class ContentBuilderUtil {
                 // can't be figured out in the body. At this point just push it through
                 // as is, we have already added the field so don't do it again
                 addSimpleField(builder, fieldName, data);
          +    } catch (YAMLException ex) {
          +      // Same as above, except for YAML based parsing problems
          +      addSimpleField(builder, fieldName, data);
               } finally {
                 if (parser != null) {
                   parser.close();
          
          Show
          Allan Feid added a comment - - edited diff --git a/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml b/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml index 1632c23..33c296e 100644 --- a/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml +++ b/flume-ng-sinks/flume-ng-elasticsearch-sink/pom.xml @@ -54,6 +54,7 @@ <groupId>org.elasticsearch</groupId> <artifactId>elasticsearch</artifactId> <optional>true</optional> + <version>0.90.1</version> </dependency> <dependency> diff --git a/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java b/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java index bf7c57c..a00aa18 100644 --- a/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java +++ b/flume-ng-sinks/flume-ng-elasticsearch-sink/src/main/java/org/apache/flume/sink/elasticsearch/ContentBuilderUtil.java @@ -23,6 +23,7 @@ import java.io.IOException; import java.nio.charset.Charset; import org.elasticsearch.common.jackson.core.JsonParseException; +import org.elasticsearch.common.jackson.dataformat.yaml.snakeyaml.error.YAMLException; import org.elasticsearch.common.xcontent.XContentBuilder; import org.elasticsearch.common.xcontent.XContentFactory; import org.elasticsearch.common.xcontent.XContentParser; @@ -67,6 +68,9 @@ public class ContentBuilderUtil { // can't be figured out in the body. At this point just push it through // as is, we have already added the field so don't do it again addSimpleField(builder, fieldName, data); + } catch (YAMLException ex) { + // Same as above, except for YAML based parsing problems + addSimpleField(builder, fieldName, data); } finally { if (parser != null) { parser.close();
          Hide
          Edward Sargisson added a comment -

          Hi Allan,
          Thank you for the patch - I'm really glad we've got it.

          A question for you: is this not mostly a character set encoding issue? You would know from your data - which is why I'm asking.
          i.e. ContentBuilderUtil in addSimpleField() uses the platform's default character set - which may not be appropriate for all data.
          I'm wondering if we need a parameter to the ElasticSearchSink to allow this to be set.

          Your thoughts?

          Secondly, a quibble. The es 0.90.1 upgrade is in FLUME-2049.

          Lastly, once you're happy, perhaps you could put your patch into a review board at reviews.apache.org and set the Group to Flume. That will progress it along.

          Thanks again!

          Show
          Edward Sargisson added a comment - Hi Allan, Thank you for the patch - I'm really glad we've got it. A question for you: is this not mostly a character set encoding issue? You would know from your data - which is why I'm asking. i.e. ContentBuilderUtil in addSimpleField() uses the platform's default character set - which may not be appropriate for all data. I'm wondering if we need a parameter to the ElasticSearchSink to allow this to be set. Your thoughts? Secondly, a quibble. The es 0.90.1 upgrade is in FLUME-2049 . Lastly, once you're happy, perhaps you could put your patch into a review board at reviews.apache.org and set the Group to Flume. That will progress it along. Thanks again!
          Hide
          Allan Feid added a comment -

          This probably has something to do with character encoding, but as far as I know all logs are utf-8 encoded. The problem is that there are 100s of log sources which generate logs based on web traffic, among other things. I can't absolutely be sure there isn't some random encoded characters floating around. It is interesting that the xContentType method in ES detects some of my events as YAML though, I can be fairly confident there's no YAML in my log messages.

          I'll look into adding a patch on reviews.apache.org, and be sure to remove the 0.90.1 upgrade.

          Show
          Allan Feid added a comment - This probably has something to do with character encoding, but as far as I know all logs are utf-8 encoded. The problem is that there are 100s of log sources which generate logs based on web traffic, among other things. I can't absolutely be sure there isn't some random encoded characters floating around. It is interesting that the xContentType method in ES detects some of my events as YAML though, I can be fairly confident there's no YAML in my log messages. I'll look into adding a patch on reviews.apache.org, and be sure to remove the 0.90.1 upgrade.
          Hide
          Edward Sargisson added a comment -

          Allan, one more thing. Would you be able to add a unit test for this new functionality?

          Show
          Edward Sargisson added a comment - Allan, one more thing. Would you be able to add a unit test for this new functionality?
          Hide
          Mike Percy added a comment -

          Yes, please attach the patch to the JIRA and add a unit test. I'd like to pull this into Flume 1.4, we will cut an RC very soon. Thanks Allan!

          Show
          Mike Percy added a comment - Yes, please attach the patch to the JIRA and add a unit test. I'd like to pull this into Flume 1.4, we will cut an RC very soon. Thanks Allan!
          Hide
          Mike Percy added a comment -

          Sorry gents, looks like this one is going to miss the 1.4.0 boat.

          Show
          Mike Percy added a comment - Sorry gents, looks like this one is going to miss the 1.4.0 boat.
          Hide
          Edward Sargisson added a comment -


          My colleague and I detected a similar failure which did not raise a YAMLException. The ElasticsearchSink is a little too aggressive in trying to assume everything is JSON. My colleague developed a more general patch which, if JSON parsing fails, catches Exception and writes the field as a simple field.

          Show
          Edward Sargisson added a comment - My colleague and I detected a similar failure which did not raise a YAMLException. The ElasticsearchSink is a little too aggressive in trying to assume everything is JSON. My colleague developed a more general patch which, if JSON parsing fails, catches Exception and writes the field as a simple field.

            People

            • Assignee:
              Ashish Paliwal
              Reporter:
              Edward Sargisson
            • Votes:
              1 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:

                Development