Flume / FLUME-2220

ElasticSearch sink - duplicate fields in indexed document

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: None

      Description

      The default serializer for the ElasticSearch sink (ElasticSearchLogStashEventSerializer) duplicates header fields that are mapped to default logstash fields.
      For instance, timestamp, source and host appear both as logstash fields ("@timestamp", "@source_host", etc.) and as fields under "@fields" ("@fields.timestamp", "@fields.host").
      When a header is written as a logstash system field, it should be removed from the header map so that it is not written again under "@fields".
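
The duplication described above can be reproduced with plain Java maps. This is only a minimal sketch: the class `DuplicateFieldsDemo`, its methods, and the header-to-field mapping (in particular `source` to `@source`) are assumptions made for illustration and are not taken from the Flume source.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the serializer's header handling; this is not Flume code.
class DuplicateFieldsDemo {

    // Assumed mapping from header name to logstash system field.
    static final Map<String, String> SYSTEM_FIELDS = Map.of(
            "timestamp", "@timestamp",
            "source", "@source",
            "host", "@source_host");

    // Mimics the reported behaviour: promote some headers to system fields,
    // then copy *every* header under "@fields" without removing the promoted ones.
    static Map<String, Object> serialize(Map<String, String> headers) {
        Map<String, Object> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> mapping : SYSTEM_FIELDS.entrySet()) {
            String value = headers.get(mapping.getKey());
            if (value != null) {
                doc.put(mapping.getValue(), value);
            }
        }
        doc.put("@fields", new LinkedHashMap<String, String>(headers));
        return doc;
    }

    // Returns the header names that appear both as a system field and under "@fields".
    static Set<String> duplicatedHeaders(Map<String, Object> doc) {
        Map<?, ?> fields = (Map<?, ?>) doc.get("@fields");
        Set<String> duplicated = new TreeSet<>();
        for (Map.Entry<String, String> mapping : SYSTEM_FIELDS.entrySet()) {
            if (doc.containsKey(mapping.getValue()) && fields.containsKey(mapping.getKey())) {
                duplicated.add(mapping.getKey());
            }
        }
        return duplicated;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("timestamp", "1380000000000"); // sample values, invented for the demo
        headers.put("host", "web01");
        headers.put("app", "billing");
        System.out.println(duplicatedHeaders(serialize(headers)));
    }
}
```

With the sample headers, `timestamp` and `host` are reported as duplicated, while the custom `app` header appears only under "@fields".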

        Issue Links

          Activity

          dib.ghosh Dib Ghosh added a comment -

          Hi Rotem,

          This issue is due to the v0 logstash JSON schema used by Flume. Internally, Flume's ElasticSearchSink adds @source and @source_host to mimic the v0 logstash format. This should be resolved by migrating to the v1 logstash JSON schema. There is an open bug on Flume for this (https://issues.apache.org/jira/browse/FLUME-2099), and the v0 schema problem is described in the logstash tracker at https://logstash.jira.com/browse/LOGSTASH-675.

          To quote the issue from the logstash bug list:
          "The current logstash json schema has a few problems:
          It uses two namespacing techniques when only one is needed ("@" prefixing, like "@source", and "@fields" object for another namespace)
          @source_host and @source_path duplicate @source."

          I am also linking your ticket to FLUME-2099.

          Hope this helps,

          Dib

          rore Rotem Hermon added a comment -

          Hi Dib

          v1 will certainly make the schema less noisy, but this issue is not due to the v0 schema; it just seems like a bug in the sink. The serializer creates a map of headers, extracts some fields from this map and sets them as top-level fields, and then goes over all the items in the map and adds them under the "@fields" field. So the items that were extracted earlier and already added as logstash fields are added again under "@fields". This is redundant. Items that were already added should be removed from the map before the generic pass so they won't appear twice.

          Hope I managed to be clear. If I get to it, I'll try to attach a code fix (still trying to understand the procedure for submitting code to an Apache project...).
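
The fix proposed in this comment, removing each promoted header from the working map before the generic pass, might look roughly like the following. This is a sketch under stated assumptions, not the attached patch: `DedupSerializerSketch`, `serialize` and `promote` are hypothetical names, the header-to-field mapping is assumed, and plain maps stand in for the sink's real JSON builder.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the proposed fix; plain maps stand in for the real document builder.
class DedupSerializerSketch {

    static Map<String, Object> serialize(Map<String, String> eventHeaders) {
        // Work on a copy so the caller's header map is left untouched.
        Map<String, String> headers = new HashMap<>(eventHeaders);
        Map<String, Object> doc = new LinkedHashMap<>();

        // Assumed mapping of header names to logstash system fields.
        promote(headers, "timestamp", "@timestamp", doc);
        promote(headers, "source", "@source", doc);
        promote(headers, "host", "@source_host", doc);

        // Only headers that were not promoted remain for the generic pass.
        doc.put("@fields", headers);
        return doc;
    }

    // Moves a header into a top-level field: remove() deletes it from the map
    // and returns its value, so it cannot be written again under "@fields".
    private static void promote(Map<String, String> headers, String header,
                                String systemField, Map<String, Object> doc) {
        String value = headers.remove(header);
        if (value != null) {
            doc.put(systemField, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("timestamp", "1380000000000"); // sample values, invented for the demo
        headers.put("host", "web01");
        headers.put("app", "billing");
        System.out.println(serialize(headers));
    }
}
```

Copying the headers first keeps the removal local to serialization, so the event's own header map is not modified as a side effect.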

          dib.ghosh Dib Ghosh added a comment -

          Hi Rotem,

          My apologies for misinterpreting your bug post. I also started with the Flume commit process only very recently.

          > Typically, you start by assigning the ticket to yourself.
          > Make the changes in your local Flume copy.
          > Test the changes.
          > Create a diff file of the patch once you have tested the changes.
          > Go to https://reviews.apache.org, upload the diff file, tag the review with the JIRA issue and set the group to Flume.
          > Upload the patch to the JIRA ticket and mark the ticket as patch-available.
          > Once someone reviews the patch and marks it OK to ship, commit your patch to the appropriate branches in git; in this case I think it should be trunk and the Flume-1.5 branch.

          You can find more details here (https://cwiki.apache.org/confluence/display/FLUME/Developers+Quick+Hack+Sheet) and here (https://cwiki.apache.org/confluence/display/FLUME/How+to+Contribute).

          Hope this helps,

          Dib

          dib.ghosh Dib Ghosh added a comment - - edited

          Hi Rotem,

          Correction to the Flume commit process outline I posted in the previous comment:

          ----- CORRECTION (last step in the Flume commit process comment above) -----

          > A Flume committer / contributor reviews the patch, marks it OK to ship, and commits your patch to the appropriate branches in git; in this case I think it should be trunk and the Flume-1.5 branch.
          You WON'T be pushing the code from your local branch; someone with committer privileges pushes the code.

          rore Rotem Hermon added a comment -

          Thanks Dib, I'll try to get to it soon.

          rore Rotem Hermon added a comment -

          OK, review link - https://reviews.apache.org/r/14993/

          dib.ghosh Dib Ghosh added a comment - - edited

          Thanks Rotem for the patch. Looks fine to me. Now please wait for a Flume contributor / committer to review it.

          Meanwhile, I downloaded the diff file from Review Board and attached it to the JIRA ticket as per the Flume patch submission process. Hope you don't mind me uploading the patch to the JIRA. Also, Rotem, please assign the ticket to yourself and mark it as Patch Available so that the Flume community knows about the patch.

          Best,

          dib

          hshreedharan Hari Shreedharan added a comment -

          Dib Ghosh - Unfortunately, the author himself has to attach the patch to the jira, since that is what grants the ASF the license to include the patch in ASF projects.

          Rotem Hermon - Please attach the patch to this jira

          dib.ghosh Dib Ghosh added a comment -

          Hari Shreedharan - My apologies for the slip up.

          rore Rotem Hermon added a comment - - edited

          Attached the patch.
          How do I assign the ticket to myself? I don't seem to have a way to set the assignee.
          Is there something else needed?

          hshreedharan Hari Shreedharan added a comment -

          Assigned to you. You should now be able to assign jiras to yourself, Rotem

          dib.ghosh Dib Ghosh added a comment -

          Rotem Hermon, you need ticket assignment rights on JIRA to do that. Also, mark the JIRA as Patch Available once you have been given the JIRA permission.

          hshreedharan Hari Shreedharan added a comment -

          Not a problem, sir. It is required so that the necessary license, copyright, etc. are granted to the ASF.

          Thanks,
          Hari

          rore Rotem Hermon added a comment -

          Thanks.


            People

            • Assignee: Rotem Hermon
            • Reporter: Rotem Hermon
            • Votes: 0
            • Watchers: 3
