Uploaded image for project: 'Camel'
  1. Camel
  2. CAMEL-16594

DynamoDB stream updates are missed when there are more than one active shards

Agile BoardAttach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 3.12.0
    • camel-aws
    • None
    • Advanced

    Description

      The current Camel ddbstream implementation seems to incorrectly apply the concept of ShardIteratorType to the list of shards forming a DynamoDB stream rather than each shard individually.

      According to the AWS documentation:

      ShardIteratorType determines how the shard iterator is used to start reading stream records from the shard.
      

      For example, for a given shard, when ShardIteratorType equal to LATEST, the AWS SDK will read the most recent data in that particular shard. However, when ShardIteratorType equal to LATEST, Camel will additionally use ShardIteratorType to determine which shard it considers amongst all the available ones in the stream: https://github.com/apache/camel/blob/6119fdc379db343030bd25b191ab88bbec34d6b6/components/camel-aws/camel-aws2-ddb/src/main/java/org/apache/camel/component/aws2/ddbstream/ShardIteratorHandler.java#L132

      If my understanding is correct, shards in DynamoDB are modelled as a tree, with the child leaf nodes being the shards that are still active, i.e. the ones where new stream data will appear. These child shards will have a StartingSequenceNumber, but no EndingSequenceNumber.

      The most common case is to have a single shard, or a single branch of parent and child nodes:

      Shard0
         |
      Shard1
      

      In the above case, new data will be added to Shard1, and the Camel implementation which looks only at the last shard when ShardIteratorType is equal to LATEST, will be correct.

      However, the tree can also look like this (see related example in the attached JSON output from the AWS CLI, where the shard number matches the index in the JSON list):

                   Shard0
                  /      \
           Shard1          Shard2
          /      \        /      \ 
      Shard3   Shard4  Shard5   Shard6
      

      In this case, Camel will only consider Shard6, even though new data may be added to any of Shard3, Shard4, Shard5 or Shard6. This leads to updates being missed.

      As far as I can tell, DynamoDB will split into multiple shards depending on the number of table partitions, which will either grow for a table with huge amounts of data, or when an exiting table with provisioned capacity is migrated to on-demand provisioning.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            acosentino Andrea Cosentino
            Pyves Pierre-Yves Bigourdan
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment