Uploaded image for project: 'Apache Hudi'
  1. Apache Hudi
  2. HUDI-310

DynamoDB/Kinesis Change Capture using Delta Streamer

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • deltastreamer
    • None

    Description

      The goal here is to do CDC from DynamoDB and then have it be ingested into S3 as a Hudi dataset 

      Few resources: 

      1. DynamoDB Streams https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html  provides change capture logs in Kinesis. 
      2. Walkthrough https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html Code https://github.com/awslabs/dynamodb-streams-kinesis-adapter 
      3. Spark Streaming has support for reading Kinesis streams https://spark.apache.org/docs/2.4.4/streaming-kinesis-integration.html one of the many resources showing how to change the Spark Kinesis example code to consume dynamodb stream   https://medium.com/@ravi72munde/using-spark-streaming-with-dynamodb-d325b9a73c79
      4. In DeltaStreamer, we need to add some form of KinesisSource that returns a RDD with new data everytime `fetchNewData` is called https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/Source.java  . DeltaStreamer itself does not use Spark Streaming APIs
      5. Internally, we have Avro, Json, Row sources that extract data in these formats. 

      Open questions : 

      1. Should this just be a KinesisSource inside Hudi, that needs to be configured differently or do we need two sources: DynamoDBKinesisSource (that does some DynamoDB Stream specific setup/assumptions) and a plain KinesisSource. What's more valuable to do , if we have to pick one. 
      2. For Kafka integration, we just reused the KafkaRDD in Spark Streaming easily and avoided writing a lot of code by hand. Could we pull the same thing off for Kinesis? (probably needs digging through Spark code) 
      3. What's the format of the data for DynamoDB streams? 

       

       

      We should probably flesh these out before going ahead with implementation? 

       

      Attachments

        Issue Links

          Activity

            People

              vinaypatil18 Vinay
              vinoth Vinoth Chandar
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated: