Details
-
New Feature
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
The goal here is to do CDC from DynamoDB and then have it be ingested into S3 as a Hudi dataset
Few resources:
- DynamoDB Streams https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html provides change capture logs in Kinesis.
- Walkthrough https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html Code https://github.com/awslabs/dynamodb-streams-kinesis-adapter
- Spark Streaming has support for reading Kinesis streams https://spark.apache.org/docs/2.4.4/streaming-kinesis-integration.html one of the many resources showing how to change the Spark Kinesis example code to consume dynamodb stream https://medium.com/@ravi72munde/using-spark-streaming-with-dynamodb-d325b9a73c79
- In DeltaStreamer, we need to add some form of KinesisSource that returns a RDD with new data everytime `fetchNewData` is called https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/Source.java . DeltaStreamer itself does not use Spark Streaming APIs
- Internally, we have Avro, Json, Row sources that extract data in these formats.
Open questions :
- Should this just be a KinesisSource inside Hudi, that needs to be configured differently or do we need two sources: DynamoDBKinesisSource (that does some DynamoDB Stream specific setup/assumptions) and a plain KinesisSource. What's more valuable to do , if we have to pick one.
- For Kafka integration, we just reused the KafkaRDD in Spark Streaming easily and avoided writing a lot of code by hand. Could we pull the same thing off for Kinesis? (probably needs digging through Spark code)
- What's the format of the data for DynamoDB streams?
We should probably flesh these out before going ahead with implementation?
Attachments
Issue Links
- is duplicated by
-
HUDI-1386 AWS kinesis data source for DeltaStreamer
- Closed