Details
Improvement
Status: Open
Major
Resolution: Unresolved
0.9.0
None
None
Description
Whenever one of the following scenarios occurs:
- Custom datasource (Kafka, for instance) -> Hudi table
- Hudi -> Hudi table
- S3 -> Hudi table
The following metadata needs to be captured:
- (1) Table Level Metadata
  - Operation name (record level), such as Upsert or Insert, for the last operation performed on the row
- (2) Transaction Level Metadata (this will be logged at the Hudi level, not the table level)
  - Source (Kafka topic name, S3 URL of the source data in the case of S3, etc.)
  - Target Hudi table name
  - Last transaction time (last commit time)
In short, point (1) collects the details at the table level, and point (2) collects all transactions that happened at the Hudi level.
Point (1) would just be an additional column for the operation type.
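To make the column addition for point (1) concrete, here is a minimal sketch in plain Python, not Hudi code. The column name `_operation_type`, the `Operation` enum, and the `tag_records` helper are all hypothetical names chosen for illustration; the real change would presumably live in Hudi's write path.

```python
from enum import Enum

# Hypothetical column name, chosen for illustration only.
OPERATION_COLUMN = "_operation_type"


class Operation(str, Enum):
    """Operation names mentioned in this ticket."""
    INSERT = "INSERT"
    UPSERT = "UPSERT"


def tag_records(records, operation):
    """Return copies of the records with the last operation stored per row."""
    return [{**record, OPERATION_COLUMN: operation.value} for record in records]


# Two incoming rows written with an upsert.
rows = tag_records([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], Operation.UPSERT)
```

Each row keeps its original fields and gains one extra column, so existing readers that ignore unknown columns would be unaffected under this assumption.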
Example for point (2): suppose we had one ingestion from Kafka topic 'A' to Hudi table 'ingest_kafka' and another ingestion from RDBMS table 'tableA' through Sqoop to Hudi table 'RDBMSingest'. The metadata captured would then be:
| Source           | Timestamp | Transaction Type | Target       |
| Kafka - 'A'      | XXXXXX    | UPSERT           | ingest_kafka |
| RDBMS - 'tableA' | XXXXXX    | INSERT           | RDBMSingest  |
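The rows above could be modelled roughly as follows. This is a plain-Python sketch of the transaction-level log from point (2), not an existing Hudi API: `TransactionRecord` and `TransactionLog` are hypothetical names, and the timestamps are made-up illustrative values standing in for the XXXXXX placeholders.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TransactionRecord:
    """One row of the transaction-level metadata table."""
    source: str            # e.g. Kafka topic name or S3 URL
    timestamp: str         # commit time; fixed-width string assumed
    transaction_type: str  # e.g. UPSERT, INSERT
    target: str            # target Hudi table name


class TransactionLog:
    """In-memory stand-in for the common transaction details table."""

    def __init__(self):
        self._entries = []

    def record(self, source, timestamp, transaction_type, target):
        entry = TransactionRecord(source, timestamp, transaction_type, target)
        self._entries.append(entry)
        return entry

    def rows(self):
        return [asdict(entry) for entry in self._entries]

    def last_transaction_time(self, target):
        # Fixed-width timestamps compare correctly as strings.
        times = [e.timestamp for e in self._entries if e.target == target]
        return max(times) if times else None


log = TransactionLog()
# Illustrative timestamps only; the ticket leaves them as XXXXXX.
log.record("Kafka - 'A'", "20210101120000", "UPSERT", "ingest_kafka")
log.record("RDBMS - 'tableA'", "20210101123000", "INSERT", "RDBMSingest")
```

Calling `log.last_transaction_time("ingest_kafka")` returns the latest recorded commit time for that target, which corresponds to the "last transaction time" field requested above.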
The transaction details table in point (2) should be available as a separate common table that can either be queried as a Hudi table or be stored as Parquet and queried from Spark.