Apache Hudi / HUDI-1267

Additional Metadata Details for Hudi Transactions


Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 0.9.0
    • None
    • Usability, writer-core
    • None

    Description

      Whenever any of the following scenarios occurs:

      1. Custom datasource (Kafka, for instance) -> Hudi table
      2. Hudi -> Hudi table
      3. S3 -> Hudi table

      the following metadata needs to be captured:

      1. Table-level metadata
        • Operation name (record level), such as Upsert or Insert, for the last operation performed on the row
      2. Transaction-level metadata (logged at the Hudi level, not the table level)
        • Source (Kafka topic name, S3 URL of the source data, etc.)
        • Target Hudi table name
        • Last transaction time (last commit time)

      In short, point (1) captures details at the table level, and point (2) captures every transaction that happens at the Hudi level.

      Point (1) would only require adding a column for the operation type.
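
As a rough illustration of point (1), the sketch below tags each record with the operation that last touched it. It models records as plain dictionaries rather than using any Hudi API; the column name `_operation_type` and the helper `tag_records` are hypothetical names chosen for this example only.

```python
from enum import Enum


class Operation(Enum):
    """Write operations named in this issue, plus DELETE as an assumed extra."""
    INSERT = "INSERT"
    UPSERT = "UPSERT"
    DELETE = "DELETE"


# Hypothetical column name; the issue does not specify one.
OPERATION_COLUMN = "_operation_type"


def tag_records(records, operation):
    """Return copies of the records with the operation column set.

    The input records are left unmodified.
    """
    return [{**record, OPERATION_COLUMN: operation.value} for record in records]


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
tagged = tag_records(rows, Operation.UPSERT)
```

In a real writer the column would be populated as part of the write path, so that each row always reflects the most recent operation applied to it.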

      Example for point (2): suppose one ingestion runs from Kafka topic 'A' to the Hudi table 'ingest_kafka', and another runs from the RDBMS table 'tableA' through Sqoop to the Hudi table 'RDBMSingest'. The captured metadata would then be:

       

      Source             Timestamp   Transaction Type   Target
      Kafka - 'A'        XXXXXX      UPSERT             ingest_kafka
      RDBMS - 'tableA'   XXXXXX      INSERT             RDBMSingest

       

      The transaction details table from point (2) should be available as a separate, common table that can be queried as a Hudi table, or stored as Parquet and queried from Spark.
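
A minimal sketch of what such a common transaction table could hold, kept in memory here for simplicity. The class and method names (`TransactionRecord`, `TransactionLog`, `for_target`, `last_commit_time`) are hypothetical, and the timestamps are illustrative values in a `yyyyMMddHHmmss` string form, which is assumed so that string comparison matches chronological order. A real implementation would persist the rows as a Hudi table or Parquet files instead of a Python list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TransactionRecord:
    """One row of the transaction details table (columns from the example above)."""
    source: str
    timestamp: str
    transaction_type: str
    target: str


class TransactionLog:
    """In-memory stand-in for the shared transaction details table."""

    def __init__(self) -> None:
        self._records: List[TransactionRecord] = []

    def record(self, source: str, timestamp: str,
               transaction_type: str, target: str) -> TransactionRecord:
        """Append one transaction and return the stored row."""
        entry = TransactionRecord(source, timestamp, transaction_type, target)
        self._records.append(entry)
        return entry

    def for_target(self, target: str) -> List[TransactionRecord]:
        """Return all transactions written to the given target table."""
        return [r for r in self._records if r.target == target]

    def last_commit_time(self, target: str) -> Optional[str]:
        """Return the latest timestamp for the target, or None if there is none."""
        times = [r.timestamp for r in self.for_target(target)]
        return max(times) if times else None


log = TransactionLog()
# Timestamps below are made-up illustrative values.
log.record("Kafka - 'A'", "20200101120000", "UPSERT", "ingest_kafka")
log.record("RDBMS - 'tableA'", "20200101123000", "INSERT", "RDBMSingest")
log.record("Kafka - 'A'", "20200101130000", "UPSERT", "ingest_kafka")
```

Keeping one row per transaction, rather than one row per table, means the last commit time can be derived by query while the full history of sources and operation types stays available.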

          People

            Assignee: Unassigned
            Reporter: Ashish M G (ashishmg)