  1. Apache Hudi
  2. HUDI-613

Refactor and enhance the Transformer component

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved

    Description

      Hudi currently has a component that is not widely used: the Transformer. Data preprocessing and ETL are very common operations before raw data lands in the data lake, and they are also among the most common use cases for compute engines such as Flink and Spark. Since Hudi already builds on these engines, it can naturally reuse their preprocessing capabilities as well. We can refactor the Transformer to make it more flexible, along the following lines:

      • Decouple Transformer from Spark
      • Enrich the Transformer and provide built-in Transformers
      • Support Transformer chains

      For the first point, the Transformer interface is tightly coupled to Spark by design and takes a Spark-specific context. Once Hudi supports multiple engines, this coupling would prevent us from using the transform capabilities of other engines (such as Flink). Therefore, the interface needs to be decoupled from Spark.
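
      One possible shape for such a decoupling is sketched below. This is only an illustration under assumptions: the names EngineTransformer, LocalContext and UpperCaseTransformer are hypothetical and are not part of Hudi. The idea is to make the engine's context type and dataset type generic parameters, so that a Spark binding and a Flink binding could each supply their own types. An in-memory list stands in for the engine dataset here so that the sketch stays self-contained.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical engine-neutral contract (not Hudi's actual interface).
// C is the engine-specific context type, D is the engine's dataset type.
interface EngineTransformer<C, D> {
    D apply(C context, D input, Map<String, String> properties);
}

// Stand-in context for this sketch; a real binding would pass the
// engine's own context or environment object instead.
final class LocalContext {
    final String appName;

    LocalContext(String appName) {
        this.appName = appName;
    }
}

// Example implementation bound to the in-memory "engine":
// upper-cases every record in the input list.
final class UpperCaseTransformer implements EngineTransformer<LocalContext, List<String>> {
    @Override
    public List<String> apply(LocalContext context, List<String> input,
                              Map<String, String> properties) {
        return input.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }
}
```

      With this shape, the Transformer contract itself no longer names any Spark class; only the concrete bindings do.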

      For the second point, we can enhance the Transformer and provide some out-of-the-box Transformers, such as FilterTransformer, FlatMapTransformer, and so on.
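
      To make the two names above concrete, here is a minimal sketch of what such built-in Transformers might look like. The RecordTransformer interface and both class bodies are assumptions made for illustration; the issue only proposes the names FilterTransformer and FlatMapTransformer. In-memory lists again stand in for an engine dataset.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical, simplified contract over in-memory lists; a real version
// would operate on the engine's dataset type instead.
interface RecordTransformer<I, O> {
    List<O> apply(List<I> input);
}

// Keeps only the records that satisfy the predicate.
final class FilterTransformer<T> implements RecordTransformer<T, T> {
    private final Predicate<T> predicate;

    FilterTransformer(Predicate<T> predicate) {
        this.predicate = predicate;
    }

    @Override
    public List<T> apply(List<T> input) {
        return input.stream().filter(predicate).collect(Collectors.toList());
    }
}

// Expands each input record into zero or more output records.
final class FlatMapTransformer<I, O> implements RecordTransformer<I, O> {
    private final Function<I, List<O>> expander;

    FlatMapTransformer(Function<I, List<O>> expander) {
        this.expander = expander;
    }

    @Override
    public List<O> apply(List<I> input) {
        return input.stream()
                .flatMap(record -> expander.apply(record).stream())
                .collect(Collectors.toList());
    }
}
```

      Users would then only supply a predicate or an expansion function rather than writing a full Transformer class.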

      For the third point, the most common pattern for data processing is the pipeline model, which is commonly implemented with the chain-of-responsibility pattern (compare Apache Commons Chain [1]). Combining multiple Transformers makes data processing more flexible and extensible.
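
      A Transformer chain could be sketched roughly as follows. Note the assumptions: PipelineTransformer and TransformerChain are hypothetical names, and this is a plain sequential composition in which each stage consumes the previous stage's output. It does not use the Apache Commons Chain API, and it has no way for a stage to stop the chain early.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-stage contract over in-memory lists.
interface PipelineTransformer<T> {
    List<T> apply(List<T> input);
}

// Runs the registered stages in insertion order, feeding each stage
// the output of the previous one. The chain is itself a stage, so
// chains can be nested inside other chains.
final class TransformerChain<T> implements PipelineTransformer<T> {
    private final List<PipelineTransformer<T>> stages = new ArrayList<>();

    // Appends a stage and returns this chain to allow fluent calls.
    TransformerChain<T> add(PipelineTransformer<T> stage) {
        stages.add(stage);
        return this;
    }

    @Override
    public List<T> apply(List<T> input) {
        List<T> current = input;
        for (PipelineTransformer<T> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }
}
```

      Because the chain implements the same interface as a single stage, callers would not need to know whether they hold one Transformer or several.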

      If we enhance the Transformer component, Hudi will be able to offer richer data processing capabilities on top of the compute engine.

      The relevant discussion thread is here: https://lists.apache.org/thread.html/rfad2e71fc432922ca567432b7b6e1dd9c3bb102822177b73dbff2d90%40%3Cdev.hudi.apache.org%3E

          People

            Assignee: Unassigned
            Reporter: vinoyang (yanghua)