This is the umbrella Jira for FLIP39, which intends to enhance the scalability and ease of use of Flink ML.
Google Doc (to be converted to an official Confluence page very soon): https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo
In machine learning, there are mainly two types of people. The first type is MLlib developers, who need a set of standard, well-abstracted core ML APIs to implement algorithms; every ML algorithm is a concrete implementation on top of these APIs. The second type is MLlib users, who use the existing, packaged MLlib to train or serve a model. An entire training or inference job is commonly built as a sequence of transformations or algorithms, so it is essential to provide a workflow/pipeline API that lets MLlib users easily combine multiple algorithms to describe the ML workflow/pipeline.
Flink currently has a set of ML core interfaces, but they are built on top of the DataSet API. This does not align well with the latest Flink roadmap (the Table API will become the first-class citizen and primary API for analytics use cases, while the DataSet API will be gradually deprecated). Moreover, Flink at present has no interface that allows MLlib users to describe an ML workflow/pipeline, nor does it provide any way to persist a pipeline or model and reuse it later. To address these issues, in this FLIP we propose to:
- Provide a new set of ML core interfaces (on top of the Flink Table API)
- Provide an ML pipeline interface (on top of the Flink Table API)
- Provide interfaces for parameter management and pipeline persistence
- All of the above interfaces should accommodate any new ML algorithm. We will gradually add standard ML algorithms on top of the newly proposed interfaces to verify their feasibility and scalability.
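
To make the pipeline idea above more concrete, here is a minimal, self-contained sketch of how transformers and trainable stages might be chained. It is only an illustration under stated assumptions, not the proposed Flink API: the names PipelineStage, Transformer, Estimator, Pipeline, Scaler and MeanCenterer are hypothetical, and a plain List<Double> stands in for a Flink Table so the example runs without Flink. Parameter management and persistence are left out.

```java
import java.util.ArrayList;
import java.util.List;

public class PipelineSketch {
    /** Hypothetical marker for anything that can be placed in a pipeline. */
    interface PipelineStage {}

    /** Stateless or already-fitted stage: maps a table to a table. */
    interface Transformer extends PipelineStage {
        List<Double> transform(List<Double> input);
    }

    /** Trainable stage: learns from a table and yields a fitted Transformer. */
    interface Estimator extends PipelineStage {
        Transformer fit(List<Double> input);
    }

    /** Example transformer: multiplies every value by a fixed factor. */
    static final class Scaler implements Transformer {
        private final double factor;
        Scaler(double factor) { this.factor = factor; }
        @Override public List<Double> transform(List<Double> input) {
            List<Double> out = new ArrayList<>();
            for (double v : input) out.add(v * factor);
            return out;
        }
    }

    /** Example estimator: learns the mean, then subtracts it from any input. */
    static final class MeanCenterer implements Estimator {
        @Override public Transformer fit(List<Double> input) {
            double sum = 0.0;
            for (double v : input) sum += v;
            final double mean = input.isEmpty() ? 0.0 : sum / input.size();
            return data -> {
                List<Double> out = new ArrayList<>();
                for (double v : data) out.add(v - mean);
                return out;
            };
        }
    }

    /** A pipeline is itself an estimator: fitting it fits each stage in order. */
    static final class Pipeline implements Estimator {
        private final List<PipelineStage> stages = new ArrayList<>();

        Pipeline add(PipelineStage stage) { stages.add(stage); return this; }

        @Override public Transformer fit(List<Double> input) {
            List<Transformer> fitted = new ArrayList<>();
            List<Double> current = input;
            for (PipelineStage stage : stages) {
                // Estimators are trained on the output of the preceding stages.
                Transformer t = (stage instanceof Estimator)
                        ? ((Estimator) stage).fit(current)
                        : (Transformer) stage;
                fitted.add(t);
                current = t.transform(current);
            }
            // The fitted pipeline applies every fitted stage in sequence.
            return data -> {
                List<Double> out = data;
                for (Transformer t : fitted) out = t.transform(out);
                return out;
            };
        }
    }

    public static void main(String[] args) {
        List<Double> training = List.of(1.0, 2.0, 3.0);
        Transformer model = new Pipeline()
                .add(new Scaler(2.0))
                .add(new MeanCenterer())
                .fit(training);
        System.out.println(model.transform(training));     // [-2.0, 0.0, 2.0]
        System.out.println(model.transform(List.of(5.0))); // [6.0]
    }
}
```

In this sketch the pipeline is itself trainable and produces a single fitted object, so a user can describe the whole workflow once and reuse the result for inference. A real design on top of the Table API would also need the parameter and persistence interfaces listed above.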