[FLINK-12470] FLIP39: Flink ML pipeline and ML libs - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Not a Priority
Resolution: Won't Do
Affects Version/s: 1.9.0
Fix Version/s: None
Component/s: Library / Machine Learning
Labels:

Description

This is the umbrella Jira for FLIP39, which intends to enhance the scalability and ease of use of Flink ML.

ML Discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-39-Flink-ML-pipeline-and-ML-libs-td28633.html

Google Doc (to be converted to an official Confluence page very soon): https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo

In machine learning, there are mainly two types of people. The first type is MLlib developers, who need a set of standard, well-abstracted core ML APIs to implement algorithms. Every ML algorithm is a concrete implementation on top of these APIs. The second type is MLlib users, who use the existing, packaged MLlib to train or serve a model. It is common for an entire training or inference job to be constructed as a sequence of transformations or algorithms. It is therefore essential to provide a workflow/pipeline API so that MLlib users can easily combine multiple algorithms to describe an ML workflow/pipeline.
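The idea of a training or inference job built as a sequence of transformations or algorithms can be sketched in a few lines of Python. This is only an illustration of the concept, not the FLIP39 API: the names `Transformer`, `Estimator`, `Pipeline`, `PipelineModel`, `Scale`, `MeanCenter` and `Shift` are assumptions made for this example, and a plain list of row dictionaries stands in for a Flink Table.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

# Stand-in for a Flink Table: a list of rows, each row a dict of column values.
Table = List[Dict[str, float]]


class Transformer(ABC):
    """A stage that maps one table to another (hypothetical interface)."""

    @abstractmethod
    def transform(self, table: Table) -> Table: ...


class Estimator(ABC):
    """A stage that learns from a table and returns a fitted Transformer."""

    @abstractmethod
    def fit(self, table: Table) -> Transformer: ...


class Scale(Transformer):
    """Multiply one column by a fixed factor."""

    def __init__(self, col: str, factor: float):
        self.col, self.factor = col, factor

    def transform(self, table: Table) -> Table:
        return [{**row, self.col: row[self.col] * self.factor} for row in table]


class Shift(Transformer):
    """Add a fixed offset to one column."""

    def __init__(self, col: str, offset: float):
        self.col, self.offset = col, offset

    def transform(self, table: Table) -> Table:
        return [{**row, self.col: row[self.col] + self.offset} for row in table]


class MeanCenter(Estimator):
    """Learn the column mean and return a Shift that subtracts it."""

    def __init__(self, col: str):
        self.col = col

    def fit(self, table: Table) -> Transformer:
        mean = sum(row[self.col] for row in table) / len(table)
        return Shift(self.col, -mean)


class PipelineModel(Transformer):
    """A fitted pipeline: applies its transformers in order."""

    def __init__(self, stages: List[Transformer]):
        self.stages = stages

    def transform(self, table: Table) -> Table:
        for stage in self.stages:
            table = stage.transform(table)
        return table


class Pipeline(Estimator):
    """Chain stages; estimators are fitted on the output of earlier stages."""

    def __init__(self, stages):
        self.stages = list(stages)

    def fit(self, table: Table) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            model = stage.fit(table) if isinstance(stage, Estimator) else stage
            table = model.transform(table)  # feed the next stage
            fitted.append(model)
        return PipelineModel(fitted)
```

In this sketch an MLlib developer would implement stages such as `MeanCenter`, while an MLlib user would only write something like `Pipeline([Scale("x", 2.0), MeanCenter("x")]).fit(train)` and then call `transform` on new data.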

Flink currently has a set of ML core interfaces, but they are built on top of the DataSet API. This does not align with the latest Flink roadmap (the Table API will become the first-class citizen and primary API for analytics use cases, while the DataSet API will be gradually deprecated). Moreover, Flink at present has no interface that allows MLlib users to describe an ML workflow/pipeline, and provides no way to persist a pipeline or model and reuse it in the future. To address these issues, in this FLIP we propose to:

Provide a new set of ML core interfaces (on top of the Flink Table API)
Provide an ML pipeline interface (on top of the Flink Table API)
Provide interfaces for parameter management and pipeline persistence

All of the above interfaces should facilitate any new ML algorithm. We will gradually add various standard ML algorithms on top of these newly proposed interfaces to ensure their feasibility and scalability.
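As a rough illustration of what parameter management and persistence could look like, the following Python sketch pairs a typed parameter descriptor with a parameter container that round-trips through JSON. The class names `ParamInfo` and `Params` echo the sub-task titles of this issue, but every field and method shown here (`set`, `get`, `to_json`, `from_json`, and the `THRESHOLD` parameter) is an assumption for the example and is not taken from the actual Flink code.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ParamInfo:
    """Describes one parameter: its name, a description, and a default value."""
    name: str
    description: str
    default: Any = None


class Params:
    """Holds parameter values by name and can be saved to and loaded from JSON."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}

    def set(self, info: ParamInfo, value: Any) -> "Params":
        self._values[info.name] = value
        return self  # allow chained calls

    def get(self, info: ParamInfo) -> Any:
        # Fall back to the declared default when the value was never set.
        return self._values.get(info.name, info.default)

    def to_json(self) -> str:
        return json.dumps(self._values, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "Params":
        params = cls()
        params._values = json.loads(text)
        return params


# Hypothetical parameter for a binarizer-style stage.
THRESHOLD = ParamInfo("threshold", "Values above this become 1.0", default=0.0)

saved = Params().set(THRESHOLD, 0.5).to_json()
restored = Params.from_json(saved)
```

Keeping the descriptor (`ParamInfo`) separate from the values (`Params`) means a stage can declare its parameters once and rely on defaults, while the stored JSON contains only the values that were explicitly set.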

Attachments

Issue Links

duplicates

FLINK-11095 Table based ML Pipeline

Closed

Sub-Tasks

1.

Add the interface of ML pipeline and ML lib

Closed

Luo Gen

0%

Original Estimate - 168h

Remaining Estimate - 167h 20m

2.

Remove the legacy flink-libraries/flink-ml

Closed

Luo Gen

100%

Original Estimate - Not Specified

Time Spent - 20m

3.

Summarizer: summary statistics for Table

Closed

Unassigned

4.

ML common parameters

Closed

Xu Yang

100%

Original Estimate - Not Specified

Time Spent - 10m

5.

Sparse and dense vector class, and dense matrix class with basic operations.

Resolved

Xu Yang

100%

Original Estimate - Not Specified

Time Spent - 10m

6.

Add flink-ml-lib module

Closed

Luo Gen

100%

Original Estimate - Not Specified

Time Spent - 20m

7.

Add more functionalities for ML Params and ParamInfo class

Closed

Xu Yang

100%

Original Estimate - Not Specified

Time Spent - 20m

8.

Introduce FlinkML import/export framework

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

9.

Add the algorithm of Fast Fourier Transformation(FFT)

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

10.

Add unary loss functions

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

11.

Add an implementation of pipeline's api

Resolved

Xu Yang

100%

Original Estimate - Not Specified

Time Spent - 20m

12.

Add the Mapper and related classes for later algorithm implementations.

Resolved

Xu Yang

100%

Original Estimate - Not Specified

Time Spent - 20m

13.

Add an util class to build result row and generate result schema.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 20m

14.

Add two utils for Table transformations.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 20m

15.

Add an implement of collector with the row type

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

16.

Add the utility class for the Table

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 20m

17.

Add abstract classes for three typical scenarios of (Flat)Mapper.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

18.

Add class for BinarizerMapper

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

19.

Add class for VectorAssemblerMapper

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

20.

Add class for VectorEleWiseProductMapper

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

21.

Add class for VectorInteractionMapper

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

22.

Add class of Vector Normalize Mapper

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

23.

Add class of Vector Size Hint Mapper

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

24.

Add class of Vector Slice Mapper

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

25.

Add class of Vector to Columns mapper

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

26.

Add Built-in vector types

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 20m

27.

Add class for PolynomialExpansionMapper

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

28.

Add class for FeatureHasherMapper.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

29.

Add the interface of ModelDataConverter, and several base classes that implement this interface.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

30.

Add class for NLPConstant.

Closed

Unassigned

31.

Add class for DocHashTFVectorizerMapper.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

32.

Add several base classes of summarizer.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

33.

Add class for NGramMapper.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

34.

Add class for RegexTokenizerMapper.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

35.

Add class for TokenizerMapper.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

36.

Add summarizer and summary for table.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

37.

Add summarizer and summary for sparse vector and dense vector.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

38.

Add class for DocHashIDFVectorizerModelMapper.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

39.

Add class for DocCountVectorizerMapper.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

40.

Add to BLAS a method that performs DenseMatrix and SparseVector multiplication.

Resolved

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 20m

41.

Add the class for multivariate Gaussian Distribution.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 20m

42.

Add a wrapper class of a JSON library to provide the unified json format.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

43.

Add an abstract class for mappers with rich model.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

44.

Add the model mapper for Gaussian Mixture model.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

45.

Add class for SqlOperators, and add sql operations to AlgoOperator, BatchOperator and StreamOperator.

Closed

Unassigned

100%

Original Estimate - Not Specified

Time Spent - 10m

Activity

People

Assignee: Unassigned

Reporter: Shaoxuan Wang

Votes: 0

Watchers: 25

Dates

Created: 10/May/19 02:23

Updated: 19/Apr/23 01:39

Resolved: 19/Apr/23 01:39

Time Tracking

Estimated: 888h

Remaining: 887h 10m

Logged: 9h 40m