FLINK-12470

FLIP39: Flink ML pipeline and ML libs

    Details

      Description

      This is the umbrella Jira for FLIP39, which intends to enhance the scalability and ease of use of Flink ML.

      ML Discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-39-Flink-ML-pipeline-and-ML-libs-td28633.html

      Google Doc (to be converted to an official Confluence page soon): https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo

      In machine learning, there are mainly two types of people. The first type is MLlib developers, who need a set of standard, well-abstracted core ML APIs to implement algorithms. Every ML algorithm is a concrete implementation on top of these APIs. The second type is MLlib users, who use the existing, packaged MLlib to train or serve a model. It is common for an entire training or inference job to be constructed as a sequence of transformations or algorithms, so it is essential to provide a workflow/pipeline API that lets MLlib users easily combine multiple algorithms to describe the ML workflow/pipeline.
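      To make the idea of combining multiple algorithms into one workflow concrete, here is a minimal, self-contained sketch. It is not the API proposed in the FLIP: `Table` is a toy stand-in for a Flink Table, and `Stage`, `Transformer`, `Estimator`, `Pipeline`, `Scaler` and `MeanCenterer` are hypothetical names chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative sketch only: every type here is a hypothetical stand-in, not Flink's API. */
final class PipelineSketch {

    /** Toy stand-in for a Flink Table: a single numeric column. */
    static final class Table {
        final List<Double> values;

        Table(List<Double> values) {
            this.values = values;
        }
    }

    /** Marker type so one pipeline can hold both kinds of stage. */
    interface Stage {}

    /** Maps one table to another (a feature step or a fitted model). */
    interface Transformer extends Stage {
        Table transform(Table input);
    }

    /** Needs training; fitting yields a Transformer that acts as the model. */
    interface Estimator extends Stage {
        Transformer fit(Table training);
    }

    /** Multiplies every value by a fixed factor; needs no training. */
    static final class Scaler implements Transformer {
        private final double factor;

        Scaler(double factor) {
            this.factor = factor;
        }

        @Override
        public Table transform(Table input) {
            return new Table(input.values.stream().map(v -> v * factor).collect(Collectors.toList()));
        }
    }

    /** Learns the column mean during fit and subtracts it afterwards. */
    static final class MeanCenterer implements Estimator {
        @Override
        public Transformer fit(Table training) {
            double mean = training.values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            return input -> new Table(input.values.stream().map(v -> v - mean).collect(Collectors.toList()));
        }
    }

    /** Ordered stages; fit() trains each estimator on the output of the stages before it. */
    static final class Pipeline implements Estimator {
        private final List<Stage> stages = new ArrayList<>();

        Pipeline append(Stage stage) {
            stages.add(stage);
            return this;
        }

        @Override
        public Transformer fit(Table training) {
            List<Transformer> fitted = new ArrayList<>();
            Table current = training;
            for (Stage stage : stages) {
                Transformer t = stage instanceof Estimator
                        ? ((Estimator) stage).fit(current)
                        : (Transformer) stage;
                fitted.add(t);
                current = t.transform(current);
            }
            // The fitted pipeline is itself a Transformer that replays every fitted stage.
            return input -> {
                Table out = input;
                for (Transformer t : fitted) {
                    out = t.transform(out);
                }
                return out;
            };
        }
    }

    /** Trains scale-then-center on {1, 2, 3} and applies the fitted model to {4}. */
    static List<Double> demo() {
        Table training = new Table(Arrays.asList(1.0, 2.0, 3.0));
        Transformer model = new Pipeline()
                .append(new Scaler(2.0))
                .append(new MeanCenterer())
                .fit(training);
        return model.transform(new Table(Arrays.asList(4.0))).values;
    }
}
```

      In this sketch the pipeline is itself an estimator, so a whole workflow can be trained and then applied like a single algorithm. Whether the real interfaces are shaped this way is an assumption of the example.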

      Flink currently has a set of ML core interfaces, but they are built on top of the DataSet API. This does not align well with the latest Flink roadmap (the Table API will become the first-class citizen and primary API for analytics use cases, while the DataSet API will be gradually deprecated). Moreover, Flink at present has no interface that allows MLlib users to describe an ML workflow/pipeline, nor does it provide any way to persist a pipeline or model and reuse it in the future. To address these issues, in this FLIP we propose to:

      • Provide a new set of ML core interfaces (on top of the Flink Table API)
      • Provide an ML pipeline interface (on top of the Flink Table API)
      • Provide interfaces for parameter management and pipeline persistence
      • All of the above interfaces should facilitate the implementation of any new ML algorithm. We will gradually add various standard ML algorithms on top of the newly proposed interfaces to verify their feasibility and scalability.
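      As a rough illustration of the parameter management and persistence item, the sketch below stores typed parameters as text and round-trips them through a string. The names `ParamInfo` and `Params` echo the sub-task titles of this issue, but the fields, methods and the `java.util.Properties` storage format used here are assumptions made for the example, not the design proposed in the FLIP.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.function.Function;

/** Illustrative sketch only: hypothetical members, not the interfaces proposed in the FLIP. */
final class ParamsSketch {

    /** Typed parameter descriptor: a name, a default value, and a parser from text. */
    static final class ParamInfo<T> {
        final String name;
        final T defaultValue;
        final Function<String, T> parser;

        ParamInfo(String name, T defaultValue, Function<String, T> parser) {
            this.name = name;
            this.defaultValue = defaultValue;
            this.parser = parser;
        }
    }

    /** Parameter container; values are kept as text so that they can be persisted. */
    static final class Params {
        private final Properties values = new Properties();

        <T> Params set(ParamInfo<T> info, T value) {
            values.setProperty(info.name, String.valueOf(value));
            return this;
        }

        /** Returns the stored value, or the descriptor's default when nothing was set. */
        <T> T get(ParamInfo<T> info) {
            String raw = values.getProperty(info.name);
            return raw == null ? info.defaultValue : info.parser.apply(raw);
        }

        /** Serializes all explicitly set parameters to text (Properties format, an assumption). */
        String save() {
            StringWriter out = new StringWriter();
            try {
                values.store(out, null);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return out.toString();
        }

        /** Rebuilds a container from text produced by save(). */
        static Params load(String text) {
            Params params = new Params();
            try {
                params.values.load(new StringReader(text));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return params;
        }
    }

    // Hypothetical parameters of some iterative algorithm.
    static final ParamInfo<Integer> MAX_ITER = new ParamInfo<>("maxIter", 10, Integer::valueOf);
    static final ParamInfo<Double> LEARNING_RATE = new ParamInfo<>("learningRate", 0.1, Double::valueOf);

    /** Sets one parameter, persists the container to a string, and loads it back. */
    static Params roundTrip() {
        Params original = new Params().set(MAX_ITER, 50);
        return Params.load(original.save());
    }
}
```

      After the round trip, `get(MAX_ITER)` yields the stored value and `get(LEARNING_RATE)` falls back to its default, which is the behaviour a persisted pipeline stage would need in order to be reloaded and reused later.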

          Issue Links

          1. Add the interface of ML pipeline and ML lib | Sub-task | Closed | Luo Gen
          2. Remove the legacy flink-libraries/flink-ml | Sub-task | Closed | Luo Gen
          3. Summarizer: summary statistics for Table | Sub-task | In Progress | Xu Yang
          4. ML common parameters | Sub-task | Closed | Xu Yang
          5. Sparse and dense vector class, and dense matrix class with basic operations. | Sub-task | Resolved | Xu Yang
          6. Add flink-ml-lib module | Sub-task | Closed | Luo Gen
          7. Add more functionalities for ML Params and ParamInfo class | Sub-task | Closed | Xu Yang
          8. Introduce FlinkML import/export framework | Sub-task | Open | Luo Gen
          9. Add the algorithm of Fast Fourier Transformation(FFT) | Sub-task | In Progress | Xu Yang
          10. Add unary loss functions | Sub-task | In Progress | Xu Yang
          11. Add an implementation of pipeline's api | Sub-task | Resolved | Xu Yang
          12. Add the Mapper and related classes for later algorithm implementations. | Sub-task | Resolved | Xu Yang
          13. Add an util class to build result row and generate result schema. | Sub-task | Open | Unassigned
          14. Add two utils for Table transformations. | Sub-task | Open | Unassigned
          15. Add an implement of collector with the row type | Sub-task | Open | Unassigned
          16. Add the utility class for the Table | Sub-task | Open | Unassigned
          17. Add abstract classes for three typical scenarios of (Flat)Mapper. | Sub-task | Open | Unassigned
          18. Add class for BinarizerMapper | Sub-task | Open | Unassigned
          19. Add class for VectorAssemblerMapper | Sub-task | Open | Unassigned
          20. Add class for VectorEleWiseProductMapper | Sub-task | Open | Unassigned
          21. Add class for VectorInteractionMapper | Sub-task | Open | Unassigned
          22. Add class of Vector Normalize Mapper | Sub-task | Open | Unassigned
          23. Add class of Vector Size Hint Mapper | Sub-task | Open | Unassigned
          24. Add class of Vector Slice Mapper | Sub-task | Open | Unassigned
          25. Add class of Vector to Columns mapper | Sub-task | Open | Unassigned
          26. Add Built-in vector types | Sub-task | Open | Unassigned
          27. Add class for PolynomialExpansionMapper | Sub-task | Open | Unassigned
          28. Add class for FeatureHasherMapper. | Sub-task | Open | Unassigned
          29. Add the interface of ModelDataConverter, and several base classes that implement this interface. | Sub-task | Open | Unassigned
          30. Add class for NLPConstant. | Sub-task | Open | Unassigned
          31. Add class for DocHashTFVectorizerMapper. | Sub-task | Open | Unassigned
          32. Add several base classes of summarizer. | Sub-task | Open | Unassigned
          33. Add class for NGramMapper. | Sub-task | Open | Unassigned
          34. Add class for RegexTokenizerMapper. | Sub-task | Open | Unassigned
          35. Add class for TokenizerMapper. | Sub-task | Open | Unassigned
          36. Add summarizer and summary for table. | Sub-task | Open | Unassigned
          37. Add summarizer and summary for sparse vector and dense vector. | Sub-task | Open | Unassigned
          38. Add class for DocHashIDFVectorizerModelMapper. | Sub-task | Open | Unassigned
          39. Add class for DocCountVectorizerMapper. | Sub-task | Open | Unassigned
          40. Add to BLAS a method that performs DenseMatrix and SparseVector multiplication. | Sub-task | Resolved | Unassigned
          41. Add the class for multivariate Gaussian Distribution. | Sub-task | Open | Unassigned
          42. Add a wrapper class of a JSON library to provide the unified json format. | Sub-task | Open | Unassigned
          43. Add an abstract class for mappers with rich model. | Sub-task | Open | Unassigned
          44. Add the model mapper for Gaussian Mixture model. | Sub-task | Open | Unassigned
          45. Add class for SqlOperators, and add sql operations to AlgoOperator, BatchOperator and StreamOperator. | Sub-task | Open | Unassigned

              People

              • Assignee: Shaoxuan Wang
              • Reporter: Shaoxuan Wang
              • Votes: 0
              • Watchers: 23

                  Time Tracking

                  Estimated: 888h
                  Remaining: 887h 10m
                  Logged: 9h 40m