FLINK-12470

FLIP39: Flink ML pipeline and ML libs

Details

    Description

      This is the umbrella Jira for FLIP39, which intends to enhance the scalability and ease of use of Flink ML.

      ML Discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-39-Flink-ML-pipeline-and-ML-libs-td28633.html

      Google Doc (to be converted to an official Confluence page very soon): https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo

      In machine learning, there are mainly two types of people. The first type is MLlib developers, who need a set of standard, well-abstracted core ML APIs to implement algorithms; every ML algorithm is a concrete implementation on top of these APIs. The second type is MLlib users, who use the existing, packaged MLlib to train or serve a model. It is common for an entire training or inference job to be constructed as a sequence of transformations or algorithms, so it is essential to provide a workflow/pipeline API that lets MLlib users easily combine multiple algorithms to describe an ML workflow/pipeline.

      Flink currently has a set of ML core interfaces, but they are built on top of the DataSet API. This does not align with the latest Flink roadmap (the Table API will become the first-class citizen and primary API for analytics use cases, while the DataSet API will be gradually deprecated). Moreover, Flink at present has no interface that allows MLlib users to describe an ML workflow/pipeline, nor does it provide any way to persist a pipeline or model and reuse it in the future. To address these issues, in this FLIP we propose to:

      • Provide a new set of ML core interfaces (on top of the Flink Table API)
      • Provide an ML pipeline interface (on top of the Flink Table API)
      • Provide interfaces for parameter management and pipeline persistence
      • All of the above interfaces should facilitate any new ML algorithm. We will gradually add various standard ML algorithms on top of the newly proposed interfaces to ensure their feasibility and scalability.
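
      The idea of chaining algorithms into a pipeline can be illustrated with a minimal, self-contained sketch. It does not reproduce the interfaces proposed in the FLIP: the names PipelineStage, Transformer, Estimator, Model, and Pipeline, their signatures, and the Table stand-in (a single column of doubles instead of a Flink Table) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PipelineSketch {
    /** Hypothetical stand-in for a Flink Table: one column of doubles. */
    static final class Table {
        final List<Double> values;
        Table(List<Double> values) { this.values = values; }
    }

    /** Marker for anything that can be placed in a pipeline (assumed name). */
    interface PipelineStage {}

    /** Stateless (or already fitted) stage that maps a table to a table. */
    interface Transformer extends PipelineStage {
        Table transform(Table input);
    }

    /** A Transformer produced by training. */
    interface Model extends Transformer {}

    /** Stage that must see training data before it can transform anything. */
    interface Estimator extends PipelineStage {
        Model fit(Table input);
    }

    /** Example transformer: multiplies every value by a fixed factor. */
    static final class Scaler implements Transformer {
        private final double factor;
        Scaler(double factor) { this.factor = factor; }
        public Table transform(Table input) {
            List<Double> out = new ArrayList<>();
            for (double v : input.values) out.add(v * factor);
            return new Table(out);
        }
    }

    /** Example estimator: learns the mean, yields a model that subtracts it. */
    static final class MeanCenterer implements Estimator {
        public Model fit(Table input) {
            double sum = 0;
            for (double v : input.values) sum += v;
            final double mean = input.values.isEmpty() ? 0 : sum / input.values.size();
            return in -> {
                List<Double> out = new ArrayList<>();
                for (double v : in.values) out.add(v - mean);
                return new Table(out);
            };
        }
    }

    /** A pipeline is itself an estimator: fitting it fits each stage in order. */
    static final class Pipeline implements Estimator {
        private final List<PipelineStage> stages = new ArrayList<>();

        Pipeline appendStage(PipelineStage stage) { stages.add(stage); return this; }

        public Model fit(Table input) {
            List<Transformer> fitted = new ArrayList<>();
            Table current = input;
            for (PipelineStage stage : stages) {
                // Estimators are trained on the output of the preceding stages.
                Transformer t = (stage instanceof Estimator)
                        ? ((Estimator) stage).fit(current)
                        : (Transformer) stage;
                fitted.add(t);
                current = t.transform(current);
            }
            // The resulting model replays all fitted stages at inference time.
            return in -> {
                Table out = in;
                for (Transformer t : fitted) out = t.transform(out);
                return out;
            };
        }
    }

    /** Trains on [1, 2, 3] and applies the fitted pipeline to [4]. */
    public static List<Double> demo() {
        Table train = new Table(Arrays.asList(1.0, 2.0, 3.0));
        Model model = new Pipeline()
                .appendStage(new Scaler(2.0))
                .appendStage(new MeanCenterer())
                .fit(train);
        return model.transform(new Table(Arrays.asList(4.0))).values;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [4.0]
    }
}
```

      In this sketch the training data is scaled to [2, 4, 6], so the learned mean is 4; the input 4 is scaled to 8 and then centered to 4. Treating the pipeline itself as an estimator is one possible design choice: it lets a fitted pipeline be handled like any other model.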

        Issue Links

          1. Add the interface of ML pipeline and ML lib | Sub-task | Closed | Luo Gen
          2. Remove the legacy flink-libraries/flink-ml | Sub-task | Closed | Luo Gen
          3. Summarizer: summary statistics for Table | Sub-task | Closed | Unassigned
          4. ML common parameters | Sub-task | Closed | Xu Yang
          5. Sparse and dense vector class, and dense matrix class with basic operations. | Sub-task | Resolved | Xu Yang
          6. Add flink-ml-lib module | Sub-task | Closed | Luo Gen
          7. Add more functionalities for ML Params and ParamInfo class | Sub-task | Closed | Xu Yang
          8. Introduce FlinkML import/export framework | Sub-task | Closed | Unassigned
          9. Add the algorithm of Fast Fourier Transformation(FFT) | Sub-task | Closed | Unassigned
          10. Add unary loss functions | Sub-task | Closed | Unassigned
          11. Add an implementation of pipeline's api | Sub-task | Resolved | Xu Yang
          12. Add the Mapper and related classes for later algorithm implementations. | Sub-task | Resolved | Xu Yang
          13. Add an util class to build result row and generate result schema. | Sub-task | Closed | Unassigned
          14. Add two utils for Table transformations. | Sub-task | Closed | Unassigned
          15. Add an implement of collector with the row type | Sub-task | Closed | Unassigned
          16. Add the utility class for the Table | Sub-task | Closed | Unassigned
          17. Add abstract classes for three typical scenarios of (Flat)Mapper. | Sub-task | Closed | Unassigned
          18. Add class for BinarizerMapper | Sub-task | Closed | Unassigned
          19. Add class for VectorAssemblerMapper | Sub-task | Closed | Unassigned
          20. Add class for VectorEleWiseProductMapper | Sub-task | Closed | Unassigned
          21. Add class for VectorInteractionMapper | Sub-task | Closed | Unassigned
          22. Add class of Vector Normalize Mapper | Sub-task | Closed | Unassigned
          23. Add class of Vector Size Hint Mapper | Sub-task | Closed | Unassigned
          24. Add class of Vector Slice Mapper | Sub-task | Closed | Unassigned
          25. Add class of Vector to Columns mapper | Sub-task | Closed | Unassigned
          26. Add Built-in vector types | Sub-task | Closed | Unassigned
          27. Add class for PolynomialExpansionMapper | Sub-task | Closed | Unassigned
          28. Add class for FeatureHasherMapper. | Sub-task | Closed | Unassigned
          29. Add the interface of ModelDataConverter, and several base classes that implement this interface. | Sub-task | Closed | Unassigned
          30. Add class for NLPConstant. | Sub-task | Closed | Unassigned
          31. Add class for DocHashTFVectorizerMapper. | Sub-task | Closed | Unassigned
          32. Add several base classes of summarizer. | Sub-task | Closed | Unassigned
          33. Add class for NGramMapper. | Sub-task | Closed | Unassigned
          34. Add class for RegexTokenizerMapper. | Sub-task | Closed | Unassigned
          35. Add class for TokenizerMapper. | Sub-task | Closed | Unassigned
          36. Add summarizer and summary for table. | Sub-task | Closed | Unassigned
          37. Add summarizer and summary for sparse vector and dense vector. | Sub-task | Closed | Unassigned
          38. Add class for DocHashIDFVectorizerModelMapper. | Sub-task | Closed | Unassigned
          39. Add class for DocCountVectorizerMapper. | Sub-task | Closed | Unassigned
          40. Add to BLAS a method that performs DenseMatrix and SparseVector multiplication. | Sub-task | Resolved | Unassigned
          41. Add the class for multivariate Gaussian Distribution. | Sub-task | Closed | Unassigned
          42. Add a wrapper class of a JSON library to provide the unified json format. | Sub-task | Closed | Unassigned
          43. Add an abstract class for mappers with rich model. | Sub-task | Closed | Unassigned
          44. Add the model mapper for Gaussian Mixture model. | Sub-task | Closed | Unassigned
          45. Add class for SqlOperators, and add sql operations to AlgoOperator, BatchOperator and StreamOperator. | Sub-task | Closed | Unassigned

            People

              Assignee: Unassigned
              Reporter: Shaoxuan Wang
              Votes: 0
              Watchers: 25

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: 888h
                  Remaining: 887h 10m
                  Logged: 9h 40m