Flink / FLINK-12470

FLIP39: Flink ML pipeline and ML libs


Details

    Description

      This is the umbrella Jira for FLIP39, which intends to enhance the scalability and ease of use of Flink ML.

      ML Discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-39-Flink-ML-pipeline-and-ML-libs-td28633.html

      Google Doc (to be converted to an official Confluence page soon): https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo

      In machine learning, there are mainly two groups of people. The first group is MLlib developers, who need a set of standard, well-abstracted core ML APIs to implement algorithms; every ML algorithm is a concrete implementation on top of these APIs. The second group is MLlib users, who use the existing, packaged MLlib to train or serve a model. It is common for an entire training or inference job to be constructed as a sequence of transformations or algorithms, so it is essential to provide a workflow/pipeline API that lets MLlib users easily combine multiple algorithms to describe an ML workflow/pipeline.
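
      The idea of describing a workflow as a sequence of stages can be illustrated with a minimal sketch. This is not the API proposed in the FLIP: the names Stage, Transformer, Estimator, Pipeline, Scaler and MeanCenterer are hypothetical, and a plain list of double[] rows stands in for a table so the example runs without Flink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: a List<double[]> stands in for a table of numeric rows.
final class PipelineSketch {

    /** Marker for anything that can be placed in a pipeline. */
    interface Stage {}

    /** A stage that maps one table to another. */
    interface Transformer extends Stage {
        List<double[]> transform(List<double[]> input);
    }

    /** A stage that must be trained first; fitting yields a Transformer (the model). */
    interface Estimator extends Stage {
        Transformer fit(List<double[]> training);
    }

    /** An ordered chain of stages. */
    static final class Pipeline {
        private final List<Stage> stages = new ArrayList<>();

        Pipeline appendStage(Stage stage) {
            stages.add(stage);
            return this;
        }

        /** Trains each estimator on the output of the stages that precede it. */
        List<Transformer> fit(List<double[]> training) {
            List<Transformer> fitted = new ArrayList<>();
            List<double[]> current = training;
            for (Stage stage : stages) {
                Transformer t = stage instanceof Estimator
                        ? ((Estimator) stage).fit(current)
                        : (Transformer) stage;
                fitted.add(t);
                current = t.transform(current);
            }
            return fitted;
        }
    }

    /** Runs the fitted stages in order on new data. */
    static List<double[]> applyAll(List<Transformer> fitted, List<double[]> input) {
        List<double[]> current = input;
        for (Transformer t : fitted) {
            current = t.transform(current);
        }
        return current;
    }

    /** Example transformer: multiplies every value by a fixed factor. */
    static final class Scaler implements Transformer {
        private final double factor;

        Scaler(double factor) {
            this.factor = factor;
        }

        public List<double[]> transform(List<double[]> input) {
            List<double[]> out = new ArrayList<>();
            for (double[] row : input) {
                double[] scaled = new double[row.length];
                for (int i = 0; i < row.length; i++) {
                    scaled[i] = row[i] * factor;
                }
                out.add(scaled);
            }
            return out;
        }
    }

    /** Example estimator: learns the mean of column 0, then subtracts it. */
    static final class MeanCenterer implements Estimator {
        public Transformer fit(List<double[]> training) {
            double sum = 0;
            for (double[] row : training) {
                sum += row[0];
            }
            final double mean = training.isEmpty() ? 0 : sum / training.size();
            return input -> {
                List<double[]> out = new ArrayList<>();
                for (double[] row : input) {
                    double[] centered = row.clone();
                    centered[0] -= mean;
                    out.add(centered);
                }
                return out;
            };
        }
    }

    public static void main(String[] args) {
        List<double[]> training = Arrays.asList(new double[] {1.0}, new double[] {3.0});
        List<Transformer> model = new Pipeline()
                .appendStage(new Scaler(2.0))
                .appendStage(new MeanCenterer())
                .fit(training);
        double[] row = applyAll(model, Collections.singletonList(new double[] {5.0})).get(0);
        System.out.println(row[0]); // 5 * 2 = 10, minus the learned mean 4
    }
}
```

      In this sketch the user only lists the stages; the pipeline decides which ones need training and feeds each stage the output of the previous one.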

      Flink currently has a set of core ML interfaces, but they are built on top of the DataSet API. This does not align with the latest Flink roadmap, in which the Table API becomes the first-class citizen and primary API for analytics use cases while the DataSet API is gradually deprecated. Moreover, Flink at present has no interface that allows MLlib users to describe an ML workflow/pipeline, nor any way to persist a pipeline or model and reuse it later. To address these issues, this FLIP proposes to:

      • Provide a new set of ML core interfaces (on top of the Flink Table API)
      • Provide an ML pipeline interface (on top of the Flink Table API)
      • Provide interfaces for parameter management and pipeline persistence
      • Ensure that all of the above interfaces can accommodate any new ML algorithm. We will gradually add various standard ML algorithms on top of the proposed interfaces to validate their feasibility and scalability.
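
      As a rough illustration of what parameter management with persistence could involve, the sketch below stores typed parameters as text and round-trips them through a string. It is an assumption-laden stand-in, not the proposed interface: ParamKey, ParamMap, MAX_ITER and LEARNING_RATE are hypothetical names, and java.util.Properties is used as the storage format purely for convenience.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.function.Function;

// Hypothetical sketch of typed parameters that can be saved and restored.
final class ParamsSketch {

    /** Typed parameter descriptor: name, default value, and a parser from text. */
    static final class ParamKey<T> {
        final String name;
        final T defaultValue;
        final Function<String, T> parser;

        ParamKey(String name, T defaultValue, Function<String, T> parser) {
            this.name = name;
            this.defaultValue = defaultValue;
            this.parser = parser;
        }
    }

    /** Holds parameter values as text so they can be written out and read back. */
    static final class ParamMap {
        private final Properties values = new Properties();

        <T> ParamMap set(ParamKey<T> key, T value) {
            values.setProperty(key.name, String.valueOf(value));
            return this;
        }

        /** Returns the stored value, or the key's default when nothing was set. */
        <T> T get(ParamKey<T> key) {
            String raw = values.getProperty(key.name);
            return raw == null ? key.defaultValue : key.parser.apply(raw);
        }

        /** Serializes all explicitly set parameters to a string. */
        String save() {
            StringWriter out = new StringWriter();
            try {
                values.store(out, null);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return out.toString();
        }

        /** Rebuilds a ParamMap from text produced by save(). */
        static ParamMap load(String text) {
            ParamMap params = new ParamMap();
            try {
                params.values.load(new StringReader(text));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return params;
        }
    }

    static final ParamKey<Integer> MAX_ITER =
            new ParamKey<Integer>("maxIter", 10, s -> Integer.valueOf(s));
    static final ParamKey<Double> LEARNING_RATE =
            new ParamKey<Double>("learningRate", 0.1, s -> Double.valueOf(s));

    public static void main(String[] args) {
        ParamMap params = new ParamMap().set(MAX_ITER, 50);
        ParamMap restored = ParamMap.load(params.save());
        System.out.println(restored.get(MAX_ITER));      // value that was set
        System.out.println(restored.get(LEARNING_RATE)); // falls back to the default
    }
}
```

      Keeping the key typed while storing values as text is one way to let a pipeline's configuration be written out and reloaded without each algorithm defining its own format; persisting a trained model would need more than this sketch covers.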

        Sub-Tasks

        1. Add the interface of ML pipeline and ML lib [Sub-task, Closed, Luo Gen]
        2. Remove the legacy flink-libraries/flink-ml [Sub-task, Closed, Luo Gen]
        3. Summarizer: summary statistics for Table [Sub-task, Closed, Unassigned]
        4. ML common parameters [Sub-task, Closed, Xu Yang]
        5. Sparse and dense vector class, and dense matrix class with basic operations. [Sub-task, Resolved, Xu Yang]
        6. Add flink-ml-lib module [Sub-task, Closed, Luo Gen]
        7. Add more functionalities for ML Params and ParamInfo class [Sub-task, Closed, Xu Yang]
        8. Introduce FlinkML import/export framework [Sub-task, Closed, Unassigned]
        9. Add the algorithm of Fast Fourier Transformation(FFT) [Sub-task, Closed, Unassigned]
        10. Add unary loss functions [Sub-task, Closed, Unassigned]
        11. Add an implementation of pipeline's api [Sub-task, Resolved, Xu Yang]
        12. Add the Mapper and related classes for later algorithm implementations. [Sub-task, Resolved, Xu Yang]
        13. Add an util class to build result row and generate result schema. [Sub-task, Closed, Unassigned]
        14. Add two utils for Table transformations. [Sub-task, Closed, Unassigned]
        15. Add an implement of collector with the row type [Sub-task, Closed, Unassigned]
        16. Add the utility class for the Table [Sub-task, Closed, Unassigned]
        17. Add abstract classes for three typical scenarios of (Flat)Mapper. [Sub-task, Closed, Unassigned]
        18. Add class for BinarizerMapper [Sub-task, Closed, Unassigned]
        19. Add class for VectorAssemblerMapper [Sub-task, Closed, Unassigned]
        20. Add class for VectorEleWiseProductMapper [Sub-task, Closed, Unassigned]
        21. Add class for VectorInteractionMapper [Sub-task, Closed, Unassigned]
        22. Add class of Vector Normalize Mapper [Sub-task, Closed, Unassigned]
        23. Add class of Vector Size Hint Mapper [Sub-task, Closed, Unassigned]
        24. Add class of Vector Slice Mapper [Sub-task, Closed, Unassigned]
        25. Add class of Vector to Columns mapper [Sub-task, Closed, Unassigned]
        26. Add Built-in vector types [Sub-task, Closed, Unassigned]
        27. Add class for PolynomialExpansionMapper [Sub-task, Closed, Unassigned]
        28. Add class for FeatureHasherMapper. [Sub-task, Closed, Unassigned]
        29. Add the interface of ModelDataConverter, and several base classes that implement this interface. [Sub-task, Closed, Unassigned]
        30. Add class for NLPConstant. [Sub-task, Closed, Unassigned]
        31. Add class for DocHashTFVectorizerMapper. [Sub-task, Closed, Unassigned]
        32. Add several base classes of summarizer. [Sub-task, Closed, Unassigned]
        33. Add class for NGramMapper. [Sub-task, Closed, Unassigned]
        34. Add class for RegexTokenizerMapper. [Sub-task, Closed, Unassigned]
        35. Add class for TokenizerMapper. [Sub-task, Closed, Unassigned]
        36. Add summarizer and summary for table. [Sub-task, Closed, Unassigned]
        37. Add summarizer and summary for sparse vector and dense vector. [Sub-task, Closed, Unassigned]
        38. Add class for DocHashIDFVectorizerModelMapper. [Sub-task, Closed, Unassigned]
        39. Add class for DocCountVectorizerMapper. [Sub-task, Closed, Unassigned]
        40. Add to BLAS a method that performs DenseMatrix and SparseVector multiplication. [Sub-task, Resolved, Unassigned]
        41. Add the class for multivariate Gaussian Distribution. [Sub-task, Closed, Unassigned]
        42. Add a wrapper class of a JSON library to provide the unified json format. [Sub-task, Closed, Unassigned]
        43. Add an abstract class for mappers with rich model. [Sub-task, Closed, Unassigned]
        44. Add the model mapper for Gaussian Mixture model. [Sub-task, Closed, Unassigned]
        45. Add class for SqlOperators, and add sql operations to AlgoOperator, BatchOperator and StreamOperator. [Sub-task, Closed, Unassigned]


          People

            Assignee: Unassigned
            Reporter: Shaoxuan Wang
            Votes: 0
            Watchers: 25

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: 888h
                Remaining: 887h 10m
                Logged: 9h 40m
