  1. Spark
  2. SPARK-3507

Create RegressionLearner trait and make some current code implement it

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 1.2.0
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None
    • Target Version/s:

      Description

      Here at Yandex, while implementing gradient boosting in Spark and building our ML tool for internal use, we found the following serious problems in MLlib:

      • There is no Regression/Classification learner abstraction. We were building abstract data processing pipelines that should work with any regression algorithm, with the exact algorithm specified outside the pipeline code. No existing abstraction allows that. (This is the main reason for all the problems that follow.)
      • There is no common practice in MLlib for testing algorithms: every model generates its own random test data. There are no easily extractable test cases applicable to another algorithm, and no benchmarks for comparing algorithms. After implementing a new algorithm, it is very hard to understand how it should be tested.
      • Serialization testing is lacking: MLlib algorithms have no tests verifying that a model still works after serialization.
      • When implementing a new algorithm, it is hard to understand what API to create and which interface to implement.
      Solving all these problems must start with creating common interfaces for the typical kinds of algorithms/models: regression, classification, clustering, and collaborative filtering.
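
      To make the idea concrete, here is a minimal, regression-only sketch of such an interface. Apart from the name RegressionLearner, which comes from the issue title, everything in it is hypothetical and not existing MLlib API: Example stands in for MLlib's LabeledPoint, a plain Seq stands in for an RDD so the sketch has no Spark dependency, and MeanLearner is a toy algorithm that exists only to exercise the interface.

```scala
object RegressionLearnerSketch {
  // One labelled example; a stand-in for MLlib's LabeledPoint (assumption).
  final case class Example(label: Double, features: Vector[Double])

  // A trained model: maps a feature vector to a predicted value.
  trait RegressionModel extends Serializable {
    def predict(features: Vector[Double]): Double
  }

  // The proposed abstraction: anything that can be trained into a RegressionModel.
  // Seq is used instead of RDD only to keep the sketch dependency-free.
  trait RegressionLearner {
    def train(data: Seq[Example]): RegressionModel
  }

  // Toy model and learner (hypothetical): always predict the mean training label.
  final case class ConstantModel(value: Double) extends RegressionModel {
    def predict(features: Vector[Double]): Double = value
  }

  object MeanLearner extends RegressionLearner {
    def train(data: Seq[Example]): RegressionModel = {
      require(data.nonEmpty, "training data must not be empty")
      ConstantModel(data.map(_.label).sum / data.size)
    }
  }

  // Pipeline code that depends only on the abstraction, not on a concrete algorithm.
  def meanSquaredError(
      learner: RegressionLearner,
      training: Seq[Example],
      holdout: Seq[Example]): Double = {
    require(holdout.nonEmpty, "holdout data must not be empty")
    val model = learner.train(training)
    val squaredErrors = holdout.map { e =>
      val err = model.predict(e.features) - e.label
      err * err
    }
    squaredErrors.sum / holdout.size
  }
}
```

      With an interface of this shape, meanSquaredError can be reused for any algorithm, and the concrete learner is chosen by the caller rather than hard-coded in the pipeline.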

      All main tests should be written against these interfaces, so that a newly implemented algorithm only has to pass the already written tests. This would let us maintain manageable quality across the whole library.
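
      One possible shape for such shared tests is sketched below, under the same assumptions as any dependency-free example: the traits, the fixed data set linearCase, the check function and the SimpleLeastSquares learner are all hypothetical names, not MLlib code. The point is only that the test case is deterministic and written once against the interface, so any learner can be passed in.

```scala
object SharedRegressionChecks {
  // Hypothetical stand-ins for the proposed abstractions.
  final case class Example(label: Double, features: Vector[Double])
  trait RegressionModel extends Serializable {
    def predict(features: Vector[Double]): Double
  }
  trait RegressionLearner {
    def train(data: Seq[Example]): RegressionModel
  }

  // Shared, deterministic test case: noiseless points on y = 2x + 1.
  val linearCase: Seq[Example] = (0 until 20).map { i =>
    val x = i.toDouble
    Example(2.0 * x + 1.0, Vector(x))
  }

  // Runs the shared checks against any learner.
  // Returns descriptions of the failed checks; an empty result means "passed".
  def check(learner: RegressionLearner, maxMse: Double): Seq[String] = {
    val model = learner.train(linearCase)
    val predictions = linearCase.map(e => model.predict(e.features))
    val mse = linearCase.zip(predictions).map { case (e, p) =>
      (p - e.label) * (p - e.label)
    }.sum / linearCase.size
    Seq(
      if (predictions.exists(p => p.isNaN || p.isInfinite)) Some("non-finite prediction") else None,
      if (mse > maxMse) Some(s"training MSE $mse exceeds $maxMse") else None
    ).flatten
  }

  // Example learner (hypothetical): closed-form least squares on the first feature.
  final case class LineModel(slope: Double, intercept: Double) extends RegressionModel {
    def predict(features: Vector[Double]): Double = slope * features.head + intercept
  }

  object SimpleLeastSquares extends RegressionLearner {
    def train(data: Seq[Example]): RegressionModel = {
      require(data.nonEmpty, "training data must not be empty")
      val xs = data.map(_.features.head)
      val ys = data.map(_.label)
      val meanX = xs.sum / xs.size
      val meanY = ys.sum / ys.size
      val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
      val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
      val slope = if (sxx == 0.0) 0.0 else sxy / sxx
      LineModel(slope, meanY - slope * meanX)
    }
  }
}
```

      A new algorithm would then be validated with a call such as SharedRegressionChecks.check(SimpleLeastSquares, 1e-9), instead of generating its own random test data.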

      There should be a couple of benchmarks that give a new Spark user a feel for which algorithm to use.

      The test set for these abstractions should include a serialization test. In production there is rarely any need for a model that can't be stored.
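
      A serialization test of this kind could look roughly like the following sketch. It assumes plain Java serialization (java.io.ObjectOutputStream and ObjectInputStream) as the storage mechanism, which the issue does not specify; RegressionModel, LinearModel, roundTrip and survivesSerialization are hypothetical names used for illustration only.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object SerializationCheck {
  // Hypothetical stand-in for a common model interface.
  trait RegressionModel extends Serializable {
    def predict(features: Vector[Double]): Double
  }

  // Hypothetical model used to exercise the check: a weighted sum plus intercept.
  final case class LinearModel(weights: Vector[Double], intercept: Double)
      extends RegressionModel {
    def predict(features: Vector[Double]): Double =
      weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  }

  // Serializes a value to an in-memory byte array and reads it back.
  def roundTrip[T <: Serializable](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // A model passes if the restored copy gives identical predictions on all probes.
  def survivesSerialization(model: RegressionModel, probes: Seq[Vector[Double]]): Boolean = {
    val restored = roundTrip(model)
    probes.forall(p => restored.predict(p) == model.predict(p))
  }
}
```

      Because survivesSerialization takes only the model interface, the same check can be applied to every model that implements it.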

      As the first step of this roadmap, I'd like to create a RegressionLearner trait, add methods to the current algorithms so that they implement this trait, and create some tests against it.

              People

              • Assignee: Egor Pakhomov (epakhomov)
              • Reporter: Egor Pakhomov (epakhomov)
              • Votes: 0
              • Watchers: 4

                  Time Tracking

                  • Estimated: 168h
                  • Remaining: 168h
                  • Logged: Not Specified