[SPARK-3507] Create RegressionLearner trait and make some current code implement it - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Duplicate
Affects Version/s: 1.2.0
Fix Version/s: None
Component/s: MLlib
Labels: None
Target Version/s: 1.2.0

Description

Here at Yandex, while implementing gradient boosting in Spark and building an ML tool for internal use, we found the following serious problems in MLlib:

There is no Regression/Classification learner or model abstraction. We were building abstract data-processing pipelines that should work with any regression algorithm, with the exact algorithm specified outside the pipeline code. No existing abstraction allows that. (This is the main cause of all the problems below.)
There is no common practice in MLlib for testing algorithms: every model generates its own random test data. There are no easily extractable test cases that can be applied to another algorithm, and there are no benchmarks for comparing algorithms. After implementing a new algorithm, it is very hard to understand how it should be tested.
Serialization testing is lacking: MLlib algorithms have no tests that check that a model still works after serialization.
When implementing a new algorithm, it is hard to understand what API to create and which interface to implement.
Solving all these problems must start with creating common interfaces for the typical algorithms/models: regression, classification, clustering, and collaborative filtering.

All main tests should be written against these interfaces, so that a newly implemented algorithm only has to pass the already-written tests. This would let us maintain manageable quality across the whole library.

There should be a couple of benchmarks that give a new Spark user a feel for which algorithm to use.

The test set against these abstractions should contain a serialization test. In production, there is rarely any need for a model that can't be stored.
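To make the serialization requirement concrete, here is a minimal sketch of a round-trip check written against an abstract model type. Everything in it is an assumption for illustration: `Model`, `LinearModel`, `roundTrip`, and `predictionsSurviveSerialization` are hypothetical names, not MLlib classes, and plain Java serialization stands in for whatever storage mechanism a real test would exercise.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object SerializationCheckSketch {
  // Hypothetical minimal model interface (not an MLlib type).
  trait Model extends java.io.Serializable {
    def predict(features: Array[Double]): Double
  }

  // Toy linear model used only to exercise the check.
  final class LinearModel(val weights: Array[Double], val intercept: Double) extends Model {
    def predict(features: Array[Double]): Double =
      weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  }

  // Serialize a value to an in-memory buffer and read it back.
  def roundTrip[T <: java.io.Serializable](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // Generic check: a restored model must give the same predictions as the original.
  def predictionsSurviveSerialization(model: Model, inputs: Seq[Array[Double]]): Boolean = {
    val restored = roundTrip(model)
    inputs.forall(x => model.predict(x) == restored.predict(x))
  }
}
```

Because the check only depends on the abstract `Model` type, the same function could be reused for every algorithm that implements the interface.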

As the first step of this roadmap, I'd like to create a RegressionLearner trait, add methods to the current algorithms so that they implement this trait, and create some tests against it.
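A rough sketch of what such a trait and a test written against it might look like follows. The ticket does not specify signatures, so all of them are assumptions: `LabeledExample`, `RegressionModel`, `ConstantModel`, `MeanLearner`, and `meanSquaredError` are hypothetical, and plain Scala collections stand in for the distributed dataset type an MLlib version would take, so that the example runs without Spark.

```scala
object RegressionLearnerSketch {
  // Stand-in for a labeled training point (hypothetical, not MLlib's LabeledPoint).
  final case class LabeledExample(label: Double, features: Vector[Double])

  // A trained model: maps a feature vector to a predicted value.
  trait RegressionModel extends java.io.Serializable {
    def predict(features: Vector[Double]): Double
  }

  // The learner abstraction: anything that can turn data into a RegressionModel.
  trait RegressionLearner {
    def train(data: Seq[LabeledExample]): RegressionModel
  }

  // Model that ignores its input and always returns the same value.
  final case class ConstantModel(value: Double) extends RegressionModel {
    def predict(features: Vector[Double]): Double = value
  }

  // Baseline learner: predicts the mean of the training labels.
  object MeanLearner extends RegressionLearner {
    def train(data: Seq[LabeledExample]): RegressionModel = {
      require(data.nonEmpty, "training data must not be empty")
      ConstantModel(data.map(_.label).sum / data.size)
    }
  }

  // Pipeline or test code depends only on the trait; the concrete
  // algorithm is chosen by the caller.
  def meanSquaredError(
      learner: RegressionLearner,
      train: Seq[LabeledExample],
      test: Seq[LabeledExample]): Double = {
    require(test.nonEmpty, "test data must not be empty")
    val model = learner.train(train)
    test.map { e =>
      val error = model.predict(e.features) - e.label
      error * error
    }.sum / test.size
  }
}
```

A shared test could then call `meanSquaredError` with each learner on a fixed data set and compare the result to a per-data-set threshold, instead of every algorithm generating its own random test data.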

Issue Links

relates to

SPARK-3702 Standardize MLlib classes for learners, models

Closed

links to

[Github] Pull Request #2371 (epahomov)

People

Assignee: Egor Pakhomov

Reporter: Egor Pakhomov

Votes: 0

Watchers: 4

Dates

Due: 20/Sep/14

Created: 12/Sep/14 12:39

Updated: 09/Oct/14 10:54

Resolved: 09/Oct/14 10:54

Time Tracking

Estimated: 168h

Remaining: 168h

Logged: Not Specified