Here at Yandex, while implementing gradient boosting on Spark and building our internal ML tool, we ran into the following serious problems in MLlib:
There is no regression/classification learner-model abstraction. We were building abstract data-processing pipelines that should work with just "some regression", with the exact algorithm specified outside the pipeline code. There is no abstraction that allows us to do that. (This is the root cause of all the problems below.)
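To make the gap concrete, here is a minimal sketch of pipeline code written against such an abstraction, with the concrete algorithm injected by the caller. All names here are hypothetical illustrations, not existing MLlib API:

```scala
// Hypothetical sketch: these traits do NOT exist in MLlib today.
case class LabeledPoint(label: Double, features: Array[Double])

trait RegressionModel {
  def predict(features: Array[Double]): Double
}

trait RegressionLearner {
  def train(data: Seq[LabeledPoint]): RegressionModel
}

// Pipeline code that needs only "some regression": the exact algorithm
// is supplied by the caller instead of being hard-wired here.
def evaluateRmse(learner: RegressionLearner,
                 trainSet: Seq[LabeledPoint],
                 testSet: Seq[LabeledPoint]): Double = {
  val model = learner.train(trainSet)
  val squaredErrors =
    testSet.map(p => math.pow(model.predict(p.features) - p.label, 2))
  math.sqrt(squaredErrors.sum / testSet.size)
}

// Toy learner used only to exercise the pipeline: predicts the mean label.
object MeanLearner extends RegressionLearner {
  def train(data: Seq[LabeledPoint]): RegressionModel = {
    val mean = data.map(_.label).sum / data.size
    new RegressionModel {
      def predict(features: Array[Double]): Double = mean
    }
  }
}
```

Any learner implementing the trait, from linear regression to gradient boosting, could then be dropped into `evaluateRmse` unchanged.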
There is no common testing practice across MLlib: every model generates its own random test data, there are no easily extractable test cases applicable to other algorithms, and there are no benchmarks for comparing algorithms. After implementing a new algorithm, it is very hard to understand how it should be tested.
Lack of serialization testing: MLlib algorithms have no tests verifying that a model still works after serialization.
When implementing a new algorithm, it is hard to understand what API you should expose and which interfaces to implement.
The starting point for solving all these problems is to create common interfaces for the typical families of algorithms/models: regression, classification, clustering, and collaborative filtering.
All the main tests should be written against these interfaces, so that when a new algorithm is implemented, all it has to do is pass the already-written tests. This gives us manageable, uniform quality across the whole library.
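As a sketch of what such "already written" tests could look like (the trait names are again hypothetical): a check written once against the interface and reusable by every implementation.

```scala
// Hypothetical shared interface; not existing MLlib API.
case class LabeledPoint(label: Double, features: Array[Double])

trait RegressionModel {
  def predict(features: Array[Double]): Double
}

trait RegressionLearner {
  def train(data: Seq[LabeledPoint]): RegressionModel
}

// A generic test written once against the trait: on exactly linear data,
// the learner must reach the accuracy it declares via `tolerance`.
def fitsLinearData(learner: RegressionLearner, tolerance: Double): Boolean = {
  val data = (0 until 100).map { i =>
    val x = i.toDouble / 100
    LabeledPoint(3.0 * x + 1.0, Array(x))
  }
  val model = learner.train(data)
  data.forall(p => math.abs(model.predict(p.features) - p.label) <= tolerance)
}
```

A new algorithm would only plug itself into checks like this instead of inventing its own random data generator.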
There should be a couple of benchmarks that give a new Spark user a feel for which algorithm to use.
The test suite for these abstractions should include a serialization test. In production, a model that cannot be stored is rarely of any use.
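A generic round-trip check against the model abstraction could look like this sketch (hypothetical trait; plain Java serialization is used purely for illustration):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical model trait, marked Serializable so models can be stored.
trait RegressionModel extends Serializable {
  def predict(features: Array[Double]): Double
}

// Example model with state that must survive the round trip.
case class ConstantModel(value: Double) extends RegressionModel {
  def predict(features: Array[Double]): Double = value
}

// Generic serialization test: write the model to bytes, read it back,
// and verify predictions are unchanged.
def survivesSerialization(model: RegressionModel, probe: Array[Double]): Boolean = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(model)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
  val restored = in.readObject().asInstanceOf[RegressionModel]
  restored.predict(probe) == model.predict(probe)
}
```

Because the check only touches the trait, every model implementation gets serialization coverage for free.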
As the first step of this roadmap, I'd like to create a RegressionLearner trait, add methods to the current algorithms so they implement this trait, and write some tests against it.
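That first step might look like the sketch below. Everything here is illustrative: `SimpleLinearRegression` is a stand-in for an existing concrete algorithm with its own ad-hoc API, and the "added methods" take the form of a thin adapter implementing the shared trait.

```scala
// Hypothetical common trait from the proposal.
case class LabeledPoint(label: Double, features: Array[Double])

trait RegressionModel {
  def predict(features: Array[Double]): Double
}

trait RegressionLearner {
  def train(data: Seq[LabeledPoint]): RegressionModel
}

// Stand-in for an existing algorithm: ordinary least squares on one feature.
class SimpleLinearRegression {
  def run(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    val mx = xs.sum / xs.size
    val my = ys.sum / ys.size
    val slope = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum /
      xs.map(x => (x - mx) * (x - mx)).sum
    (slope, my - slope * mx) // (slope, intercept)
  }
}

// The "added methods": an adapter exposing the algorithm through the trait,
// so generic pipelines and shared tests can use it unchanged.
object SimpleLinearLearner extends RegressionLearner {
  def train(data: Seq[LabeledPoint]): RegressionModel = {
    val (slope, intercept) =
      new SimpleLinearRegression().run(data.map(_.features(0)), data.map(_.label))
    new RegressionModel {
      def predict(features: Array[Double]): Double = slope * features(0) + intercept
    }
  }
}
```

The existing algorithm keeps its native API; the adapter is all that is needed for it to participate in the shared test suite.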