SPARK-10324: MLlib 1.6 Roadmap



    • Type: Umbrella
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: ML, MLlib
    • Labels: None


      Following SPARK-8445, we created this master list of MLlib features we plan to have in Spark 1.6. Please treat it as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features on this list will get higher priority during code review. Feel free to suggest new items for the list in the comments. We are experimenting with this process, and your feedback would be greatly appreciated.


      For contributors:

      • Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important.
      • If you are a first-time Spark contributor, please always start with a starter task rather than a medium/big feature. In our experience, learning the development process while working on a big feature usually causes long delays in code review.
      • Never work silently. Let everyone know on the corresponding JIRA page when you start working on a feature. This avoids duplicate work. For small features, you don't need to wait for the JIRA to be assigned to you.
      • For medium/big features or features with dependencies, please get the JIRA assigned to you before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors.
      • Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another.
      • Remember to add the `@Since("1.6.0")` annotation to new public APIs.
      • Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours.

      For committers:

      • Try to break down big features into small and specific JIRA tasks and link them properly.
      • Add the "starter" label to starter tasks.
      • Provide a rough estimate for medium/big features and track their progress.
      • If you start reviewing a PR, please add yourself to the Shepherd field on JIRA.
      • If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass.
      • After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary.

      Roadmap (WIP)

      This is NOT a complete list of MLlib JIRAs for 1.6. We only include umbrella JIRAs and high-level tasks.

      Algorithms and performance


      Pipeline API

      Model persistence

      Data sources

      Python API for ML

      The main goal of the Python API work is feature parity with the Scala/Java API. You can find a complete list here. The tasks fall into two major categories:

      • Python API for new algorithms
      • Python API for missing methods (some are listed in SPARK-10022 and SPARK-9663)

      SparkR API for ML

      Documentation


      • Reorganize the user guide (SPARK-8517)
      • Add @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751)
      • Automatically test example code in the user guide (SPARK-11337)
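
      To illustrate the @Since item above on the Python side, here is a minimal sketch of a version-tagging decorator that appends a Sphinx `versionadded` note to a method's docstring. The `since` function below is a stand-in written for this example, not an import from PySpark, and `HypotheticalScaler` and `set_with_mean` are made-up names used only for illustration.

```python
import re


def since(version):
    """Return a decorator that appends a ``versionadded`` note to a docstring.

    Stand-in written for illustration; it is not imported from PySpark.
    """
    indent_pattern = re.compile(r"\n( +)")

    def decorator(func):
        doc = func.__doc__ or ""
        # Reuse the smallest indentation found inside the docstring, if any,
        # so the appended note lines up with the existing text.
        indents = indent_pattern.findall(doc)
        indent = " " * (min(len(i) for i in indents) if indents else 0)
        func.__doc__ = doc.rstrip() + "\n\n%s.. versionadded:: %s" % (indent, version)
        return func

    return decorator


class HypotheticalScaler(object):
    """Hypothetical transformer used only to show the decorator."""

    @since("1.6.0")
    def set_with_mean(self, value):
        """Set whether to center the data before scaling."""
        self.with_mean = bool(value)
        return self
```

      With this sketch, `HypotheticalScaler.set_with_mean.__doc__` ends with `.. versionadded:: 1.6.0`, so generated API docs can show when the method first appeared without changing its behavior.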


              mengxr Xiangrui Meng