Details

    • Type: Brainstorming
    • Status: Closed
    • Priority: Minor
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML
    • Labels:

      Description

      I want to use this JIRA to collect some GSoC project ideas for MLlib. Ideally, the student should have already contributed to Spark, and the project should be divisible into small functional pieces so that it won't stall if the mentor is temporarily unavailable.

          Activity

          mlnick Nick Pentreath added a comment -

          Do we want to focus on work within core, or also encourage projects that build on core / ML and reside in Spark packages?

          vectorijk Kai Jiang added a comment -

          Based on previous projects and ideas on this list, I think most projects work within their corresponding project's codebase. Since the GSoC project is still under Apache, it is better for us to focus on work within the Spark codebase.

          vectorijk Kai Jiang added a comment -

          Here is the post I sent to the dev mailing list (pasted here):

          Hi All Spark Devs,

          I am Kai Jiang, a master's student majoring in Computer Science. My interests are machine learning
          and distributed systems, which is why I've been contributing to the Spark codebase since last year. My
          pull requests relate to MLlib, PySpark and SQL (https://github.com/apache/spark/pulls/vectorijk).

          Last time, I was impressed by MechCoder's project mentored by mengxr. This year, I look forward
          to having a chance to do something interesting and want to extend my future contributions to Spark
          into a GSoC project. Thus, I was wondering whether there are specific ideas, issues or suggestions
          regarding MLlib (mainly), SQL or other components that could be gathered into a project. After looking
          into the MLlib 2.0 Roadmap, I found many issues I am interested in (e.g. Python/SparkR API for ML,
          PMML export). If the community has other ideas, I am very willing to work on some issues before GSoC.

          I will post a link to my very rough draft proposal here later.

          mengxr Xiangrui Meng added a comment -

          Yes, the features should be delivered to the Spark codebase.

          mengxr Xiangrui Meng added a comment -

          We can roughly discuss the theme before working on a proposal. For example, in SPARK-6192 we decided to enhance MLlib's Python API and it went well. You can find Manoj Kumar's GSoC proposal here: http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/manojkumar/5654792596619264. ML features in SparkR could be a good theme, along with continuing the work on Python APIs.

          vectorijk Kai Jiang added a comment -

          Xiangrui Meng Thanks for mentioning that! I went through Kumar's proposal and SPARK-6192 carefully, and I am really interested in your idea of continuing work on MLlib's Python API along with SparkR.
          Should I look for Python/R-related tickets on JIRA? Or do you have some specific ideas or issues in mind for SparkR and MLlib's Python API?

          Show
          vectorijk Kai Jiang added a comment - Xiangrui Meng Thanks for mentioning that! I went through Kumar's Proposal and SPARK-6192 carefully. And I am really interested in your idea of keeping work on MLlib's Python API along with SparkR. Should I try to find out Python/R related tickets on JIRA? Or do you have some specific ideas or issues about SparkR and MLlib's Python API?
          Hide
          josephkb Joseph K. Bradley added a comment -

          +1 for expanding the Python and R APIs. Manoj Kumar's work was great last year, and the expanded APIs help a lot of users.

          Kai Jiang I'd recommend doing some searching for the ML + PySpark component tags: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(PySpark)
          as well as ML + SparkR tags: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(SparkR)

          For Python, I'd recommend finding some major ones, but also listing one general item in the proposal: general Python API coverage.
          For R, there are lots of missing items, so I'd recommend picking the most important models which are missing.
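
The two search links above are URL-encoded JQL queries that differ only in the final component name. As a minimal sketch using only the Python standard library, they can be rebuilt and decoded into readable JQL. The helper names `search_url` and `decode_jql` are illustrative and not part of the original comment.

```python
from urllib.parse import urlsplit, parse_qs

# Base URL and encoded query copied from the two links above; only the
# last component name differs between them.
BASE = "https://issues.apache.org/jira/issues/"
ENCODED_JQL = (
    "project%20%3D%20SPARK%20AND%20status%20in%20"
    "(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20"
    "component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20({component})"
)


def search_url(component: str) -> str:
    """Build the search link for ML/MLlib issues that also carry `component`."""
    return f"{BASE}?jql={ENCODED_JQL.format(component=component)}"


def decode_jql(url: str) -> str:
    """Return the human-readable JQL string embedded in a search link."""
    query = urlsplit(url).query
    return parse_qs(query)["jql"][0]


if __name__ == "__main__":
    for component in ("PySpark", "SparkR"):
        print(decode_jql(search_url(component)))
```

Decoded, each query restricts results to open, in-progress, or reopened SPARK issues tagged with ML or MLlib plus the chosen language component.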

          vectorijk Kai Jiang added a comment -

          Joseph K. Bradley Thanks for your explanation! It seems there are lots of missing models in SparkR. I created a Google Doc (link) and put some ideas into it. Would you mind commenting on whether those ideas are suitable for a GSoC project? cc Xiangrui Meng Nick Pentreath

          josephkb Joseph K. Bradley added a comment -

          Closing this now that the GSoC 2016 proposal and acceptance process is done.


            People

            • Assignee:
              mengxr Xiangrui Meng
              Reporter:
              mengxr Xiangrui Meng
            • Votes:
              0
              Watchers:
              6
