Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: ML
    • Labels:
      None
    • Target Version/s:

      Description

      Provide a DataFrame-based API for the SVM algorithm. I would recommend using OWL-QN rather than wrapping spark.mllib's SGD-based implementation.

      The API should mimic existing spark.ml.classification APIs.

          Activity

          yuhaoyan yuhao yang added a comment - - edited

          I'll start on this and put together a quick prototype first. If time allows, I think we should also try SMO.

          yuhaoyan yuhao yang added a comment -

          I put the prototype at https://github.com/hhbyyh/spark/blob/mlsvm/mllib/src/main/scala/org/apache/spark/ml/classification/SVM.scala. It's just a simple version using OWL-QN and the hinge-loss gradient.

          I plan to implement another version with parallel SMO before sending a pull request.
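
For readers unfamiliar with the hinge-loss gradient mentioned in this comment, the following is a minimal sketch in plain Python. It is not taken from the linked prototype. The function name `hinge_loss_and_gradient`, the absence of an intercept and of regularization, and the {-1, +1} label encoding are all assumptions made for illustration. A quasi-Newton optimizer such as OWL-QN would call a function of this shape repeatedly to obtain the loss and a (sub)gradient.

```python
from typing import List, Sequence, Tuple


def hinge_loss_and_gradient(
    weights: Sequence[float],
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> Tuple[float, List[float]]:
    """Return the mean hinge loss and one subgradient with respect to weights.

    Assumes labels are encoded as -1 or +1 and that there is no intercept term.
    """
    n = len(features)
    loss = 0.0
    grad = [0.0] * len(weights)
    for x, y in zip(features, labels):
        # Signed margin y * (w . x); the example is penalized only below 1.
        margin = y * sum(w * xi for w, xi in zip(weights, x))
        if margin < 1.0:
            loss += 1.0 - margin
            # Subgradient of max(0, 1 - y * w.x) is -y * x when margin < 1.
            for j, xi in enumerate(x):
                grad[j] -= y * xi
    return loss / n, [g / n for g in grad]
```

Examples with a margin of at least 1 contribute nothing to either the loss or the gradient, so only margin violators influence the update.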

          yuhaoyan yuhao yang added a comment -

          I put the SMO version at https://github.com/hhbyyh/SVMOnSpark.
          It should scale very well, since it avoids shuffles and unnecessary communication. It also supports arbitrary kernels; linear and RBF kernels are currently built in.
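
To illustrate what supporting arbitrary kernels can look like, here is a small sketch in plain Python. It is not code from the SVMOnSpark repository. The names `Kernel`, `linear_kernel`, `rbf_kernel`, and `gram_matrix`, and the default `gamma` value, are hypothetical. The idea is that the solver only needs a function mapping two vectors to a similarity score, so any kernel with that signature can be plugged in.

```python
import math
from typing import Callable, List, Sequence

# A kernel is any function that maps two vectors to a similarity score.
Kernel = Callable[[Sequence[float], Sequence[float]], float]


def linear_kernel(x: Sequence[float], z: Sequence[float]) -> float:
    """Plain dot product of x and z."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float = 1.0) -> float:
    """Gaussian RBF kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def gram_matrix(points: Sequence[Sequence[float]], kernel: Kernel) -> List[List[float]]:
    """Evaluate the kernel on every pair of points."""
    return [[kernel(a, b) for b in points] for a in points]
```

A call such as `gram_matrix(points, rbf_kernel)` works without changes to `gram_matrix`; a different `gamma` can be bound with `functools.partial` or a lambda.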

          mlnick Nick Pentreath added a comment -

          It would be great to get the list of references for the SMO implementation, as well as some concrete performance numbers (e.g. vs. the other models in Spark).

          yanboliang Yanbo Liang added a comment - - edited

          yuhao yang, any update on this? I think providing a DataFrame-based SVM algorithm is very important to users, so it would be good to get it in ASAP. I'd like to get the implementation with OWL-QN and hinge loss in first, and discuss the SMO version later. As Nick Pentreath said, it would be better to have more performance numbers and use cases for the SMO implementation. It also won't be hard to add a new internal implementation once we have the basic SVM API. I saw that you already have an implementation with OWL-QN and hinge loss; could you send the PR? If you are busy with other things, I can help, and you would still be the primary author of the PR. Thanks!

          yuhaoyan yuhao yang added a comment -

          Thanks Yanbo Liang for picking this up. I'll try to send a PR tomorrow and we can work together on it. Thanks.

          JunKim Tae Jun Kim added a comment -

          Keep it up, guys! I'm looking forward to the DataFrame-based SVM.
          Thanks, as always, for the implementations!

          apachespark Apache Spark added a comment -

          User 'hhbyyh' has created a pull request for this issue:
          https://github.com/apache/spark/pull/15211

          josephkb Joseph K. Bradley added a comment -

          Marking myself as shepherd per the 2.2 roadmap process, but others can take this over if they like.

          josephkb Joseph K. Bradley added a comment -

          Issue resolved by pull request 15211
          https://github.com/apache/spark/pull/15211

          felixcheung Felix Cheung added a comment - - edited

          Joseph K. Bradley, should we add a SparkR API as one of the follow-up tasks? (I could shepherd that.)


            People

            • Assignee:
              yuhaoyan yuhao yang
            • Reporter:
              josephkb Joseph K. Bradley
            • Shepherd:
              Joseph K. Bradley
            • Votes:
              1
            • Watchers:
              11
