Details

    • Type: New Feature
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None

      Description

      MLlib's local linear algebra package doesn't support any type of matrix operations. With 1.5, we wish to add a complete package of optimized linear algebra operations for Scala/Java users.

      The main goal is to support lazy operations, so that chained element-wise operations can be evaluated in a single for-loop and complex operations can be delegated to BLAS.
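
For illustration only (this is not taken from the design doc): a minimal Java sketch of how lazy evaluation lets a chain of element-wise operations run in a single for-loop. The names `VecExpr`, `of`, `plus`, `times`, `scale`, and `eval` are hypothetical and are not MLlib APIs. Each operation only builds a small expression node, and no intermediate array is allocated until `eval()` is called.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

/** Hypothetical lazy vector expression: nothing is computed until eval(). */
interface VecExpr {
    int size();

    /** Value of element i, computed on demand from the expression tree. */
    double at(int i);

    default VecExpr plus(VecExpr other)  { return zip(other, (a, b) -> a + b); }

    default VecExpr times(VecExpr other) { return zip(other, (a, b) -> a * b); }

    default VecExpr scale(double alpha) {
        VecExpr self = this;
        return new VecExpr() {
            public int size() { return self.size(); }
            public double at(int i) { return alpha * self.at(i); }
        };
    }

    /** Combine two expressions element by element without materializing either. */
    default VecExpr zip(VecExpr other, DoubleBinaryOperator op) {
        if (size() != other.size()) {
            throw new IllegalArgumentException("size mismatch");
        }
        VecExpr self = this;
        return new VecExpr() {
            public int size() { return self.size(); }
            public double at(int i) { return op.applyAsDouble(self.at(i), other.at(i)); }
        };
    }

    /** Materialize the whole expression in one pass: the single for-loop. */
    default double[] eval() {
        double[] out = new double[size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = at(i);
        }
        return out;
    }

    /** Wrap a plain array as a leaf expression. */
    static VecExpr of(double... values) {
        return new VecExpr() {
            public int size() { return values.length; }
            public double at(int i) { return values[i]; }
        };
    }
}

final class LazyVectorDemo {
    public static void main(String[] args) {
        VecExpr x = VecExpr.of(1.0, 2.0, 3.0);
        VecExpr y = VecExpr.of(4.0, 5.0, 6.0);
        // (2 * x + y) * x, evaluated in one loop with no temporaries.
        double[] r = x.scale(2.0).plus(y).times(x).eval();
        System.out.println(Arrays.toString(r)); // prints [6.0, 18.0, 36.0]
    }
}
```

An eager implementation of the same expression would allocate one temporary array per operation. Whether the virtual calls in `at` cost more than the saved allocations is something a real implementation would have to benchmark.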

      The design doc: http://goo.gl/sf5LCE

        Issue Links

          Activity

          srowen Sean Owen added a comment -

          Xiangrui Meng, regarding Commons Math and point #2: they actually decided to un-deprecate the sparse implementations from 3.3 onwards and keep supporting them: http://commons.apache.org/proper/commons-math/changes-report.html I think it's a good option.

          But I'm also not sure why Spark has to decide this for users. Spark can do whatever it likes internally; apps can do whatever they like externally; both can and should use a library. From an API perspective, all that's needed is a representation of the data that thunks easily into other libraries, rather than providing a library of functions again.

          Rahul Palamuttam Rahul Palamuttam added a comment - - edited

          I think instead of breeze, nd4j from deeplearning4j is a good library to look at; it tackles 1, 3, and 4 (2 is in the works).
          I've been using it as an experimental backend for SciSpark's sRDD.

          It also supports n-dimensional arrays like numpy (unlike breeze). It's a very young project. They have one version published in mvnrepository. I use it by pulling from the git repo and installing via maven locally.

          To give a general idea of how the nd4j library works:

          a) You can choose which backend you want - jblas, netlib-java, or x86. The x86 one uses netlib-java for BLAS operations and drops down to C level for-loops (via JNI) for elementwise-operations.

          b) They also provide a scala-api backend (with the operators) via DSL in a separate project called nd4s.

          c) They've recently provided the option of ordering elements contiguously in memory. So element-wise-operations now benefit from cache locality. (The performance is comparable to numpy and breeze)

          I guess the general consensus is the community is spread thin, so it could be worth it to wait for nd4j to mature a bit more and then tackle the problem.

          mengxr Xiangrui Meng added a comment -

          If there existed some linear algebra library in Java like numpy/scipy in Python, there would be absolutely no need to create a new one. There are a couple of factors we care about:

          1. license
          2. sparse support
          3. performance
          4. Java compatibility

          We couldn't find one that meets all 4 requirements. For commons-math, I think the problems are 2 (they are deprecating the sparse library) and 3. For breeze, the problems are 4 and, to some extent, 3. For MTJ, the problem is 1. For JBLAS/netlib-java, the problems are 2 and some concerns about 1. Those were considered in the PR that introduced sparse support a year ago. Unfortunately, Apache deleted the incubator-spark repo. But you can find the discussion here: http://apache-spark-developers-list.1001551.n3.nabble.com/GitHub-incubator-spark-pull-request-Proposal-Adding-sparse-data-suppor-tc954.html#none

          Initially, we only wanted to make a thin wrapper over breeze, but we decided not to expose breeze types in the public APIs, which is a general guideline across Spark components. Because of this, we received many complaints from users about the lack of linear algebra support. The `toBreeze` and `fromBreeze` conversions also make the implementation messy. Initially we only used a limited set of operations from breeze, whose performance we compared (github.com/mengxr/linalg-test). Later on, we started using more breeze operations and hit performance issues. So we implemented some BLAS routines for dense and sparse data, plus some operators that we need, to get good performance without worrying about Scala magic.

          To sum up, the demand for a linear algebra library comes from both external users and internal developers. The goal of this JIRA is an implementation that meets all 4 requirements. The work hasn't really started since I'm not very confident that we can meet all 4 requirements easily.
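
To make "BLAS routines for dense and sparse data" concrete, here is a minimal Java sketch of two level-1 routines, `axpy` and `dot`, where the sparse vector is stored as parallel index/value arrays. This is an illustration under stated assumptions, not MLlib's actual code: the class name `LocalBlasSketch`, the method signatures, and the storage layout are all hypothetical.

```java
import java.util.Arrays;

/** Hypothetical hand-written level-1 routines; not MLlib's implementation. */
final class LocalBlasSketch {

    /** Dense axpy: y := a * x + y, updating y in place. */
    static void axpy(double a, double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("size mismatch");
        }
        for (int i = 0; i < x.length; i++) {
            y[i] += a * x[i];
        }
    }

    /**
     * Sparse axpy: x is given by parallel arrays of indices and values
     * (an assumed layout). Only the stored entries of x are visited.
     */
    static void axpy(double a, int[] xIndices, double[] xValues, double[] y) {
        for (int k = 0; k < xIndices.length; k++) {
            y[xIndices[k]] += a * xValues[k];
        }
    }

    /** Dot product of a sparse x (indices/values) with a dense y. */
    static double dot(int[] xIndices, double[] xValues, double[] y) {
        double sum = 0.0;
        for (int k = 0; k < xIndices.length; k++) {
            sum += xValues[k] * y[xIndices[k]];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] idx = {1, 3};          // sparse x has entries at positions 1 and 3
        double[] val = {5.0, 7.0};

        double[] y = {1.0, 1.0, 1.0, 1.0};
        axpy(2.0, idx, val, y);
        System.out.println(Arrays.toString(y)); // prints [1.0, 11.0, 1.0, 15.0]

        System.out.println(dot(idx, val, new double[]{1.0, 2.0, 3.0, 4.0})); // prints 38.0
    }
}
```

The sparse variants run in time proportional to the number of stored entries rather than the vector length, which is the main reason to special-case them instead of densifying first.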

          josephkb Joseph K. Bradley added a comment -

          Xiangrui Meng Burak Yavuz What was the motivation behind not using commons math? Actually, I was not clear on the scope of this JIRA: Was it to create thin wrappers around Breeze or to re-implement some of the operations which are less efficient in Breeze?

          Sean Owen Concerning Breeze, I think the reason we don't expose it is that it does not promise stable APIs and is not backed by a big contributor base. But I agree with the sentiment that we are spread very thin and should be careful about these nice-but-not-necessary features.

          At any rate, I'm pretty sure this is not slated for 1.5, so I'll remove that target label at least.

          srowen Sean Owen added a comment -

          This is a follow-on from SPARK-9003, where the question was, what local linear algebra library can we use instead of rebuilding yet another one?

          Joseph K. Bradley Commons Math? or Breeze? These are already in use in Spark even. Commons Math is quite stable, still active, AL2, Java-friendly.

          So, Spark needs to do some local math internally. User apps need to do some local math too. The point of the light Vector API has been to shield callers from implementation details and transfer between these two domains in terms of a simple, generic representation. I can see the need to maybe add a few more methods to Vector as convenience methods, like map.

          Now, people can freely use, say, Commons Math locally in their app for any of this. Spark could offer yet another local math library, but what's the advantage?

          More generally, I find Spark spread way too thin, and would like to push back a lot more against adding yet more. I think it's always fun to look at slam-dunking problems like this again, but is it a good use of time? There is so much else that has been begun but not finished, or needs fixing. So I'm picking on this one. Maybe you have something much more modest in mind?

          Rahul Palamuttam Rahul Palamuttam added a comment - - edited

          Hi!
          I'm fairly new here but have been dealing with similar issues concerning matrix operations on Spark.
          I noticed the note in the design doc about using a JBLAS-style API, since Java does not support operator overloading.
          This is good, but MLlib should also provide the overloaded operators for Scala users. The operators would be functions wrapped around the suggested JBLAS API or the provided BLAS functions. I'm suggesting this based on my experience with Nd4j, a linear algebra library that lets users switch between the Java API and Scala operators for their linear algebra operations.

          Is it feasible to do this in MLlib?

          emrehan Emrehan Tuzun added a comment -

          The version in the comment should be updated to 1.5.


            People

            • Assignee: Unassigned
            • Reporter: brkyvz Burak Yavuz
            • Votes: 2
            • Watchers: 36