MAHOUT-1529: Finalize abstraction of distributed logical plans from backend operations

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels: None

      Description

      We have a few situations where the algorithm-facing API has Spark dependencies creeping in.

      In particular, we know of the following cases:
      (1) checkpoint() accepts the Spark constant StorageLevel directly;
      (2) certain things in CheckpointedDRM;
      (3) the drmParallelize() etc. routines in the "drm" and "sparkbindings" packages;
      (4) SparkContext must be set as an implicit val;
      (5) drmBroadcast() returns a Spark-specific Broadcast object;
      (6) Stratosphere/Flink conceptual API changes.

      Current tracker: PR #1 (https://github.com/apache/mahout/pull/1), now closed; a new PR is needed for the remaining items once ready.
      Pull requests are welcome.

        Activity

        Sebastian Schelter added a comment:

        A few more points:

        (4) SparkContext must be set as an implicit val.
        (5) drmBroadcast returns a Spark-specific Broadcast object.

        Dmitriy Lyubimov added a comment:

        My thoughts on this:

        (1) Factor out DrmLike and the logical operators into the math-scala module.
        (2) Keep the Spark-specific physical operator translations in the spark module.
        (3) Create a verbatim analog of StorageLevel in Mahout (this probably needs more careful handling; it needs investigation how it would really map onto Stratosphere, if at all. But assuming for now that we just want to walk away from a direct Spark dependency in the code, a simple 1:1 translation is probably enough).
        (4) For the drmParallelize() etc. set of routines, I see really two ways of doing this:
        (4a) wrap the engine-specific context into an "either-or" Mahout context;
        (4b) rely on the assumption that these routines are not really used in engine-agnostic algorithms, so each engine will provide semantically identical versions of them by import. At the very least, this will be required for the createMahoutContext() call.
        I am really inclined to do (4a), so as not to lock ourselves into any assumptions, except for createMahoutContext(), which will have to go into an engine-specific package. (A sketch of (4a) follows below.)

        I will have to think about CheckpointedDRM and CheckpointedDRM$rdd. Maybe the whole CheckpointedDRM also needs to be an engine-specific class.
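
        For illustration, a minimal sketch of the (4a) context-wrapping idea. The commits later in this thread do introduce DistributedContext and SparkDistributedContext; the bodies below are assumptions, not the committed code:

            import org.apache.spark.SparkContext

            // Engine-neutral context handle that algorithm code can depend on.
            trait DistributedContext {
              def close(): Unit
            }

            // The Spark flavor simply wraps the SparkContext.
            class SparkDistributedContext(val sc: SparkContext) extends DistributedContext {
              override def close(): Unit = sc.stop()
            }

            // Engine-agnostic algorithms would then take the wrapper implicitly, e.g.:
            //   def myAlgorithm(drmA: DrmLike[Int])(implicit ctx: DistributedContext) = ...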

        Dmitriy Lyubimov added a comment:

        > (4) SparkContext must be set as implicit val

        This was implied in (3).

        Dmitriy Lyubimov added a comment:

        Sebastian Schelter, what is the thinking about cache policies? Do we just map them to Spark's levels? What are the considerations w.r.t. Stratosphere here?

        Sebastian Schelter added a comment:

        I'll ask the Stratosphere guys to have a look.

        Dmitriy Lyubimov added a comment:

        I guess there's no concept of intermediate caching at all. Instead, I guess, there's a possibility that stuff like writeDRM() does not trigger a computational action, which always has to be triggered explicitly.

        Hm, how do we reconcile that?

        Sebastian Schelter added a comment:

        In Stratosphere the optimizer makes the decision about what to put in memory or on disk, and at what point. So there is no explicit caching; nevertheless, programs could have it, and Stratosphere would use it as a hint to the optimizer. I asked on the newly established Stratosphere dev list for someone to have a look at this issue.

        Dmitriy Lyubimov added a comment:

        Yes, I figured as much. Multiple sinks are defined, and then the optimizer just does what we were trying to do with collapsing common paths, or something like that.

        This is actually pretty cool. However, it does pose a conundrum of reconciling with the current semantics.

        Dmitriy Lyubimov added a comment:

        A first idea is that we introduce and require an explicit computational action operator, something like drmExecute()(implicit ctx), that would force computation in Stratosphere and, in the case of Spark, be ignored; see the sketch below.

        Similarly, cache instructions would just be ignored for Stratosphere.

        Generally, coarse-iterative algorithms (<50 iterations: ALS, SSVD) could just ignore the drmExecute() API altogether, leaving it to the caller to execute.
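
        A minimal sketch of that operator, assuming a hypothetical execute() hook on the context; none of this is committed API:

            // Explicit computational action, interpreted per engine.
            trait DistributedContext {
              def execute(): Unit
            }

            // Spark: actions such as writeDRM()/collect() already trigger jobs,
            // so the explicit action is a no-op.
            class SparkDistributedContext extends DistributedContext {
              override def execute(): Unit = ()
            }

            // Stratosphere: sinks are declared lazily; execute() would run the
            // whole deferred plan (delegating to the environment's execution call).
            class StratosphereDistributedContext extends DistributedContext {
              override def execute(): Unit = { /* run the deferred sinks/plan here */ }
            }

            object drm {
              // Engine-agnostic code calls this; each engine decides what it means.
              def drmExecute()(implicit ctx: DistributedContext): Unit = ctx.execute()
            }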

        Sebastian Schelter added a comment:

        Why would we need that explicit execute operator for Stratosphere?

        Anand Avati added a comment:

        Some thoughts:

        As an algorithm implementor, does one really care about platform-specific details like checkpoint(mem) vs. checkpoint(disk) vs. cache(), etc.? Would it not be enough to present one generic call, like .materialize(), which would either trigger the computation in the physical layer or give it a hint? For persistence, why not just have the explicit .writeDRM() and be done? So as an API consumer there is:

        .materialize() – triggers the optimizer and computation, thereby avoiding future duplicate evaluations; translates to .checkpoint(MEM) in Spark, for example.
        .writeDRM(filename) – serializes the computed DRM to the persistence store (implies materialization if not already done).
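
        A sketch of the proposed two-call consumer surface; the method names come from the comment above, everything else is an assumption:

            // Hypothetical engine-agnostic handle exposing only two actions.
            trait MaterializableDrm[K] {
              // Trigger (or hint) computation so later uses don't recompute;
              // on Spark this might translate to checkpoint at MEMORY_ONLY.
              def materialize(): MaterializableDrm[K]

              // Persist to storage; implies materialization if not already done.
              def writeDRM(path: String): Unit
            }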

        Sebastian Schelter added a comment:

        The problem with having a no-arg materialize operator is that our optimizer would have to decide how to materialize the data for Spark (in memory, in memory in a serialized fashion, or on disk). I don't think we can/should make that decision ourselves. If people run into OOMs with Spark, we have to give them something to work around that (e.g. allow them to tell the system to use a different storage level).

        What do you think about keeping those storage levels, but interpreting them as hints to the underlying system? That indicates to the user that the system might make a (hopefully) smarter decision, e.g. something like

        drm.cache(CacheHint.IN_MEMORY)
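
        For illustration, such a hint enum could mirror Spark's storage levels one-to-one. Later commits on this ticket do introduce a CacheHint enum, though the exact values below are assumptions:

            // Engine-neutral cache hint. Engines without explicit caching
            // (e.g. an optimizer-managed one) are free to ignore it.
            object CacheHint extends Enumeration {
              type CacheHint = Value
              val NONE,
                  IN_MEMORY,            // Spark: StorageLevel.MEMORY_ONLY
                  IN_MEMORY_SERIALIZED, // Spark: StorageLevel.MEMORY_ONLY_SER
                  DISK_ONLY,            // Spark: StorageLevel.DISK_ONLY
                  MEMORY_AND_DISK = Value // Spark: StorageLevel.MEMORY_AND_DISK
            }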
        Dmitriy Lyubimov added a comment (edited):

        Anand Avati, the use patterns of optimizer checkpoints are discussed at length in my talk. The two basic use cases are explicit management of cache policies and common computational paths.

        Sebastian Schelter:

        > Why would we need that explicit execute operator for Stratosphere?

        Correct me if I am reading Stratosphere wrong (I still haven't run a single program on it, so please forgive me being a bit superficial here). The Stratosphere programming API implies that we may define more than one sink in the graph (i.e. writeDRM() calls) without triggering a computational action. How would we trigger it if sink definitions such as writeDRM() don't trigger it anymore?

        Also, the collect() stuff is not clear; I guess it doesn't have a direct mapping either, until Stephan finishes his promised piece on it.

        -d

        Anand Avati added a comment:

        I think we also need:

        (6) Rename mahout spark-shell (both the command and the source dirs/files/variables) to "mahout shell" (or mahout console?), which only uses the logical layer, with the backend layer selected at runtime/startup.

        Should this be a separate JIRA? It overlaps with this JIRA, I think.

        Dmitriy Lyubimov added a comment:

        > I think we also need
        >
        > (6) Rename mahout spark-shell (both command and source dir/files/variables) to "mahout shell" (or mahout console?) which only uses the logical layer and backend layer is selected at runtime/startup.

        No, we don't. The shell is in essence Spark's REPL; in that sense it is exactly and literally spark-shell. It includes bytecode mechanisms to compile closures on the fly and pass them to the backend.

        How other engines would want to do that, I have no clue. The chances for a generic (and cheap) Mahout shell are very slim, IMO.

        Anand Avati added a comment:

        Dmitriy Lyubimov, are you actively working on this separation? I had started the separation work, and if you are actively working on it, I will abandon my effort.

        Dmitriy Lyubimov added a comment:

        IMO this doesn't tolerate haste; it is very conceptual. I won't commit anything until we have finished the discussion on the things I outlined for Stratosphere.

        There are small items that I may commit soon (such as wrapping up the context and cache manager).

        I think the best way to work this issue is to create a GitHub side branch and keep squash-committing to it little by little, while handling small additions via pull requests.

        BTW, I truly envy the Spark process, which is 100% handled by GitHub pull requests. Does anybody know how they manage to push it back to Apache from there?

        Dmitriy Lyubimov added a comment:

        Tracking here: https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529
        Anand Avati added a comment:

        Dmitriy Lyubimov, the purpose of the spark-shell, AFAICT, is to present an interactive interface for working with the DSL. To that end, there is little reason to use the Spark REPL for the purpose; we could use a vanilla Scala REPL with the DSL and operators pre-loaded (imported). And this should work just as well with Spark as using the Spark REPL does. Doesn't that feel much cleaner?

        Dmitriy Lyubimov added a comment (edited):

        Yes, it's cleaner, but it is not even clear whether it is achievable, and if it is, it is expensive. Like I said, you are welcome to try: if it works with Spark identically to the REPL, there will be no argument against using it in favor of the REPL.

        But my budget for this is very limited, so it is not the most pragmatic path for me to get things done.

        Bottom line: something that works today beats something hypothetical.

        Anand Avati added a comment:

        Dmitriy Lyubimov, by "expensive", I assume (and hope) you only mean expensive for you in terms of time. Or did you mean expensive in some other way? I am willing to investigate the feasibility; I always intended that. The question really was: do you think it is relevant enough to this same JIRA, or worthy of a new one?

        Dmitriy Lyubimov added a comment (edited):

        It is expensive in anybody's time (compared to REPL adaptation). I certainly won't try to do it at this point. If you want to do it, yes, please file a new JIRA.

        Also, the REPL cannot be used with Mahout as-is, so yes, what we did is well warranted.

        I am not sure about re-branding; we did too little to warrant that, indeed. But the REPL can't work with Mahout, or at the very least it is awkward to do so manually (i.e. tracing all the Mahout jar dependencies, adding them to the session, and making sure all the proper imports are done).

        Ted Dunning added a comment:

        > BTW I truly envy the Spark process which is 100% handled by github pull requests. Does anybody knows how they manage to push it back to Apache from there?

        It would require that we switch over to git for the main Mahout repo.

        If we want to discuss that, we should move to a separate thread.

        Sebastian Schelter added a comment:

        I just know that it was discussed during their graduation. We could simply ask on their mailing list how they do it.

        --sebastian

        Anand Avati added a comment:

        Dmitriy Lyubimov, Sebastian Schelter, I would appreciate comments/reviews on the shell separation patch at https://issues.apache.org/jira/browse/MAHOUT-1544

        Hudson added a comment:

        FAILURE: Integrated in Mahout-Quality #2606 (See https://builds.apache.org/job/Mahout-Quality/2606/)
        MAHOUT-1529: completely abstracting away dssvd, dspca and dqr: introducing CacheHint enum

        Squashed commit of the following:

        commit 6c4bf1650f0e87d0d1fc5b9b23c94f6e3553b74d
        Merge: a748e8b 0c5a754
        Author: Dmitriy Lyubimov <dlyubimov@apache.org>
        Date: Tue May 6 18:20:36 2014 -0700

        Merge branch 'trunk' into MAHOUT-1529

        commit a748e8b8be2ad7ce44af231147b236726704b561
        Author: Dmitriy Lyubimov <dlyubimov@apache.org>
        Date: Tue May 6 18:19:35 2014 -0700

        MAHOUT-1529: completely abstracting away dssvd, dspca and dqr: introducing CacheHint enum (dlyubimov: rev 1592933)

        • /mahout/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSPCA.scala
        • /mahout/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSSVD.scala
        • /mahout/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CacheHint.scala
        • /mahout/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmBase.scala
        • /mahout/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/drm/DrmLike.scala
        • /mahout/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/drm/package.scala
        • /mahout/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/CheckpointAction.scala
        • /mahout/trunk/spark/src/test/scala/org/apache/mahout/sparkbindings/decompositions/MathSuite.scala
        • /mahout/trunk/spark/src/test/scala/org/apache/mahout/sparkbindings/drm/RLikeDrmOpsSuite.scala
        Anand Avati added a comment:

        Another thing I notice is that drmBroadcast() returns a raw org.apache.spark.Broadcast variable. I'm thinking a simple wrapper around it to create an abstraction for various backends would be nice. Thoughts?
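
        A minimal sketch of such a wrapper; the PR that later closes this ticket does add BCast.scala and SparkBCast.scala, but the bodies here are assumptions:

            import org.apache.spark.broadcast.Broadcast

            // Engine-neutral broadcast handle: algorithm code only needs value().
            trait BCast[T] {
              def value: T
            }

            // Spark-side implementation wrapping the raw Broadcast object.
            class SparkBCast[T](val bcast: Broadcast[T]) extends BCast[T] {
              override def value: T = bcast.value
            }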

        Anand Avati added a comment:

        I see drmBroadcast() has already been listed (somehow I did not find it the last time I looked).

        Dmitriy Lyubimov added a comment:

        OK, I started nudging this forward a bit and did a couple of fairly drastic refactorings, moving API parts to math-scala. math-scala should compile. The decompositions are moved too.

        Things left include moving the package-level routines requiring an implicit context, fixing the spark and spark-shell modules, and moving tests where appropriate.

        With tests, a little conundrum is that we don't have a "local" engine; we would use "Spark local" for that, i.e. some concrete engine. So even though the decomposition code now lives entirely in math-scala with no Spark dependencies, it looks like its tests will still have to live in the spark module, where unit testing in local Spark mode is defined. That kind of makes sense, since we will probably want to run MathSuite separately for each engine we add; but it is a bit weird, since it keeps something like ssvd() apart from its engine-specific tests.

        Anand Avati added a comment (edited):

        Dmitriy Lyubimov, I had a quick look at the commits, and it looks like a much cleaner separation now. Some comments:

        • Should DrmLike really be a generic class like DrmLike[T] where T is unbounded? For example, it does not make sense to have DrmLike[String]. The only meaningful ones are probably DrmLike[Int] and DrmLike[Double]. Is there some way we can restrict DrmLike to just Int and Double? Or fixate on just Double? While RDD supports arbitrary T, H2O supports only numeric types, which is sufficient for Mahout's needs. (See the sketch after this list.)

        UPDATE: I see that historically a DRM's row index need not necessarily be numerical. In practice, could this be anything other than a number or a string?

        • I am toying around with the new separation to build a pure, from-scratch local/in-memory "backend" which communicates through ByteArrayStream Java serialization. I am hoping this will not only serve as a reference for future backend implementors, but also help keep the algorithms' test cases inside math-scala. Thoughts?
        • `type DrmTuple[K] = (K, Vector)` is probably better placed in spark/../package.scala, I think, as it is really an artifact of how the RDD is defined. However, BlockifiedDrmTuple[K] probably still belongs in math-scala.
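
        For illustration, one way to restrict the key type without fixing it to a single type is type-class evidence. This is a sketch only (the DrmLike stub is simplified, and the evidence trait is hypothetical); as the reply below notes, Mahout ultimately keeps the key type open:

            // Hypothetical evidence restricting DRM row-key types.
            sealed trait DrmKeyEvidence[K]
            object DrmKeyEvidence {
              implicit object IntKey    extends DrmKeyEvidence[Int]
              implicit object LongKey   extends DrmKeyEvidence[Long]
              implicit object StringKey extends DrmKeyEvidence[String]
            }

            trait DrmLike[K] { def nrow: Long; def ncol: Int }

            object drm {
              // A factory like this compiles only for supported key types.
              def drmEmpty[K: DrmKeyEvidence](nrow: Long, ncol: Int): DrmLike[K] = ???
            }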
        Dmitriy Lyubimov added a comment:

        DRM is the legacy Mahout format inherited from all the MapReduce solvers.

        Perhaps one of the most popular commands, `seq2sparse`, produces string keys (the full document path name in the original corpus). A lot of solvers are agnostic propagators of the keys: SSVD -> U, both the MR and DSL versions; so are DSPCA, thinQR, and (I think) current and future versions of factorizers such as ALS. For more examples of what a key can be, see "Mahout in Action", or bug the authors. Going forward, I am very likely to use more involved object structures internally as a key payload.

        I honestly don't see value in a separate "local" backend, as Spark already provides one. It is very unlikely to be used.

        The tuple definitions don't depend on Spark; at this point I don't see a reason to make them engine-specific.

        Dmitriy Lyubimov added a comment:

        This is now tracked in GitHub pull request #1. As far as I understand, patches to this issue require issuing PRs against the origin of this PR (branch MAHOUT-1529-a in the github.com:dlyubimov/mahout fork).

        Dmitriy Lyubimov added a comment:

        Sebastian Schelter (or whoever wants to), could you please take a look at PR #1 on GitHub? I need at least one review to continue. I want to commit it in order not to diverge too much; it will get more difficult the longer I wait.

        This does not finalize the issue w.r.t. the Stratosphere model (especially its multi-sink model), but we can tackle that later, once they are closer to what they said they'd do. Thanks.

        ASF GitHub Bot added a comment:

        Github user jfarrell commented on the pull request:

        https://github.com/apache/mahout/pull/1#issuecomment-44318648

        MAHOUT-1529 not linking to jira as discussed in INFRA-7801

        Sebastian Schelter added a comment:

        Hi Dmitriy,

        the PR looks good, +1 from me, go ahead!

        Best,
        Sebastian

        ASF GitHub Bot added a comment:

        Github user asfgit closed the pull request at:

        https://github.com/apache/mahout/pull/1

        Hudson added a comment:

        SUCCESS: Integrated in Mahout-Quality #2620 (See https://builds.apache.org/job/Mahout-Quality/2620/)
        MAHOUT-1529 closes PR #1 (dlyubimov: rev 8714a0f722663ea5cb16c14c5b8a01e57574cd93)

        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpAtAnyKey.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/SparkBCast.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/DrmLikeOps.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/drm/DrmLikeSuite.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAx.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/CacheHint.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpAx.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/CheckpointAction.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/blas/AtA.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpRowRange.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpABt.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/BCast.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAtB.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/DrmRddInput.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/blas/MapBlock.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/RLikeDrmOps.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/blas/AewBSuite.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpAtA.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/io/MahoutKryoRegistrator.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/blas/AinCoreB.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/RLikeDrmOps.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/DrmLike.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/DistributedContext.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAtx.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/SparkDistributedContext.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/blas/AtSuite.scala
        • math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/drm/DrmLikeOpsSuite.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/test/MahoutLocalContext.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpABAnyKey.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/blas/At.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/AbstractBinaryOp.scala
        • math-scala/pom.xml
        • math-scala/src/main/scala/org/apache/mahout/math/scalabindings/decompositions/SSVD.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CacheHint.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/package.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAB.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/blas/ABtSuite.scala
        • spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/MahoutSparkILoop.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/DrmLike.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpAtB.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/AbstractBinaryOp.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/drm/RLikeDrmOpsSuite.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpTimesRightMatrix.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DQR.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpAewB.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/CheckpointedOps.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpABAnyKey.scala
        • spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/Main.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrm.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/CheckpointAction.scala
        • spark-shell/src/test/mahout/simple.mscala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpABt.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpAewScalar.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAtAnyKey.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/blas/Slicing.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/decompositions/DSPCA.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/decompositions/MathSuite.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAt.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSparkOps.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpMapBlock.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpAtx.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSSVD.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/decompositions/DQR.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/blas/package.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAewScalar.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/blas/AtASuite.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/blas/DrmRddOps.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/blas/AewB.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpAt.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedOps.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/AbstractUnaryOp.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/package.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/blas/ABt.scala
        • math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpTimesLeftMatrix.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpMapBlock.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSPCA.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmBase.scala
        • spark/pom.xml
        • math-scala/src/main/scala/org/apache/mahout/math/drm/decompositions/DSSVD.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpAB.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/OpRowRange.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/DrmLikeOps.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/CheckpointedDrm.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/blas/Ax.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpTimesLeftMatrix.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAewB.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/DistributedEngine.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/blas/AtB.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/plan/AbstractUnaryOp.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpTimesRightMatrix.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAtA.scala
        Gokhan Capan added a comment (edited):

        Dmitriy Lyubimov, I imagine that in the near future we will want to add a matrix implementation with fast row and column access for memory-based algorithms such as neighborhood-based recommendation. This could be a new persistent store engineered for locality preservation for kNN, the new Solr backend potentially cast to a Matrix, or something else.

        Anyway, my point is that we may want to add different types of distributed matrices with engine- (or data-structure-) specific strengths in the future. I suggest turning each behavior (such as caching) into an additional trait, which the distributed execution engine (or data structure) author can mix into her concrete implementation (for example, Spark's matrix is one with Caching and Broadcasting). It might even help with easier logical planning (if it supports caching, cache it; if partitioned in the same way, do this, else do that; if one matrix is small, broadcast it; etc.).

        So I suggest a base Matrix trait with nrows and ncols methods (as it currently is), a BatchExecution trait with methods for partitioning and parallel-execution behavior, a Caching trait with methods for caching/uncaching behavior, and, in the future, a RandomAccess trait with methods for accessing rows and columns (and possibly cells); see the sketch below.

        Then a concrete DRM(-like) would be a Matrix with BatchExecution and possibly Caching, a concrete RandomAccessMatrix would be a Matrix with RandomAccess, and so on. What do you think, and, if you and others are positive, how do you think this should be handled?
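
        A minimal sketch of the proposed mixin layout; all the trait names come from the proposal above, not from Mahout's code:

            import org.apache.mahout.math.Vector

            trait Matrix {
              def nrows: Long
              def ncols: Int
            }

            trait BatchExecution { /* partitioning / parallel-execution hooks */ }

            trait Caching {
              def cache(): this.type
              def uncache(): this.type
            }

            trait RandomAccess {
              def row(i: Long): Vector   // fast row access
              def column(j: Int): Vector // fast column access
            }

            // Concrete implementations mix in only what they support, e.g.:
            //   class SparkDrm extends Matrix with BatchExecution with Caching { ... }
            //   class SolrMatrix extends Matrix with RandomAccess { ... }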

        Dmitriy Lyubimov added a comment:

        Gokhan Capan, as I explained before, I don't favor a common hierarchy for matrices with vastly different programming models. I could roll out about a dozen bona-fide arguments as to why, but I am kind of tired of speaking on the topic. The main inference of this work is that common algebraic traits are semantically identical but not always identical in signatures, and moreover these signature mis-identities do not necessarily follow any functional split. This makes common hierarchies inelegant. The second takeaway from this work is that DSL features plus the IDEA Scala plugin are more than a match for the OOA approach as far as readability and maintainability are concerned. If we go by Stroustrup's circa-1989 claim that OOA is nothing but code organization for maintainability purposes, then a DSL renders strict OOA design moot.

        Gokhan Capan added a comment:

        Alright, I'm sold.

        Ted Dunning added a comment:

        I am not sold, but I don't think it is germane to this bug. Whether this is done with common inheritance or with standardized traits is an open question, but the overall push is very important: we need some common characteristics that can be communicated easily to the person writing code.

        We also need to support some operations that are very inefficient, for the cases where they just need to be done. It is easy to come up with scenarios, such as diagnostic systems, that need to get the value of a single cell and damn the cost. That doesn't make element-by-element access a primary idiom; it just means that it is possible to do.
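        A hedged illustration of that point with the current DRM API (drmA is an assumed DRM handle): reading one cell is possible, but it costs a full materialization.

            import org.apache.mahout.math.scalabindings._

            // Possible but deliberately expensive: bring the whole distributed
            // matrix to the driver just to inspect a single cell. Fine for
            // diagnostics, not a primary idiom.
            val inCore = drmA.collect // materializes the entire DRM in driver memory
            val cell   = inCore(0, 0) // element access on the in-core copy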

        dlyubimov Dmitriy Lyubimov added a comment -

        There's a very well-established way of communicating a DSL to end users by now (see, for example, the scalatest manual); literally dozens of projects do it. None of the projects I know of explains its object model to the end user in order to expose the DSL.

        githubbot ASF GitHub Bot added a comment -

        Github user avati commented on the pull request:

        https://github.com/apache/mahout/pull/15#issuecomment-45653931

        I assumed this is part of MAHOUT-1529 itself (which renamed @sc to @sdc). Let me resubmit with MAHOUT-1529 in the commit message?

        githubbot ASF GitHub Bot added a comment -

        Github user dlyubimov commented on the pull request:

        https://github.com/apache/mahout/pull/15#issuecomment-45654741

        MAHOUT-1529 is closed now. Besides, this doesn't have anything to do with the shell.

        It's fine; this is a small change, so I'll merge it without an issue.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2690 (See https://builds.apache.org/job/Mahout-Quality/2690/)
        MAHOUT-1529: third collection of various edits against private branch (dlyubimov: rev e4ba7887fc6dbf17c3d73f8d4aa1045eeb48d53e)

        • spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala
        • math-scala/src/main/scala/org/apache/mahout/math/scalabindings/VectorOps.scala
        • spark/src/test/scala/org/apache/mahout/math/decompositions/MathSuite.scala
        • math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala
        • math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala
        • math-scala/src/main/scala/org/apache/mahout/math/scalabindings/RLikeOps.scala
        • math-scala/src/main/scala/org/apache/mahout/math/decompositions/package.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/blas/package.scala
        • math-scala/src/main/scala/org/apache/mahout/math/decompositions/ALS.scala
        hudson Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2698 (See https://builds.apache.org/job/Mahout-Quality/2698/)
        MAHOUT-1529 (d): moving core engine-independent tests logic to math-scala, spark module running them. (dlyubimov: rev 25a6fc0967357e6ba4aafcaf11bf3f7faec752fd)

        • spark/src/test/scala/org/apache/mahout/math/decompositions/DistributedDecompositionsSuite.scala
        • math-scala/src/test/scala/org/apache/mahout/math/drm/DrmLikeSuiteBase.scala
        • spark/src/test/scala/org/apache/mahout/math/decompositions/MathSuite.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/drm/DrmLikeOpsSuite.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/test/DistributedSparkSuite.scala
        • math-scala/src/test/scala/org/apache/mahout/math/decompositions/DistributedDecompositionsSuiteBase.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/blas/ABtSuite.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/test/LoggerConfiguration.scala
        • spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
        • spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
        • math-scala/src/test/scala/org/apache/mahout/math/drm/RLikeDrmOpsSuiteBase.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/blas/AtSuite.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/drm/DrmLikeSuite.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/blas/AtASuite.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/test/MahoutLocalContext.scala
        • math-scala/src/test/scala/org/apache/mahout/test/DistributedMahoutSuite.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/blas/AewBSuite.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/drm/RLikeDrmOpsSuite.scala
        • math-scala/src/test/scala/org/apache/mahout/math/drm/DrmLikeOpsSuiteBase.scala
        githubbot ASF GitHub Bot added a comment -

        GitHub user avati opened a pull request:

        https://github.com/apache/mahout/pull/29

        MAHOUT-1529: Move dense/sparse matrix test in mapBlock into spark/

        In the h2o engine, the Matrix provided to mapBlock() is an instance of
        "H2OBlockMatrix extends AbstractMatrix", and is neither a DenseMatrix
        nor a SparseMatrix. H2OBlockMatrix is a zero-copy virtual Matrix exposing
        just the partition's data (created at almost no expense); it creates
        a copy-on-write Matrix only if modified by the block map function.

        These two tests therefore fail with h2obindings, so they are moved
        into the spark module.

        Signed-off-by: Anand Avati <avati@redhat.com>

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/avati/mahout MAHOUT-1529e

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/mahout/pull/29.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #29


        commit 1e3cdb68198636c9f38f2d41d782d12edba7a2f7
        Author: Anand Avati <avati@redhat.com>
        Date: 2014-07-15T00:20:09Z

        MAHOUT-1529: Move dense/sparse matrix test in mapBlock into spark/

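        For context, a hedged sketch of the portability concern: inside mapBlock() the block is only guaranteed to be a Matrix, so engine-agnostic tests should go through that interface rather than assert concrete subtypes (standard Mahout imports assumed; drmA is an assumed DRM handle).

            import org.apache.mahout.math.drm._
            import org.apache.mahout.math.drm.RLikeDrmOps._
            import org.apache.mahout.math.scalabindings.RLikeOps._

            // Engine-agnostic: touch the block only through the Matrix interface.
            val drmB = drmA.mapBlock() { case (keys, block) =>
              keys -> (block * 2) // scale every element; works on any Matrix impl
            }

            // Engine-specific (spark module only): asserting a concrete type would
            // fail on h2o, where the block is an H2OBlockMatrix.
            // assert(block.isInstanceOf[org.apache.mahout.math.DenseMatrix])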


        githubbot ASF GitHub Bot added a comment -

        Github user avati commented on the pull request:

        https://github.com/apache/mahout/pull/29#issuecomment-49210898

        @dlyubimov - review/merge appreciated

        githubbot ASF GitHub Bot added a comment -

        Github user dlyubimov commented on the pull request:

        https://github.com/apache/mahout/pull/29#issuecomment-49670601

        looks fine to me.

        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/mahout/pull/29

        hudson Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2707 (See https://builds.apache.org/job/Mahout-Quality/2707/)
        MAHOUT-1529(e): Move dense/sparse matrix test in mapBlock into spark (Anand Avati via dlyubimov) (dlyubimov: rev dec441fb895c96d1e756619d15d75bba00b10fa3)

        • math-scala/src/test/scala/org/apache/mahout/math/drm/DrmLikeSuiteBase.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/drm/DrmLikeSuite.scala
        • CHANGELOG

          People

          • Assignee: dlyubimov Dmitriy Lyubimov
          • Reporter: dlyubimov Dmitriy Lyubimov
          • Votes: 0
          • Watchers: 8
