Mahout / MAHOUT-874

Extract Writables into a separate module to allow smaller dependencies

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: None
    • Labels:
      None

      Description

      The theory is that we can have a smaller jar if we only include writable classes and their exact dependencies.

      I have a prototype, but it has some funky characteristics which I would like to discuss.

        Activity

        Ted Dunning added a comment -

        With a quick slash, the dependencies are down to this

        org.apache.mahout:mahout-math:0.6-SNAPSHOT
        org.apache.hadoop:hadoop-core:0.20.204.0
        commons-cli:commons-cli:1.2
        commons-httpclient:commons-httpclient:3.0.1
        commons-codec:commons-codec:1.4
        commons-configuration:commons-configuration:1.6
        org.codehaus.jackson:jackson-core-asl:1.8.2
        org.codehaus.jackson:jackson-mapper-asl:1.8.2
        org.slf4j:slf4j-api:1.6.1
        org.slf4j:slf4j-jcl:1.6.1
        junit:junit:4.8.2
        

        The number of classes I had to move was a bit surprising. This will result in some ugliness in coding because different pieces of packages will need to be in different places.

        The complete list of classes in the writable jar is this:

        src/main/java/org/apache/mahout/cf/taste/hadoop/EntityCountWritable.java
        src/main/java/org/apache/mahout/cf/taste/hadoop/EntityEntityWritable.java
        src/main/java/org/apache/mahout/cf/taste/hadoop/EntityPrefWritable.java
        src/main/java/org/apache/mahout/cf/taste/hadoop/EntityPrefWritableArrayWritable.java
        src/main/java/org/apache/mahout/cf/taste/hadoop/item/PrefAndSimilarityColumnWritable.java
        src/main/java/org/apache/mahout/cf/taste/hadoop/item/VectorAndPrefsWritable.java
        src/main/java/org/apache/mahout/cf/taste/hadoop/item/VectorOrPrefWritable.java
        src/main/java/org/apache/mahout/cf/taste/hadoop/RecommendedItemsWritable.java
        src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericRecommendedItem.java
        src/main/java/org/apache/mahout/cf/taste/recommender/RecommendedItem.java
        src/main/java/org/apache/mahout/classifier/sgd/PolymorphicWritable.java
        src/main/java/org/apache/mahout/clustering/AbstractCluster.java
        src/main/java/org/apache/mahout/clustering/Cluster.java
        src/main/java/org/apache/mahout/clustering/ClusterObservations.java
        src/main/java/org/apache/mahout/clustering/Model.java
        src/main/java/org/apache/mahout/clustering/spectral/common/IntDoublePairWritable.java
        src/main/java/org/apache/mahout/clustering/spectral/common/VertexWritable.java
        src/main/java/org/apache/mahout/clustering/WeightedPropertyVectorWritable.java
        src/main/java/org/apache/mahout/clustering/WeightedVectorWritable.java
        src/main/java/org/apache/mahout/common/ClassUtils.java
        src/main/java/org/apache/mahout/common/IntPairWritable.java
        src/main/java/org/apache/mahout/common/parameters/Parameter.java
        src/main/java/org/apache/mahout/common/parameters/Parametered.java
        src/main/java/org/apache/mahout/graph/linkanalysis/VectorElementWritable.java
        src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/DenseBlockWritable.java
        src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SparseRowBlockWritable.java
        src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SplitPartitionedWritable.java
        src/main/java/org/apache/mahout/math/MatrixWritable.java
        src/main/java/org/apache/mahout/math/MultiLabelVectorWritable.java
        src/main/java/org/apache/mahout/math/Varint.java
        src/main/java/org/apache/mahout/math/VarIntWritable.java
        src/main/java/org/apache/mahout/math/VarLongWritable.java
        src/main/java/org/apache/mahout/math/VectorWritable.java
        

        The disappointment is really with the Cluster class. It had to move and that pulled a bunch of other things across.

        What is the sense about this?

        Lance Norskog added a comment -

        If you're going to unify clustering and classification, then some cluster-only classes will disappear, right? Perhaps only core writables should be pushed out?

        This is a poster child for using a few weakly typed data structures instead of many strongly typed structures. A cluster is a graph, so use graph-oriented structures instead of custom ones.

        Jake Mannix added a comment -

        If Cluster is bringing in too much, maybe in this first pass, we don't move it over? Keep this new jar/module small for now, and leave as a future JIRA ticket to find a way to extract Cluster out of core and get it into the writable module.

        Realistically, we could keep this jar super tiny to start with (the taste/*Writables, o.a.m.common.*Writable, and o.a.m.math.*Writable), and only pull in more complicated stuff that isn't properly decoupled later.

        Ted Dunning added a comment -

        Jake, that is reasonable, but I think I would just attack it the other way round by moving Cluster to the writable jar now and then moving it back later when possible. The size of the jar is tiny and Cluster doesn't affect the dependencies.

        Grant Ingersoll added a comment - - edited

        Why is Cluster even dependent on VectorWritable? Shouldn't it just be dependent on Vector? Seems to me that VectorWritable should only ever be instantiated inside of a Map/Reduce job. All the core stuff should just take Vector.

        Stuff like:

        @Override
        public void observe(VectorWritable x) {
          observe(x.get());
        }
        

        just seems silly. We already have observe(Vector).

        Not that it necessarily solves the problem just yet, but it still strikes me as not needed. Perhaps the same is also true for Model? In fact, could Model be moved to Math? It seems fairly generic and perhaps useful outside of clustering. Then we could have a ModelWritable which takes care of the Writable part of it.
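A minimal sketch of that split, assuming a plain model class with no serialization code and a separate wrapper that owns the wire format. The names `MeanModel` and `MeanModelWritable` are hypothetical, a scalar running mean stands in for a Vector-based model, and the wrapper only mirrors the two-method shape of Hadoop's `Writable` (`write(DataOutput)` and `readFields(DataInput)`) using `java.io` so that the example runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ModelWritableSketch {

    /** Hypothetical plain model: no Hadoop types, no serialization code. */
    static class MeanModel {
        private final double sum;
        private final long count;
        private double pendingSum;
        private long pendingCount;

        MeanModel() { this(0.0, 0L); }

        MeanModel(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        /** Core code observes plain values, never a Writable wrapper. */
        void observe(double x) {
            pendingSum += x;
            pendingCount++;
        }

        double totalSum() { return sum + pendingSum; }
        long totalCount() { return count + pendingCount; }
        double mean() { return totalCount() == 0 ? 0.0 : totalSum() / totalCount(); }
    }

    /** Hypothetical wrapper: the only place that knows the byte layout. */
    static class MeanModelWritable {
        private MeanModel model;

        MeanModelWritable(MeanModel model) { this.model = model; }

        MeanModel get() { return model; }

        void write(DataOutput out) throws IOException {
            out.writeDouble(model.totalSum());
            out.writeLong(model.totalCount());
        }

        void readFields(DataInput in) throws IOException {
            double sum = in.readDouble();
            long count = in.readLong();
            model = new MeanModel(sum, count);
        }
    }

    /** Serializes a model to bytes and reads it back through the wrapper. */
    static MeanModel roundTrip(MeanModel original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new MeanModelWritable(original).write(new DataOutputStream(bytes));
            MeanModelWritable copy = new MeanModelWritable(new MeanModel());
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MeanModel model = new MeanModel();
        model.observe(2.0);
        model.observe(4.0);
        System.out.println(roundTrip(model).mean());
    }
}
```

With this layout only the wrapper would need to live next to Hadoop; the model itself could sit in a module with no Hadoop dependency.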

        Ted Dunning added a comment -

        Dependency on *Writable isn't a problem.

        The problem is that Clusters are writable or that certain writables depend on them.

        See Model.

        Jake Mannix added a comment -

        Hey Ted,

        Is there a way we can revive this / get this in shape? This issue is blocking getting Mahout integrated with some projects we have (that don't want all of mahout-core's baggage, but want writables + math). Is this patch out of date?

        Sean Owen added a comment -

        Is this purely an issue of the size of your resulting jar? You can do this more effectively with a one-liner with ProGuard in your build. I imagine it's convoluted to pull out the Writables, and it is going to make everyone else need two jars where there was one.

        Ted Dunning added a comment -

        I am sure that the patch is out of date and my git repo is a much easier
        place to get a coherent change.

        My problem is that this drops the Mahout size to a few tens of KB, but it
        doesn't get rid of the dependencies, which bloat the package back to about
        10MB. See this, for instance:

        $ pwd
        /Users/tdunning/Apache/mahout/writables
        $ du -sh target/*.jar*
        9.8M target/mahout-writables-0.6-SNAPSHOT-jar-with-dependencies.jar
        48K target/mahout-writables-0.6-SNAPSHOT-sources.jar
        60K target/mahout-writables-0.6-SNAPSHOT.jar

        Would this even make a difference to you?

        On Mon, Dec 19, 2011 at 10:37 AM, Jake Mannix (Commented) (JIRA) <

        Ted Dunning added a comment -

        Hmm....

        Looking at this again, the biggest dependency is Hadoop. Presumably, that
        will be available in your cluster.

        Jake Mannix added a comment -

        Yes, the primary problem is that of jar hell and transitive dependencies. Mahout-math depends on very little that it really needs (other than guava): both commons-math and uncommons-math are only used in a few places, and can be <exclude>'ed from ivy/maven imports for most apps. Once you go to mahout-core, the list of dependencies grows pretty huge, and keeping track of an ever-longer exclude list can be unwieldy.

        So it's not the size, per se, but the stuff that gets pulled in. Any maven artifact which can be included with just a few <exclude>hadoop</exclude> bits and yet still only bring in just a few things would make it much easier to convince other teams to pull this in.

        Sean Owen added a comment -

        Separating out a few classes won't change what they depend on, and won't cause you to need any more or fewer classes at runtime. Your jar hell is the same.
        Is the issue Maven packaging all the transitive dependencies? If that's your issue then again, a run through ProGuard (with properly configured entry points) will strip out not just the Mahout code you don't use but anything else you don't use. I think that is maybe the better solution to the particular issue you face? These things otherwise seem pretty "core" and live where they should live for the general user.

        Jake Mannix added a comment -

        The *Writables depend on very little other than core Mahout classes (internal) and Hadoop. The runtime jar hell is completely minimized. Look at the list of dependencies the *Writables would depend on compared to the mahout-core package.

        Just as mahout-math is totally "core" to what we do, it's still really nice that it's in its own jar with very minimal external dependencies. mahout-writables could be equally slim and non-dependent, and allow for a jar which lets people read/write wire-compatible data with us without depending on everything that mahout-core pulls in.

        Re: ProGuard, I don't think I have much say in changing the way our build system works. We use ivy, and I can depend on stuff from maven repos, and put in exclude statements, but that's about it. This is very similar to other places I've worked, as this is a pretty common issue.

        Sean Owen added a comment -

        What all the classes in core depend on doesn't matter if you are only using the Writables. Then it only matters what the Writable classes depend on; unused classes are never loaded and have no effect. But then depending on a jar of just the Writables doesn't change what you need at runtime, so how does this help your use case? I assume it's not a runtime issue then.
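The runtime point can be illustrated with a small sketch. The JVM initializes a class only on its first active use, and a static initializer makes that visible. The class names below are hypothetical stand-ins for "a Writable that is used" and "other core code that is never touched"; the sketch shows lazy initialization, which is the observable side of classes not being pulled in until something refers to them.

```java
import java.util.ArrayList;
import java.util.List;

public class LazyLoadingDemo {

    /** Records which nested classes have run their static initializer. */
    static final List<String> INITIALIZED = new ArrayList<>();

    /** Hypothetical stand-in for a Writable class that the application uses. */
    static class UsedWritable {
        static {
            INITIALIZED.add("UsedWritable");
        }

        static int encode(int value) {
            return value + 1;
        }
    }

    /** Hypothetical stand-in for code in the same jar that is never referenced at runtime. */
    static class UnusedClusterCode {
        static {
            INITIALIZED.add("UnusedClusterCode");
        }

        static void run() {
            // Never called in this demo.
        }
    }

    public static void main(String[] args) {
        // Only UsedWritable is touched, so only its initializer runs.
        UsedWritable.encode(1);
        System.out.println(INITIALIZED); // prints [UsedWritable]
    }
}
```

This is why a missing dependency of an unused class does not show up at runtime, and why the disagreement in this thread is about build-time transitive dependencies rather than about what actually gets loaded.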

        It's something to do with Maven output? But what real problem is that causing... the availability of dependencies doesn't harm anything. It makes the job file bigger. But are you deploying in a case where a couple megs in a jar file matters? The only case I've seen where it matters is mobile apps, these days, and you say you don't want to use Proguard. Ted's indicating it doesn't save much.

        Why is <exclude> so bad, this seems like what it's for. core has 12 third-party dependencies, and won't move much. That's not so bad even if you wanted to exclude each one. You could create your own (internal) artifact that is just "core, stripping the dependencies we don't want" that everyone can depend on.

        mahout-math doesn't depend on mahout-core. I think you're proposing a circular dependency here, which is possible. But that is symptomatic of the difference. I suppose you can start looking at severing more dependencies and breaking out even more sub-modules; now users need to figure out which of 3, 4, 5 jars are needed.

        I don't doubt this solves your problem, just asking whether it solves a more general need, since it is going to create a small amount of additional work for all other consumers of core.

        Or: don't we actually need some code surgery around Cluster to accomplish what you want anyway? Otherwise the new module ends up depending on core.

        Jake Mannix added a comment -

        I'm not proposing that mahout-math depend on mahout-core. Where did I say that? mahout-core depends on mahout-math depends on mahout-collections. I'm suggesting we have mahout-core depend on both mahout-writables and mahout-math which depend on mahout-collections.
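A minimal sketch of the proposed layout as a dependency graph, to check two things: that it contains no cycle, and that mahout-writables would not reach mahout-core transitively. The edges are an assumption pieced together from this thread (the writables prototype lists mahout-math as a dependency); they are not read from any actual pom.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ModuleGraph {

    /** Assumed direct dependencies between modules, as described in the thread. */
    static final Map<String, List<String>> DEPS = new LinkedHashMap<>();

    static {
        DEPS.put("mahout-core", Arrays.asList("mahout-writables", "mahout-math"));
        DEPS.put("mahout-writables", Arrays.asList("mahout-math", "mahout-collections"));
        DEPS.put("mahout-math", Arrays.asList("mahout-collections"));
        DEPS.put("mahout-collections", Collections.<String>emptyList());
    }

    /** Returns everything reachable from the module, excluding the module itself unless a cycle leads back to it. */
    static Set<String> transitiveDeps(String module) {
        Set<String> seen = new TreeSet<>();
        Deque<String> pending = new ArrayDeque<>(DEPS.get(module));
        while (!pending.isEmpty()) {
            String next = pending.pop();
            if (seen.add(next)) {
                pending.addAll(DEPS.getOrDefault(next, Collections.<String>emptyList()));
            }
        }
        return seen;
    }

    /** A module that can reach itself is part of a cycle. */
    static boolean hasCycle() {
        for (String module : DEPS.keySet()) {
            if (transitiveDeps(module).contains(module)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("cycle: " + hasCycle());
        System.out.println("writables needs: " + transitiveDeps("mahout-writables"));
    }
}
```

Under these assumed edges the graph is acyclic and a consumer of mahout-writables would see only mahout-math and mahout-collections among the Mahout modules. That only holds if no class left in the writables module still imports something from core, which is the Cluster question raised earlier.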

        So in theory, yes, putting in a bunch of <exclude> entries for every dep in core that isn't used can work. But it is ugly, and the writable package, if it existed, could be depended on by other open source projects which want to be wire compatible with us. Case in point: elephant-bird is one of Twitter's open source Hadoop utils projects. It doesn't want to depend on all of Mahout, but would like to be able to load Mahout VectorWritables etc., and then turn those into, say, a Pig script.

        Sean Owen added a comment -

        That's not what I meant – you were drawing a comparison to mahout-math vs mahout-core. I was saying it didn't seem like quite the same thing, since as I understand the change, the new module still depends on core. Do I misunderstand, since if true, this really wouldn't change anything? I thought Ted was pointing out that to actually make headway, and cut the pointer to core, there is additional code surgery needed around Cluster.

        I guess I am still missing what's wrong with "depending on all of mahout-core". Have you seen the tree that Hadoop brings in – has it ever mattered?
        I know I am asking a dumb question, but I am still not clear: is it the size of a jarred up file of all transitive dependencies that is at issue? But forget the question of whether it matters; it doesn't matter to me but wouldn't mean I would object to such a change if even a few people wanted it.

        My real question is just whether this is solving the problem it's supposed to solve. If the question is one of run-time dependencies, this change will not make any difference, so I would not see a reason to make it. If it's a question of Maven/compile-time dependency, then as I understand this still doesn't solve something due to a lingering dependence on core via cluster. (I may misunderstand.) In which case I would merely say there needs to be ground-work done, that hasn't been done, and that's what should be posted as a patch and discussed next!

        Ted Dunning added a comment -

        Let me see what might be done. Currently, I do things the lazy way and
        build the jar-with-dependencies without an explicit assembly. As it is,
        there is therefore less expressivity available than there might be with
        respect to things like excludes, but I don't know the assembly plugin
        all that well.

        I will see what is easy to do.

        On Mon, Dec 19, 2011 at 3:59 PM, Jake Mannix (Commented) (JIRA) <

        Jake Mannix added a comment -

        Yes, I meant that mahout-writables would not depend on mahout-core. If that requires further headway around Cluster (or just leaving the ClusterWritables back in core, and pulling them out later when the surgery is complete), then so be it.

        The dependency on hadoop is huge, yes, but if we're running on hadoop (which would be the case if you have mahout-writable as the package in question), then you already depend on that, that's a given.

        It is not the question of jar size in MB which matters here, no. The question is of runtime dependencies, and I guess we're just misunderstanding each other because I'm not pushing on the original git branch Ted made, but instead on the end goal of what would happen once Cluster was removed. Yes, the work that should be patched next, in my view, is to post what you get if you pull out all of the easy *Writables (i.e. everything except Cluster, I guess?) as a first pass, leaving Cluster back in core.

        I would personally think that was a positive first step, a) creating a place for writables to go, moving forward, and b) providing a dependency which knew how to deal with many of the common serialized objects of mahout. Step 2 would be to work further iterations around getting all remaining Writables out of core and into this new package.

        I don't think Step 1 and 2 need to be done at the same time, however.

        Ted Dunning added a comment -

        On Mon, Dec 19, 2011 at 8:41 PM, Jake Mannix (Commented) (JIRA) <

        There are also EntityCountWritable, EntityEntityWritable,
        EntityPrefWritable, EntityPrefWritableArrayWritable,
        RecommendedItemsWritable, PrefAndSimilarityColumnWritable,
        VectorAndPrefsWritable, VectorOrPrefWritable.

        The dependency on hadoop is huge, yes, but if we're running on hadoop

        I think that including core but not hadoop might do the trick even so.
        Suddenly it occurs to me that the right way to deal with this is to use
        the provided scope.

        I only used the jar size in MB as a measure of how large the transitive
        dependencies actually are.

        Ted Dunning added a comment -

        OK. Putting hadoop in as "provided" reduces the size of all of the
        dependencies to 3.8MB. Eliminating slf4j drops this to 3.7MB. Eliminating
        mahout-math drops this to 60KB.

        Ergo, mahout-math is by far the tall pole and roughly 3.8MB is the
        reasonable minimum for the transitive dependencies. This is not all that
        bad and is a lot better than the 20MB that we started with.
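
For reference, a minimal sketch of what marking Hadoop as "provided" could look like in a Maven pom.xml. The coordinates are the hadoop-core entry from the dependency list earlier in this issue; where the declaration would actually sit in Mahout's build (the parent pom or a module pom) is an assumption and is not shown here.

```xml
<!-- Sketch only: hadoop-core declared with provided scope. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.204.0</version>
  <!-- provided: on the compile and test classpath, but not packaged
       and not passed on to downstream projects as a transitive dependency -->
  <scope>provided</scope>
</dependency>
```

With this scope, Hadoop and everything it pulls in stay out of the transitive dependency set, which is the effect measured above.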

        On Mon, Dec 19, 2011 at 9:39 PM, Ted Dunning (Commented) (JIRA) <

        Sean Owen added a comment -

        Ah, there's a 'provided' scope? That would be great since no use case we support involves "bringing our own" Hadoop. It's needed to compile only. That would probably reduce the job jar size too, and probably avoid some problems.

        So just that one change to core makes it that much smaller? Surely 3.8MB and a few transitive dependencies is pretty OK to bring in to anything?

        Ted Dunning added a comment -

        Jake,

        Can you confirm that changing Hadoop to provided solved this for you?

        I would like to mark this as fixed.

        Jake Mannix added a comment -

        So marking hadoop as provided is nice, and a smaller jar is great, but as I mentioned above, the size was never my primary concern; it was the dependency graph. It's really nice that mahout-math is a nice little non-hadoop-depending package which just does stats, linear algebra, and ML, and which doesn't have to think about hadoop stuff, even at compile time. -core is big, because it's what mahout "is". What I have been wanting is something a little in between: a module that depends on hadoop (but with provided scope) and on mahout-math, and that has the writables, so that someone can work with mahout data inputs/outputs without actually linking to -core.

        Essentially, it's the distinction between a "mahout-api" and a "mahout-impl" package. Since our "API" is the file format, the "mahout-api" module is really just the set of writables needed to marshal/unmarshal our binary data.
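
To make the "API is the file format" point concrete, here is a rough sketch of the marshal/unmarshal round trip such a module would need to support. `PrefRecord` is a hypothetical stand-in, not a Mahout class, and its field layout is an assumption rather than the encoding any real Mahout writable uses. It follows the `write`/`readFields` shape of Hadoop's `Writable` contract but relies only on `java.io`, so it runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type for illustration only; not part of Mahout.
final class PrefRecord {
  long itemId;
  float pref;

  PrefRecord() {}

  PrefRecord(long itemId, float pref) {
    this.itemId = itemId;
    this.pref = pref;
  }

  // Same shape as Writable.write(DataOutput): fields in a fixed order.
  void write(DataOutput out) throws IOException {
    out.writeLong(itemId);
    out.writeFloat(pref);
  }

  // Same shape as Writable.readFields(DataInput): read back in the same order.
  void readFields(DataInput in) throws IOException {
    itemId = in.readLong();
    pref = in.readFloat();
  }

  // Marshal a record to bytes (8-byte long followed by 4-byte float).
  static byte[] toBytes(PrefRecord record) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      record.write(out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  // Unmarshal a record from bytes produced by toBytes.
  static PrefRecord fromBytes(byte[] bytes) {
    PrefRecord record = new PrefRecord();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
      record.readFields(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return record;
  }
}
```

A consumer that only needs to read or write such records would depend on the module holding classes like this one, and not on the algorithms that produce them.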

        Sebastian Schelter added a comment -

        Closing this issue, as there has been no activity for more than half a year. The topic is still important, though; if someone wants to restart work on it, this issue can be reopened.


          People

          • Assignee:
            Unassigned
            Reporter:
            Ted Dunning