Mahout / MAHOUT-537

Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.4, 0.5
    • Fix Version/s: None
    • Component/s: Math
    • Labels: None

      Description

      Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.
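
      For illustration, the migration amounts to roughly the following (job and mapper names are placeholders, not actual Mahout classes; the FileInputFormat/FileOutputFormat in each half come from the mapred and mapreduce.lib packages respectively):

```java
// Old, deprecated style: org.apache.hadoop.mapred.* -- configuration and
// job submission are fused into a single JobConf object.
JobConf jobConf = new JobConf(ExampleJob.class);
jobConf.setMapperClass(ExampleMapper.class);
FileInputFormat.setInputPaths(jobConf, new Path(input));
FileOutputFormat.setOutputPath(jobConf, new Path(output));
JobClient.runJob(jobConf);

// New 0.20.2 style: org.apache.hadoop.mapreduce.* -- Configuration holds
// the settings, and a separate Job object drives execution.
Configuration conf = new Configuration();
Job job = new Job(conf, "example");
job.setJarByClass(ExampleJob.class);
job.setMapperClass(ExampleMapper.class);
FileInputFormat.addInputPath(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
job.waitForCompletion(true);
```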

      Attachments

      1. MAHOUT-537_hack.patch (155 kB, Shannon Quinn)
      2. MAHOUT-537.patch (43 kB, Shannon Quinn)
      3. MAHOUT-537.patch (45 kB, Shannon Quinn)
      4. MAHOUT-537.patch (30 kB, Shannon Quinn)
      5. MAHOUT-537.patch (27 kB, Shannon Quinn)

        Activity

        Shannon Quinn added a comment -

        This first patch fixes most of the DistributedRowMatrix compliance issues, bringing TimesJob, TimesSquaredJob, and TransposeJob in line with the new Hadoop 0.20.2 API, most notably by eliminating dependence on the deprecated JobConf object in favor of Job and Configuration. DistributedRowMatrix has also been modified to accept a Configuration object in its configure() method. All that remains is to fix the times(DRM) method, which will be tricky given its dependence on CompositeInputFormat, which is conspicuously absent from the latest Hadoop release with no analogous type to replace it. Will update.

        Shannon Quinn added a comment -

        Something worth discussing: Hadoop just released version 0.21.0, which re-includes the updated CompositeInputFormat that was missing in 0.20.2 and deprecated in 0.18. I'm going to install 0.21 and see if tests pass on the trunk, and provided they do, I'm wondering if I should go ahead and implement this patch using Hadoop 0.21. Any thoughts?

        Jeff Eastman added a comment -

        Tried out the patch. It applied cleanly and, after adding a throws declaration, compiles. It still has some commented-out old code that could be removed, but otherwise looks reasonable. You are making progress on a tough problem. This is still a WIP, of course, and a few unit tests are failing. It will be interesting to hear about your 0.21 experiments.

        Shannon Quinn added a comment -

        Updated patch. Fixes from previous patch are included, this time merged with unrelated changes to the related files. Also removed all the commented-out old code, and even caught and fixed a few bugs. Fully implemented timesSquared(). All that remains is the times(DRM) job. Will update on this very soon.

        (regarding the previous comments on this ticket: I'm using Hadoop 0.20.2)

        Shannon Quinn added a comment -

        Matrix-matrix multiplication is now implemented, via somewhat of a hack job with the NamedVector class and corresponding m/r job. This hasn't been tested yet (that is the next step). It does compile, so in theory all that remains is to adjust the DRM unit tests to accommodate Hadoop 0.20.2.
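
        For a sense of what the NamedVector tagging looks like, here is a minimal sketch (the mapper class and configuration key are invented for illustration, not the patch's actual code):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

// Tags every row with the name of the matrix it came from, so a downstream
// reducer can tell multiplicand rows from multiplier rows.
public class TagRowsMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  private String tag;

  @Override
  protected void setup(Context ctx) {
    // "drm.matrix.tag" is a made-up configuration key for this sketch
    tag = ctx.getConfiguration().get("drm.matrix.tag");
  }

  @Override
  protected void map(IntWritable row, VectorWritable v, Context ctx)
      throws IOException, InterruptedException {
    ctx.write(row, new VectorWritable(new NamedVector(v.get(), tag)));
  }
}
```

        Joris's point below about per-row overhead follows directly from this: the tag String travels with every single row.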

        Joris Geessels added a comment -

        The matrix-matrix multiplication seems like an ugly hack to me; I'm actually in favor of keeping the old API until we can switch to 0.21.0.
        Some remarks:
        1) I didn't test the code either, but couldn't spot any obvious errors, so it seems to me that it should work.
        2) This implementation uses 3 M/R jobs where the original one has only 1. I agree that the first two jobs are very basic operations, but for performance's sake it's still better to keep the number of jobs low. I'm almost 100% certain that this implementation will be slower than the original one (though I have no idea how much slower; it would be interesting to know).
        3) Every row of the DRM now has an extra String variable to store and send. Especially when the matrix is very sparse, this will result in substantial overhead.
        4) The MatrixMultiplicationReducer receives a NamedVectorWritable, but there's no reason for this. It would be better to use a plain VectorWritable.

        If we insist on compliance with 0.20.2, it might be interesting to have a look at:
        http://homepage.mac.com/j.norstad/matrix-multiply/index.html
        This implementation avoids the use of CompositeInputFormat by checking the current input path in setup().
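
        That input-path trick boils down to something like this in the new API (the field and configuration key are invented for the sketch):

```java
// Inside a mapper whose job takes both matrices' files as input: decide at
// setup time which matrix this task's split belongs to, in place of a join.
private boolean readingMatrixA;

@Override
protected void setup(Context ctx) throws IOException {
  FileSplit split = (FileSplit) ctx.getInputSplit();
  String inputFile = split.getPath().toString();
  // "matrixA.path" is a hypothetical key the driver would set
  readingMatrixA = inputFile.contains(ctx.getConfiguration().get("matrixA.path"));
}
```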

        Some more general remarks: I think the matrix multiplication can be implemented more efficiently. I've done a matrix multiplication of a sparse 500k x 15k matrix with around 35 million elements on a quite powerful cluster of 10 nodes, and this took around 30 minutes. I have no idea of the performance of the implementation described at http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't really compare. But IMHO this can be improved (though it's possible that the poor performance was due to mistakes on my part).

        Sean Owen added a comment -

        I think this is a great effort. I think it's essential that the project remain attached to 0.20.2 at the moment, because I believe many people will want to use it with Amazon EMR, which is on 0.20.2. We still have some stuff written for 0.19.x, and it's a higher priority to move off that than onto 0.21.x, I think. Complicating this is the fact that 0.21.x is not backward compatible with 0.20.x.

        NamedVector is already supported in VectorWritable; do we need a new Writable?

        Is the issue that you are doing joins? Without CompositeInputFormat it's still possible, and we use the pattern elsewhere. You need some cleverness with a custom key and partitioner that will send key x from source A and key x from source B to the same reducer, while keeping a bit inside the key that indicates whether it came from A or B.
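
        A rough sketch of that pattern, with invented names (the grouping side is sketched further down the thread):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.mahout.math.VectorWritable;

// Composite key: the join index plus a tag saying which input a record came
// from. Sorting on (index, source) puts A-records ahead of B-records.
public class TaggedKey implements WritableComparable<TaggedKey> {
  int index;    // the row/column index being joined on
  byte source;  // 0 = matrix A, 1 = matrix B

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(index);
    out.writeByte(source);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    index = in.readInt();
    source = in.readByte();
  }

  @Override
  public int compareTo(TaggedKey o) {
    if (index != o.index) {
      return index < o.index ? -1 : 1;
    }
    return source - o.source;
  }
}

// Partition on the index alone, so key x from source A and key x from
// source B land on the same reducer regardless of the tag. (In real code
// this would be a public class in its own file.)
class IndexPartitioner extends Partitioner<TaggedKey, VectorWritable> {
  @Override
  public int getPartition(TaggedKey key, VectorWritable value, int numPartitions) {
    return (key.index & Integer.MAX_VALUE) % numPartitions;
  }
}
```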

        Shannon Quinn added a comment -

        Attached is the patch without the custom Writable I wrote, instead using NamedVector.

        It seems (to me) that there are two options for eliminating the two extra M/R tasks I had to create in lieu of the CompositeInputFormat's joins:

        1) Have each row of a DistributedRowMatrix labeled when it is first created. Since DRM isn't much more than a glorified wrapper, its constructor can't implement something like this, so this would be infeasible from a scope perspective.
        2) Guarantee the ordering of two given rows in the Iterable object of a Combiner/Reducer, so we know one of them belongs to the multiplicand, the other to the multiplier.

        Option #2 seems the most technically feasible; however, my limited understanding of the inner workings of Hadoop prevents me from knowing where to start. I've taken a look at Partitioner, RecordReader, and various InputFormats, and they haven't given me any intuition. Any thoughts on how to do this? Or another method entirely?
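
        Option #2 is essentially the secondary-sort pattern. Picking up the TaggedKey sketch from Sean's comment above, a grouping comparator gives exactly that ordering guarantee (names still invented):

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Groups reduce() calls on the index alone, ignoring the source tag.
// Registered via job.setGroupingComparatorClass(IndexGroupingComparator.class).
public class IndexGroupingComparator extends WritableComparator {
  protected IndexGroupingComparator() {
    super(TaggedKey.class, true);
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    int ia = ((TaggedKey) a).index;
    int ib = ((TaggedKey) b).index;
    return ia < ib ? -1 : (ia == ib ? 0 : 1);
  }
}

// Inside the join reducer: because keys still *sort* by (index, source),
// the matrix-A row is guaranteed to be the first value in the Iterable.
@Override
protected void reduce(TaggedKey key, Iterable<VectorWritable> values, Context ctx)
    throws IOException, InterruptedException {
  Iterator<VectorWritable> it = values.iterator();
  Vector aRow = it.next().get();      // from matrix A (tag 0)
  while (it.hasNext()) {
    Vector bRow = it.next().get();    // from matrix B (tag 1)
    // ... accumulate partial products of aRow and bRow here
  }
}
```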

        Sean Owen added a comment -

        This seems to me like part of one of the most crucial tasks for the next release: deprecating and then removing or fixing anything not using the newer Hadoop APIs. Shannon, did you reach a point where you can commit? I'd strongly encourage you to finish this migration; it's great and important work.

        Shannon Quinn added a comment -

        The patch has been ready to go since I posted it, but our original consensus based on the limitations of 0.20 (which haven't changed) is what kept this patch in limbo: namely, that 0.20 conveniently leaves out a crucial data type, the absence of which means the matrix-matrix multiplication requires 3 M/R passes, whereas 0.18 and 0.21 (where this type is present) need only 1.

        In your last post, however, you alluded to some cleverness in doing joins and customizing the partitioner that I never got the details on. Would you mind expounding on that? I scoured every 0.20 format type and type manager I could find and didn't see anything promising, so your more experienced perspective would be most helpful.

        Sean Owen added a comment -

        I could, though honestly, I think the better solution at this point is to move to Hadoop 0.21 as part of the next release. It is the current release and nearly superseded by 0.22. It has some features we need to move forward. It is closer to what many are using in CDH3/4. The only drawback I see is that Amazon EMR is on 0.20.2. However, we're releasing 0.5 now for 0.20.2. And it is 6 months until we would put out a release needing 0.21, by which time I imagine 0.22 will be out and EMR will make 0.21 available; if it doesn't, we will have to leave that support behind.

        So let me open an item for that, and I suggest you can proceed using 0.21 features here.
        (That is what I am doing for personal projects and it really simplified things. I'm on 0.22 now myself.)

        Jake Mannix added a comment -

        Does 0.21 bring back map-side joins and multiple outputs? We don't use 0.21 in production at Twitter, and I know tons of other places who haven't migrated up yet either.

        I think we should probably have a more in-depth discussion about which Hadoop releases we support.

        Dmitriy Lyubimov added a comment - edited

        Second Jake.

        "I think the better solution at this point is to move to Hadoop 0.21 as part of the next release."

        -1 on this yet. (If I recall correctly, Ted had concerns about this move as well.)

        At the risk of sounding like a stuck record, nobody I know is using 0.21. 0.21 is not production grade, which was recognized even by the Hadoop team.

        It is true that 0.21 is a superset of CDH, but it potentially has stuff CDH doesn't have, so using 0.21 does not guarantee everything will work with CDH, and it almost certainly guarantees nothing will work for bulk stuff on EMR.

        We use both EMR and CDH. If you puff up the dependencies, as things are now, it will absolutely preclude us from using further versions of Mahout. I probably could maneuver some code that we use with CDH to verify it still works with CDH, but not en masse. If I really wanted to use some of the migrated algorithms and take advantage of various fixes, I would have to create massive private hacks to keep things working (similar to what Cloudera does), which we probably don't have the capacity to do, so I'll just have to stop using trunk or future Mahout distributions until better times.

        I know for sure we will never use 0.21 the way it is released.

        There's probably more hope for the new generation of Hadoop, which would combine the ability to run old MR, new MR, or something else. In fact, I am looking forward to porting to and using that future Hadoop generation, as it would allow us to scrap many unnecessary limitations of MR for parallel use that are holding back performance on many algorithms (especially linear algebra algorithms).

        Dmitriy Lyubimov added a comment -

        "Does 0.21 bring back map-side joins and multiple outputs? We don't use 0.21 in production at Twitter, and I know tons of other places who haven't migrated up yet either."

        Yes, I believe it does, but not in a way quite compatible with the old spec, and they are not in CDH.

        They also dropped some support for good, I think (such as MultipleOutputFormat), and incorporated those capabilities into MultipleOutputs, which makes some code upgrades a little more involved than a simple class name change.
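
        For reference, the 0.21-style replacement looks roughly like this (class and output names are hypothetical); the old approach of subclassing MultipleOutputFormat has no direct equivalent:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Driver side: declare each named output up front.
//   MultipleOutputs.addNamedOutput(job, "debug",
//       TextOutputFormat.class, LongWritable.class, Text.class);

// Reducer side: write through a MultipleOutputs handle, instead of
// overriding generateFileNameForKeyValue() as MultipleOutputFormat allowed.
public class ExampleReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

  private MultipleOutputs<LongWritable, Text> mos;

  @Override
  protected void setup(Context ctx) {
    mos = new MultipleOutputs<LongWritable, Text>(ctx);
  }

  @Override
  protected void reduce(LongWritable key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    for (Text value : values) {
      mos.write("debug", key, value);  // routed to the "debug" named output
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    mos.close();
  }
}
```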

        Sean Owen added a comment -

        The people have spoken! Forget 0.21. At best I think we will wait and see on 0.22.

        I think the much larger concern in my mind is being as consistent as possible across the project in how implementations are approached. Within 0.20.x we can stand to be more consistent: ideally not using the deprecated APIs, and using them only if there's a very good reason.

        The clustering code, for example, is still needlessly different, and is going to be deprecated as a result.

        So I think the outcome from this issue is... try to make it as similar as possible to the other M/R jobs in the project? Most everything tries to use the imperfect AbstractJob thing, which is a good rallying point. I am not sure how realistic that is here, but it would be great to standardize more.

        Shannon Quinn added a comment - edited

        Ok, this is absolutely a total hack job, but I wanted to see if it would work: taking the 0.21 mapreduce.lib.join* package, tweaking it slightly to make it 0.20-compatible, and installing it directly in Mahout to make DistributedRowMatrix 0.20-compliant.

        It and the associated tests compile, but I've run into failing tests, the cause of which seems to be that it won't write files to the DistributedCache, HDFS, etc. I tried writing to the DistributedCache and immediately reading it back, which worked fine, but that didn't exactly inform me as to why it can't be read within the Mapper. So otherwise I'm stuck and could use some help.
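
        For comparison, the canonical 0.20 round trip looks like this (the path is invented); a classic trip-up is trying to read the original HDFS URI inside the Mapper instead of the local cached copy:

```java
// Driver side: the file must already be on HDFS at submission time.
Configuration conf = job.getConfiguration();
DistributedCache.addCacheFile(new Path("/tmp/example/vector.seq").toUri(), conf);

// Mapper side, in setup(): read the *local* cached copy that the framework
// pulled down, not the original HDFS path.
Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
SequenceFile.Reader reader = new SequenceFile.Reader(
    FileSystem.getLocal(context.getConfiguration()),
    cached[0],
    context.getConfiguration());
```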

        If this isn't an avenue worth pursuing, that's also fine. I had the idea and wanted to give it a shot before throwing in the towel and waiting for 0.22.

        Sean Owen added a comment -

        I think it's a great effort. Looks like you had to copy in a number of Hadoop classes and still are facing some problems. It may be a hard road to go down.

        We're "officially" on 0.20.203.0 at the moment, and in my mind, the essence of this issue is seeing if there's any way to use the .mapreduce. rather than deprecated .mapred. APIs in 0.20.x at least, or, reuse more of AbstractJob for consistency (improving it as needed).

        Do you see any scope for those types of changes? If the informed opinion is just that this isn't going to be meaningfully possible, I say close this issue.

        Shannon Quinn added a comment -

        I thought it would be simpler. Granted, I know very little about how HDFS works, so I'm not sure what's causing the problems or how to fix them. The fact that nothing written can be read back later (tests come back with 0 values or empty lists, or files simply don't exist where the Configuration says they should) seems like it should be an easy fix, but I don't know where to start.

        The next best thing to this approach was to more or less mimic the bare necessities of these dependencies in custom implementations, something I don't yet have the expertise for. I was hoping this would serve only as a holdover until 0.22+, when the dependencies are officially re-included, but in the meantime it would enable us to move entirely off 0.18.

        Again, it was just a wild idea and I wanted to see if it would work. Still want to see if it will work, in fact.

        Sean Owen added a comment -

        Likewise, I think it's time to give up on this one. It doesn't seem feasible to change this code.


  People

  • Assignee: Shannon Quinn
  • Reporter: Shannon Quinn
  • Votes: 0
  • Watchers: 2
