Details

      Description

      I suggest we move to Hadoop 0.20.203.0 for the next release (not 0.21 or later). It is a much more recent release on the 0.20.x branch, and our code is already compile-time compatible with it as well as with 0.20.2.

      However, I know already that switching to it causes some failures, in the Lanczos jobs for instance. It looks like something is expecting a file somewhere that isn't where it used to be. I bet it's an easy fix, but I don't know what it is yet.

        Activity

        Hudson added a comment -

        Integrated in Mahout-Quality #901 (See https://builds.apache.org/job/Mahout-Quality/901/)
        MAHOUT-708 update to Hadoop 0.20.203.0, which just entailed better logic to ignore the new _SUCCESS files. The result still works on 0.20.2.

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1139072
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/iterator/sequencefile/SequenceFileDirIterable.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/ga/watchmaker/OutputUtils.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java
        • /mahout/trunk/pom.xml
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/iterator/sequencefile/SequenceFileDirIterator.java
        • /mahout/trunk/core/pom.xml
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/iterator/sequencefile/SequenceFileDirValueIterable.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/eval/ParallelFactorizationEvaluator.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/iterator/sequencefile/SequenceFileDirValueIterator.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/iterator/sequencefile/SequenceFileIterable.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/DocumentProcessorTest.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/iterator/sequencefile/SequenceFileValueIterable.java
        Sean Owen added a comment -

        So, updating to 0.20.203.0 was almost painless. There were two problems.

        Hadoop 0.20.203.0 depends on Jackson from Codehaus, but doesn't declare it in the POM. So we had to add that dependency manually.
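        For reference, the added dependency would look something like the following in core/pom.xml. The group and artifact IDs are the standard Codehaus Jackson coordinates; the version shown is illustrative only, not taken from this issue, and should be matched to whatever Hadoop 0.20.203.0 actually bundles.

        ```xml
        <!-- Jackson (Codehaus): needed at runtime by Hadoop 0.20.203.0
             but not declared in Hadoop's own POM, so we declare it here.
             Version is illustrative; align it with the Hadoop distribution. -->
        <dependency>
          <groupId>org.codehaus.jackson</groupId>
          <artifactId>jackson-core-asl</artifactId>
          <version>1.0.1</version>
        </dependency>
        <dependency>
          <groupId>org.codehaus.jackson</groupId>
          <artifactId>jackson-mapper-asl</artifactId>
          <version>1.0.1</version>
        </dependency>
        ```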

        And it also writes _SUCCESS files in output dirs. Many bits of code and tests still didn't correctly filter these out; this was easy to fix. Incidentally, this ought to fix some problems people see on CDH, which has the same behavior.
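        The filtering logic amounts to skipping any file whose name starts with an underscore or a dot. A minimal self-contained sketch of that predicate, using plain java.io rather than Hadoop's PathFilter interface so it runs without Hadoop on the classpath (the class and field names here are hypothetical, not from the actual patch):

        ```java
        import java.io.File;
        import java.io.FilenameFilter;

        public class SuccessFileFilterDemo {
          // Skip _SUCCESS, _logs, and hidden .crc files; keep real part-* outputs.
          // Hadoop's PathFilter version applies the same name test to a Path.
          static final FilenameFilter PARTS_ONLY = new FilenameFilter() {
            @Override
            public boolean accept(File dir, String name) {
              return !name.startsWith("_") && !name.startsWith(".");
            }
          };

          public static void main(String[] args) {
            String[] names = {"part-00000", "_SUCCESS", "_logs", ".part-00000.crc"};
            for (String n : names) {
              System.out.println(n + " -> " + PARTS_ONLY.accept(new File("."), n));
            }
          }
        }
        ```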

        I should stress that the resulting code is still entirely compatible with Hadoop 0.20.2. We haven't really raised the requirements.

        Hudson added a comment -

        Integrated in Mahout-Quality #848 (See https://builds.apache.org/hudson/job/Mahout-Quality/848/)

        Dmitriy Lyubimov added a comment -

        -1 in general.

        Most folks are either on 0.20.2 (EMR) or CDH3 (bare metal). I know of no one using 0.21. I am not sure that using the 0.21 new API will be 100% compatible with CDH3; there are still some missing pieces there. So if you move, you may have me locked in at 0.5, since I am a CDH3 user (and EMR for bigger trains).

        What I think might be reasonable is to create a branch with CDH3 dependencies and make sure all tests are passing (I saw 2 or 3 not passing), although generally everything compiles with CDH3. Then we would cover all the major camps out there with practically the same codebase.

        Yes, I am also waiting for the new Hadoop architecture to come out (I think they were saying mid-summer): a fundamental rewrite where task resources are separated from the concept of an application (i.e. MapReduce), and that would really be great. That would be a worthy update.

        -d

        Ted Dunning added a comment -

        Frankly, I think that the next version of Hadoop that provides any compelling features for large enterprises is likely to be 0.23, where MR nextgen comes into play. Ironically, the major reason that version will be exciting is that it allows for compatibility with old APIs. The ability to have old and new side by side is a prerequisite for any large cluster upgrade.

        Sean Owen added a comment -

        Let's sit on this a while longer, then. We should at least get onto 0.20.203.
        Yes, it brings back joins and multiple outputs, which is 80% of the reason we'd want it.
        I'm on 0.22, and it makes it possible to build a recommender pipeline half as complex and about 5x as fast. That's big for some machine learning apps.

        Jake Mannix added a comment -

        I'm open to being convinced, but as stated on the list, I'm starting from a position of -1 on this.

        For the same reason that Lucene didn't move to Java 5 until last year, we may be stuck with mixed 0.18/0.20+ APIs for quite some time.

        Anyone running their own cluster can certainly upgrade to 0.22, but production systems, by and large, are far more conservative. They have also often been bitten by Hadoop upgrades (0.19, anyone?) in the past, and so are really loath to just jump onto the newest "blessed" release.

        If someone can convince me that everyone has moved on to 0.21+, then sure. But so far, that is not my experience, at all.

        Shannon Quinn added a comment -

        Just like this! (re: my last email)


          People

          • Assignee: Sean Owen
          • Reporter: Sean Owen
          • Votes: 1
          • Watchers: 4
