Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: Classification
    • Labels:

      Description

      Port our Naive Bayes implementation to the new spark dsl. Shouldn't require more than a few lines of code.

      1. MAHOUT-1493.patch
        5 kB
        Christoph Viebig
      2. MAHOUT-1493.patch
        1 kB
        Sebastian Schelter
      3. MAHOUT-1493.patch
        5 kB
        Sebastian Schelter
      4. MAHOUT-1493.patch
        3 kB
        Sebastian Schelter
      5. MAHOUT-1493a.patch
        4 kB
        Andrew Palumbo

        Issue Links

          Activity

          Hide
          ssc Sebastian Schelter added a comment -

          preliminary patch. lacks preparation code and testing.

          Show
          ssc Sebastian Schelter added a comment - preliminary patch. lacks preparation code and testing.
          Hide
          dlyubimov Dmitriy Lyubimov added a comment - - edited

          I don't think you meant run() to return Unit.

          Also I am not sure using a class is justified.

          In most cases, i would favor dropping classes in favor of functions, albeit with fairly long parameter list but populaed with default values.

          The pattern i am following is to create a pithy and expressive name (such as ssvd()) for a function (in this case could be something like "trainNB") inside a scala object (singleton) and then re-translate that as top-level package function so one can say something like

          import decompositions._
          val nbmodel = trainNB(...)
          ...
          
          Show
          dlyubimov Dmitriy Lyubimov added a comment - - edited I don't think you meant run() to return Unit. Also I am not sure using a class is justified. In most cases, i would favor dropping classes in favor of functions, albeit with fairly long parameter list but populaed with default values. The pattern i am following is to create a pithy and expressive name (such as ssvd()) for a function (in this case could be something like "trainNB") inside a scala object (singleton) and then re-translate that as top-level package function so one can say something like import decompositions._ val nbmodel = trainNB(...) ...
          Hide
          dlyubimov Dmitriy Lyubimov added a comment -

          PS. it's an R naming style. R almost never exposes api as classes (and, frankly, R classes – even the latest generation – is embarassment compared to everything else in existence).

          Classes are usually needed if there's a state, and we already have that state as the bayes model object, don't we?

          Show
          dlyubimov Dmitriy Lyubimov added a comment - PS. it's an R naming style. R almost never exposes api as classes (and, frankly, R classes – even the latest generation – is embarassment compared to everything else in existence). Classes are usually needed if there's a state, and we already have that state as the bayes model object, don't we?
          Hide
          ssc Sebastian Schelter added a comment -

          Updated the patch according to Dmitriy's style suggestions.

          The only thing that is executed in a distributed fashion is the summation of the observations for each label. The rest of the training is done in memory on the driver. This should not be a problem as the data used by the local training has to fit in memory for the existing NaiveBayesModel anyway.

          The patch needs to be verified on a real example yet. I will do this in May probably. If someone wants to test it before that time, I'd be happy to help.

          Show
          ssc Sebastian Schelter added a comment - Updated the patch according to Dmitriy's style suggestions. The only thing that is executed in a distributed fashion is the summation of the observations for each label. The rest of the training is done in memory on the driver. This should not be a problem as the data used by the local training has to fit in memory for the existing NaiveBayesModel anyway. The patch needs to be verified on a real example yet. I will do this in May probably. If someone wants to test it before that time, I'd be happy to help.
          Hide
          dlyubimov Dmitriy Lyubimov added a comment -

          I d say we really need that shell finished. That would be a killer way to write both distributed tests and tutorial at the same time

          Show
          dlyubimov Dmitriy Lyubimov added a comment - I d say we really need that shell finished. That would be a killer way to write both distributed tests and tutorial at the same time
          Hide
          dlyubimov Dmitriy Lyubimov added a comment -

          Also – to finalize this – I think we need start doing blackbox packages outside the drm package. Maybe even outside sparkbindings package. At least idea was that drm package holds base math functions only – such as qr and svd. Well since PCA is very math-y , so PCA as well .

          NB is a classification method so we proabably need to start classification package outside drm. Ditto for other ML stuff.

          Show
          dlyubimov Dmitriy Lyubimov added a comment - Also – to finalize this – I think we need start doing blackbox packages outside the drm package. Maybe even outside sparkbindings package. At least idea was that drm package holds base math functions only – such as qr and svd. Well since PCA is very math-y , so PCA as well . NB is a classification method so we proabably need to start classification package outside drm. Ditto for other ML stuff.
          Hide
          ssc Sebastian Schelter added a comment -

          ok, where should the blackbox algorithms go then?

          Show
          ssc Sebastian Schelter added a comment - ok, where should the blackbox algorithms go then?
          Hide
          dlyubimov Dmitriy Lyubimov added a comment -

          "o.a.m.classification", "o.a.m.clustering" ... ? or that intersects with java code inconveniently?

          Show
          dlyubimov Dmitriy Lyubimov added a comment - "o.a.m.classification", "o.a.m.clustering" ... ? or that intersects with java code inconveniently?
          Hide
          ssc Sebastian Schelter added a comment -

          Updated patch to reflect the work from MAHOUT-1504

          Show
          ssc Sebastian Schelter added a comment - Updated patch to reflect the work from MAHOUT-1504
          Hide
          dlyubimov Dmitriy Lyubimov added a comment -

          convert to a PR?

          Show
          dlyubimov Dmitriy Lyubimov added a comment - convert to a PR?
          Hide
          cviebig Christoph Viebig added a comment - - edited

          As part of an university course homework at TU Berlin, Germany, we (a group of students) got the task to continue the work done on this issue. To achieve that we ported Sebastian's code to a recent mahout master revision to make it work again.
          Following changes were applied to Sebastian's patch right now:

          • StandardThetaTrainer removed
          • trainNB method parameter trainComplementary removed
          • method call dense was changed to org.apache.mahout.math.scalabindings method dense
          • method call colSums was changed to new MatrixOps(_).colSums
          • method call rowSums was changed to new MatrixOps(_).rowSums
          • weightsPerLabelAndFeature(labelIndex, :: ) }} was changed to {{weightsPerLabelAndFeature.viewRow(labelIndex) as it seems to be not supported by the java class

          We noticed that you refactored the code a lot in the mean time. Shall we move the patch code and the one of the used classes from mrlegacy to another location?

          Show
          cviebig Christoph Viebig added a comment - - edited As part of an university course homework at TU Berlin, Germany, we (a group of students) got the task to continue the work done on this issue. To achieve that we ported Sebastian's code to a recent mahout master revision to make it work again. Following changes were applied to Sebastian's patch right now: StandardThetaTrainer removed trainNB method parameter trainComplementary removed method call dense was changed to org.apache.mahout.math.scalabindings method dense method call colSums was changed to new MatrixOps(_).colSums method call rowSums was changed to new MatrixOps(_).rowSums weightsPerLabelAndFeature(labelIndex, :: ) }} was changed to {{weightsPerLabelAndFeature.viewRow(labelIndex) as it seems to be not supported by the java class We noticed that you refactored the code a lot in the mean time. Shall we move the patch code and the one of the used classes from mrlegacy to another location?
          Hide
          Andrew_Palumbo Andrew Palumbo added a comment - - edited

          Christoph Viebig, The patch looks good. I've made some edits (against your develop/issue/1493/4 branch) and will attach a M-1493a.patch shortly. I put the trainComplementary parameter back in as this is needed to make the distinction between Standard and Complementary Models in the NaiveBayesModel constructor.

          As well, I've added a thetaNormalizer var which can remain null when passed to the NaiveBayesModel constructor unless training a Complementary NB model. see:

          https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/NaiveBayesModel.java

          I'm not sure if creating a null var as I've done here is best practice in scala, but i wanted to give you an idea of the NaiveBayesModel design.

          As you've noted, there has been a lot of refactoring going on. As far as moving the code, I think that for now it might be a good idea to keep this in `spark` module, and move the `org.apache.mahout.sparkbindings.drm.classification` package out of `org.apache.mahout.sparkbindings.drm` and into a new `org.apache.mahout.classification` package. I believe that for now this would be a good place for it. There shouldn't be any need to move any of the java code from mrlegacy.

          Show
          Andrew_Palumbo Andrew Palumbo added a comment - - edited Christoph Viebig , The patch looks good. I've made some edits (against your develop/issue/1493/4 branch) and will attach a M-1493a.patch shortly. I put the trainComplementary parameter back in as this is needed to make the distinction between Standard and Complementary Models in the NaiveBayesModel constructor. As well, I've added a thetaNormalizer var which can remain null when passed to the NaiveBayesModel constructor unless training a Complementary NB model. see: https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/NaiveBayesModel.java I'm not sure if creating a null var as I've done here is best practice in scala, but i wanted to give you an idea of the NaiveBayesModel design. As you've noted, there has been a lot of refactoring going on. As far as moving the code, I think that for now it might be a good idea to keep this in `spark` module, and move the `org.apache.mahout.sparkbindings.drm.classification` package out of `org.apache.mahout.sparkbindings.drm` and into a new `org.apache.mahout.classification` package. I believe that for now this would be a good place for it. There shouldn't be any need to move any of the java code from mrlegacy.
          Hide
          pferrel Pat Ferrel added a comment -

          Andrew Palumbo can you make this a pull request? Put it in a branch on your github and hit the PR button so we can see the diffs and comment inline on github.

          Show
          pferrel Pat Ferrel added a comment - Andrew Palumbo can you make this a pull request? Put it in a branch on your github and hit the PR button so we can see the diffs and comment inline on github.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user andrewpalumbo opened a pull request:

          https://github.com/apache/mahout/pull/32

          MAHOUT-1493 Port Naive Bayes to Spark DSL (for review)

          Created this PR to review the patch provided by Christoph Viebig and team, Students of TU, Berlin. It is based on Sebastian's code from before the DSL was abstracted away from Spark.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/mahout/pull/32.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #32


          commit eb0b76e96e2003e0061e30d6e1fbfb4c677dc836
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-07-18T04:55:22Z

          applied MAHOUT-1493.patch by cviebig and team from TU Berlin

          commit a99909db863d486115893a79f5f57848edf16b93
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-07-18T04:59:24Z

          applied MAHOUT-1493a patch: edits to NaiveBayesModel call


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user andrewpalumbo opened a pull request: https://github.com/apache/mahout/pull/32 MAHOUT-1493 Port Naive Bayes to Spark DSL (for review) Created this PR to review the patch provided by Christoph Viebig and team, Students of TU, Berlin. It is based on Sebastian's code from before the DSL was abstracted away from Spark. You can merge this pull request into a Git repository by running: $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/mahout/pull/32.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #32 commit eb0b76e96e2003e0061e30d6e1fbfb4c677dc836 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-07-18T04:55:22Z applied MAHOUT-1493 .patch by cviebig and team from TU Berlin commit a99909db863d486115893a79f5f57848edf16b93 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-07-18T04:59:24Z applied MAHOUT-1493 a patch: edits to NaiveBayesModel call
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-49396941

          One of their first questions was where to move it (out of the o.a.m.sparkbindings.drm package). My suggestion, for now, was a new package in the spark module: o.a.m.classification.(naivebayes). mrlegacy keeps classification algo packages in o.a.m.classifier. If we wanted to keep consistent with naming (stick with classifiers v. classification) would this cause conflicts?

          The next step would be to create some scalatests for this. This should be relatively easy since most of the rest of mrlegacy NB has is either MR independent or has sequential options. So it should be somewhat straightforward to reproduce (some of) the mrlegacy tests.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-49396941 One of their first questions was where to move it (out of the o.a.m.sparkbindings.drm package). My suggestion, for now, was a new package in the spark module: o.a.m.classification.(naivebayes). mrlegacy keeps classification algo packages in o.a.m.classifier. If we wanted to keep consistent with naming (stick with classifiers v. classification) would this cause conflicts? The next step would be to create some scalatests for this. This should be relatively easy since most of the rest of mrlegacy NB has is either MR independent or has sequential options. So it should be somewhat straightforward to reproduce (some of) the mrlegacy tests.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-49397174

          Also, the MAHOUT-1493a patch was written by me to illustrate how the NaiveBayesModel constructor needed to be called- so don't hold any style points against them on this .

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-49397174 Also, the MAHOUT-1493 a patch was written by me to illustrate how the NaiveBayesModel constructor needed to be called- so don't hold any style points against them on this .
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user avati commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-49400739

          @andrewpalumbo , this code very much feels like it should be in math-scala/src/main/scala/org/apache/mahout/math/classification/*. There is nothing (yet) specific to DRM in this code, so we can keep it out of sparkbindings, at least this part of the algorithm.

          Show
          githubbot ASF GitHub Bot added a comment - Github user avati commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-49400739 @andrewpalumbo , this code very much feels like it should be in math-scala/src/main/scala/org/apache/mahout/math/classification/*. There is nothing (yet) specific to DRM in this code, so we can keep it out of sparkbindings, at least this part of the algorithm.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-49670252

          Agree with @avati, without looking at much else, NB definitely doesn't belong in `package org.apache.mahout.sparkbindings.drm.classification`. A better root perhaps is simply `o.a.m.classification`, even if it clashes with some legacy (java) packages ?

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-49670252 Agree with @avati, without looking at much else, NB definitely doesn't belong in `package org.apache.mahout.sparkbindings.drm.classification`. A better root perhaps is simply `o.a.m.classification`, even if it clashes with some legacy (java) packages ?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51404288

          So, can we move the package please?

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51404288 So, can we move the package please?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51405333

          Absolutely! I was just getting ready to do this and write some tests.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51405333 Absolutely! I was just getting ready to do this and write some tests.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51405454

          Am hoping to see how (short) the Spark and H2O tests compare

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51405454 Am hoping to see how (short) the Spark and H2O tests compare
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user avati commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51406299

          Actually I'm not sure if this would work against H2O, as the code is doing

          observationsPerLabel.map(new MatrixOps(_).colSums)

          which happens on RDD (and not on DRM, because of the implicit conversion). We would need to generic'ize that somehow.

          Show
          githubbot ASF GitHub Bot added a comment - Github user avati commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51406299 Actually I'm not sure if this would work against H2O, as the code is doing observationsPerLabel.map(new MatrixOps(_).colSums) which happens on RDD (and not on DRM, because of the implicit conversion). We would need to generic'ize that somehow.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51406631

          @andrewpalumbo can you perhaps comment on the code itself so we see what you are talking about?

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51406631 @andrewpalumbo can you perhaps comment on the code itself so we see what you are talking about?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user avati commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51406691

          Oops, I misread.. the map() happens on Array, my bad!

          I must admit I do not (yet) know how this code is working on DRM in a distributed way (i.e, to compare two backends)

          Show
          githubbot ASF GitHub Bot added a comment - Github user avati commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51406691 Oops, I misread.. the map() happens on Array, my bad! I must admit I do not (yet) know how this code is working on DRM in a distributed way (i.e, to compare two backends)
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r15908429

          — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala —
          @@ -0,0 +1,74 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.mahout.sparkbindings.drm.classification
          +
          +import org.apache.mahout.math.drm._
          +import org.apache.mahout.math.scalabindings
          +import org.apache.mahout.math.scalabindings._
          +import org.apache.mahout.classifier.naivebayes.NaiveBayesModel
          +import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer
          +
          — End diff –

          Imports look engine-independent. Should not be Spark-coupled.
          Imports probably should include implicit operations per document (which is why the code does weird stuff like `new MatrixOps(m)`).

          The "standard" recommended way to do imports for engine -independent code

          import org.apache.mahout.math._
          import scalabindings._
          import RLikeOps._
          import drm._
          import RLikeDrmOps._

          if java collections are used (e.g. something like `for (row <- matrix)

          { ... }

          `) then it also would need

          import collection._
          import JavaConversions._

          to enable all implicits.

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908429 — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala — @@ -0,0 +1,74 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.mahout.sparkbindings.drm.classification + +import org.apache.mahout.math.drm._ +import org.apache.mahout.math.scalabindings +import org.apache.mahout.math.scalabindings._ +import org.apache.mahout.classifier.naivebayes.NaiveBayesModel +import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer + — End diff – Imports look engine-independent. Should not be Spark-coupled. Imports probably should include implicit operations per document (which is why the code does weird stuff like `new MatrixOps(m)`). The "standard" recommended way to do imports for engine -independent code import org.apache.mahout.math._ import scalabindings._ import RLikeOps._ import drm._ import RLikeDrmOps._ if java collections are used (e.g. something like `for (row <- matrix) { ... } `) then it also would need import collection._ import JavaConversions._ to enable all implicits.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r15908475

          — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala —
          @@ -0,0 +1,74 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.mahout.sparkbindings.drm.classification
          +
          +import org.apache.mahout.math.drm._
          +import org.apache.mahout.math.scalabindings
          +import org.apache.mahout.math.scalabindings._
          +import org.apache.mahout.classifier.naivebayes.NaiveBayesModel
          +import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer
          +
          +import scala.reflect.ClassTag
          +
          +/**
          + * Distributed training of a Naive Bayes model. Follows the approach presented in Rennie et.al.: Tackling the poor
          + * assumptions of Naive Bayes Text classifiers, ICML 2003, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
          + */
          +object NaiveBayes {
          +
          + /** default value for the smoothing parameter */
          + def defaultAlphaI = 1f
          +
          + /**
          + * Distributed training of a Naive Bayes model.
          + *
          + * @param observationsPerLabel an array of matrices. Every matrix contains the observations for a particular label.
          + * @param trainComplementary whether to train a complementary Naive Bayes model
          + * @param alphaI smoothing parameter
          + * @return trained naive bayes model
          + */
          + def trainNB[K: ClassTag](observationsPerLabel: Array[DrmLike[K]], trainComplementary :Boolean = true,
          + alphaI: Float = defaultAlphaI): NaiveBayesModel = {
          +
          + // distributed summation of all observations per label
          + val weightsPerLabelAndFeature = scalabindings.dense(observationsPerLabel.map(new MatrixOps(_).colSums))
          — End diff –

          if imports are properly done, this should just be

          val weightsPerLabelAndFeature = dense(observationsPerLabel.map(_.colSums))

          Note that this is not Spark-dependent code. the `map` here is Scala collection map.

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908475 — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala — @@ -0,0 +1,74 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.mahout.sparkbindings.drm.classification + +import org.apache.mahout.math.drm._ +import org.apache.mahout.math.scalabindings +import org.apache.mahout.math.scalabindings._ +import org.apache.mahout.classifier.naivebayes.NaiveBayesModel +import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer + +import scala.reflect.ClassTag + +/** + * Distributed training of a Naive Bayes model. Follows the approach presented in Rennie et.al.: Tackling the poor + * assumptions of Naive Bayes Text classifiers, ICML 2003, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf + */ +object NaiveBayes { + + /** default value for the smoothing parameter */ + def defaultAlphaI = 1f + + /** + * Distributed training of a Naive Bayes model. + * + * @param observationsPerLabel an array of matrices. Every matrix contains the observations for a particular label. + * @param trainComplementary whether to train a complementary Naive Bayes model + * @param alphaI smoothing parameter + * @return trained naive bayes model + */ + def trainNB [K: ClassTag] (observationsPerLabel: Array[DrmLike [K] ], trainComplementary :Boolean = true, + alphaI: Float = defaultAlphaI): NaiveBayesModel = { + + // distributed summation of all observations per label + val weightsPerLabelAndFeature = scalabindings.dense(observationsPerLabel.map(new MatrixOps(_).colSums)) — End diff – if imports are properly done, this should just be val weightsPerLabelAndFeature = dense(observationsPerLabel.map(_.colSums)) Note that this is not Spark-dependent code. the `map` here is Scala collection map.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r15908489

          — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala —
          @@ -0,0 +1,74 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.mahout.sparkbindings.drm.classification
          +
          +import org.apache.mahout.math.drm._
          +import org.apache.mahout.math.scalabindings
          +import org.apache.mahout.math.scalabindings._
          +import org.apache.mahout.classifier.naivebayes.NaiveBayesModel
          +import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer
          +
          +import scala.reflect.ClassTag
          +
          +/**
          + * Distributed training of a Naive Bayes model. Follows the approach presented in Rennie et.al.: Tackling the poor
          + * assumptions of Naive Bayes Text classifiers, ICML 2003, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
          + */
          +object NaiveBayes {
          +
          + /** default value for the smoothing parameter */
          + def defaultAlphaI = 1f
          +
          + /**
          + * Distributed training of a Naive Bayes model.
          + *
          + * @param observationsPerLabel an array of matrices. Every matrix contains the observations for a particular label.
          + * @param trainComplementary whether to train a complementary Naive Bayes model
          + * @param alphaI smoothing parameter
          + * @return trained naive bayes model
          + */
          + def trainNB[K: ClassTag](observationsPerLabel: Array[DrmLike[K]], trainComplementary :Boolean = true,
          + alphaI: Float = defaultAlphaI): NaiveBayesModel = {
          +
          + // distributed summation of all observations per label
          + val weightsPerLabelAndFeature = scalabindings.dense(observationsPerLabel.map(new MatrixOps(_).colSums))
          + // local summation of all weights per feature
          + val weightsPerFeature = new MatrixOps(weightsPerLabelAndFeature).colSums
          — End diff –

          Similarly this should be just

          val weightsPerFeature = weightsPerLabelAndFeature.colSums

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908489 — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala — @@ -0,0 +1,74 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.mahout.sparkbindings.drm.classification + +import org.apache.mahout.math.drm._ +import org.apache.mahout.math.scalabindings +import org.apache.mahout.math.scalabindings._ +import org.apache.mahout.classifier.naivebayes.NaiveBayesModel +import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer + +import scala.reflect.ClassTag + +/** + * Distributed training of a Naive Bayes model. Follows the approach presented in Rennie et.al.: Tackling the poor + * assumptions of Naive Bayes Text classifiers, ICML 2003, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf + */ +object NaiveBayes { + + /** default value for the smoothing parameter */ + def defaultAlphaI = 1f + + /** + * Distributed training of a Naive Bayes model. + * + * @param observationsPerLabel an array of matrices. Every matrix contains the observations for a particular label. + * @param trainComplementary whether to train a complementary Naive Bayes model + * @param alphaI smoothing parameter + * @return trained naive bayes model + */ + def trainNB [K: ClassTag] (observationsPerLabel: Array[DrmLike [K] ], trainComplementary :Boolean = true, + alphaI: Float = defaultAlphaI): NaiveBayesModel = { + + // distributed summation of all observations per label + val weightsPerLabelAndFeature = scalabindings.dense(observationsPerLabel.map(new MatrixOps(_).colSums)) + // local summation of all weights per feature + val weightsPerFeature = new MatrixOps(weightsPerLabelAndFeature).colSums — End diff – Similarly this should be just val weightsPerFeature = weightsPerLabelAndFeature.colSums
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r15908508

          — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala —
          @@ -0,0 +1,74 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.mahout.sparkbindings.drm.classification
          +
          +import org.apache.mahout.math.drm._
          +import org.apache.mahout.math.scalabindings
          +import org.apache.mahout.math.scalabindings._
          +import org.apache.mahout.classifier.naivebayes.NaiveBayesModel
          +import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer
          +
          +import scala.reflect.ClassTag
          +
          +/**
          + * Distributed training of a Naive Bayes model. Follows the approach presented in Rennie et.al.: Tackling the poor
          + * assumptions of Naive Bayes Text classifiers, ICML 2003, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
          + */
          +object NaiveBayes {
          +
          + /** default value for the smoothing parameter */
          + def defaultAlphaI = 1f
          +
          + /**
          + * Distributed training of a Naive Bayes model.
          + *
          + * @param observationsPerLabel an array of matrices. Every matrix contains the observations for a particular label.
          + * @param trainComplementary whether to train a complementary Naive Bayes model
          + * @param alphaI smoothing parameter
          + * @return trained naive bayes model
          + */
          + def trainNB[K: ClassTag](observationsPerLabel: Array[DrmLike[K]], trainComplementary :Boolean = true,
          + alphaI: Float = defaultAlphaI): NaiveBayesModel = {
          +
          + // distributed summation of all observations per label
          + val weightsPerLabelAndFeature = scalabindings.dense(observationsPerLabel.map(new MatrixOps(_).colSums))
          + // local summation of all weights per feature
          + val weightsPerFeature = new MatrixOps(weightsPerLabelAndFeature).colSums
          + // local summation of all weights per label
          + val weightsPerLabel = new MatrixOps(weightsPerLabelAndFeature).rowSums
          — End diff –

          Same thing here. now need for `new MatrixOps`... here and elsewhere

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908508 — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala — @@ -0,0 +1,74 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.mahout.sparkbindings.drm.classification + +import org.apache.mahout.math.drm._ +import org.apache.mahout.math.scalabindings +import org.apache.mahout.math.scalabindings._ +import org.apache.mahout.classifier.naivebayes.NaiveBayesModel +import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer + +import scala.reflect.ClassTag + +/** + * Distributed training of a Naive Bayes model. Follows the approach presented in Rennie et.al.: Tackling the poor + * assumptions of Naive Bayes Text classifiers, ICML 2003, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf + */ +object NaiveBayes { + + /** default value for the smoothing parameter */ + def defaultAlphaI = 1f + + /** + * Distributed training of a Naive Bayes model. + * + * @param observationsPerLabel an array of matrices. Every matrix contains the observations for a particular label. + * @param trainComplementary whether to train a complementary Naive Bayes model + * @param alphaI smoothing parameter + * @return trained naive bayes model + */ + def trainNB [K: ClassTag] (observationsPerLabel: Array[DrmLike [K] ], trainComplementary :Boolean = true, + alphaI: Float = defaultAlphaI): NaiveBayesModel = { + + // distributed summation of all observations per label + val weightsPerLabelAndFeature = scalabindings.dense(observationsPerLabel.map(new MatrixOps(_).colSums)) + // local summation of all weights per feature + val weightsPerFeature = new MatrixOps(weightsPerLabelAndFeature).colSums + // local summation of all weights per label + val weightsPerLabel = new MatrixOps(weightsPerLabelAndFeature).rowSums — End diff – Same thing here. now need for `new MatrixOps`... here and elsewhere
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51407423

          Yes-- and as a reminder this code is a compilation of patches that were writen before the abastraction away from spark (not by myself). I've not looked at it too closely, and just put it up for comment and feedback for the Berlin TU students. I've been away almost the entire summer and had planned on doing some work on this to get back up to speed and to try out the H2o bindings.

          So please- yes any comments are welcome.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51407423 Yes-- and as a reminder this code is a compilation of patches that were writen before the abastraction away from spark (not by myself). I've not looked at it too closely, and just put it up for comment and feedback for the Berlin TU students. I've been away almost the entire summer and had planned on doing some work on this to get back up to speed and to try out the H2o bindings. So please- yes any comments are welcome.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r15908555

          — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala —
          @@ -0,0 +1,74 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.mahout.sparkbindings.drm.classification
          +
          +import org.apache.mahout.math.drm._
          +import org.apache.mahout.math.scalabindings
          +import org.apache.mahout.math.scalabindings._
          +import org.apache.mahout.classifier.naivebayes.NaiveBayesModel
          +import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer
          +
          +import scala.reflect.ClassTag
          +
          +/**
          + * Distributed training of a Naive Bayes model. Follows the approach presented in Rennie et.al.: Tackling the poor
          + * assumptions of Naive Bayes Text classifiers, ICML 2003, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
          + */
          +object NaiveBayes {
          +
          + /** default value for the smoothing parameter */
          + def defaultAlphaI = 1f
          +
          + /**
          + * Distributed training of a Naive Bayes model.
          + *
          + * @param observationsPerLabel an array of matrices. Every matrix contains the observations for a particular label.
          + * @param trainComplementary whether to train a complementary Naive Bayes model
          + * @param alphaI smoothing parameter
          + * @return trained naive bayes model
          + */
          + def trainNB[K: ClassTag](observationsPerLabel: Array[DrmLike[K]], trainComplementary :Boolean = true,
          + alphaI: Float = defaultAlphaI): NaiveBayesModel = {
          +
          + // distributed summation of all observations per label
          + val weightsPerLabelAndFeature = scalabindings.dense(observationsPerLabel.map(new MatrixOps(_).colSums))
          + // local summation of all weights per feature
          + val weightsPerFeature = new MatrixOps(weightsPerLabelAndFeature).colSums
          + // local summation of all weights per label
          + val weightsPerLabel = new MatrixOps(weightsPerLabelAndFeature).rowSums
          +
          + // perLabelThetaNormalizer Vector is expected by NaiveBayesModel. We can pass a null value
          + // in the case of a standard NB model
          + var thetaNormalizer: org.apache.mahout.math.Vector= null
          +
          + // instantiate a trainer and retrieve the perLabelThetaNormalizer Vector from it in the case of
          + // a complementary NB model
          + if( trainComplementary ){
          + val thetaTrainer = new ComplementaryThetaTrainer(weightsPerFeature, weightsPerLabel, alphaI)
          + // local training of the theta normalization
          + for (labelIndex <- 0 until new MatrixOps(weightsPerLabelAndFeature).nrow) {
          + thetaTrainer.train(labelIndex, weightsPerLabelAndFeature.viewRow(labelIndex))
          — End diff –

          in Mahout Scala, this slicing should look

          ... weightsPerLabelAndFeature(labelIndex, :

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908555 — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala — @@ -0,0 +1,74 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.mahout.sparkbindings.drm.classification + +import org.apache.mahout.math.drm._ +import org.apache.mahout.math.scalabindings +import org.apache.mahout.math.scalabindings._ +import org.apache.mahout.classifier.naivebayes.NaiveBayesModel +import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer + +import scala.reflect.ClassTag + +/** + * Distributed training of a Naive Bayes model. Follows the approach presented in Rennie et.al.: Tackling the poor + * assumptions of Naive Bayes Text classifiers, ICML 2003, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf + */ +object NaiveBayes { + + /** default value for the smoothing parameter */ + def defaultAlphaI = 1f + + /** + * Distributed training of a Naive Bayes model. + * + * @param observationsPerLabel an array of matrices. Every matrix contains the observations for a particular label. + * @param trainComplementary whether to train a complementary Naive Bayes model + * @param alphaI smoothing parameter + * @return trained naive bayes model + */ + def trainNB [K: ClassTag] (observationsPerLabel: Array[DrmLike [K] ], trainComplementary :Boolean = true, + alphaI: Float = defaultAlphaI): NaiveBayesModel = { + + // distributed summation of all observations per label + val weightsPerLabelAndFeature = scalabindings.dense(observationsPerLabel.map(new MatrixOps(_).colSums)) + // local summation of all weights per feature + val weightsPerFeature = new MatrixOps(weightsPerLabelAndFeature).colSums + // local summation of all weights per label + val weightsPerLabel = new MatrixOps(weightsPerLabelAndFeature).rowSums + + // perLabelThetaNormalizer Vector is expected by NaiveBayesModel. We can pass a null value + // in the case of a standard NB model + var thetaNormalizer: org.apache.mahout.math.Vector= null + + // instantiate a trainer and retrieve the perLabelThetaNormalizer Vector from it in the case of + // a complementary NB model + if( trainComplementary ){ + val thetaTrainer = new ComplementaryThetaTrainer(weightsPerFeature, weightsPerLabel, alphaI) + // local training of the theta normalization + for (labelIndex <- 0 until new MatrixOps(weightsPerLabelAndFeature).nrow) { + thetaTrainer.train(labelIndex, weightsPerLabelAndFeature.viewRow(labelIndex)) — End diff – in Mahout Scala, this slicing should look ... weightsPerLabelAndFeature(labelIndex, :
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r15908947

          — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala —
          @@ -0,0 +1,74 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.mahout.sparkbindings.drm.classification
          +
          +import org.apache.mahout.math.drm._
          +import org.apache.mahout.math.scalabindings
          +import org.apache.mahout.math.scalabindings._
          +import org.apache.mahout.classifier.naivebayes.NaiveBayesModel
          +import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer
          +
          +import scala.reflect.ClassTag
          +
          +/**
          + * Distributed training of a Naive Bayes model. Follows the approach presented in Rennie et.al.: Tackling the poor
          + * assumptions of Naive Bayes Text classifiers, ICML 2003, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
          + */
          +object NaiveBayes {
          +
          + /** default value for the smoothing parameter */
          + def defaultAlphaI = 1f
          — End diff –

          Mahout convention is to write these as `1.0` rather than a float.

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908947 — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala — @@ -0,0 +1,74 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.mahout.sparkbindings.drm.classification + +import org.apache.mahout.math.drm._ +import org.apache.mahout.math.scalabindings +import org.apache.mahout.math.scalabindings._ +import org.apache.mahout.classifier.naivebayes.NaiveBayesModel +import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer + +import scala.reflect.ClassTag + +/** + * Distributed training of a Naive Bayes model. Follows the approach presented in Rennie et.al.: Tackling the poor + * assumptions of Naive Bayes Text classifiers, ICML 2003, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf + */ +object NaiveBayes { + + /** default value for the smoothing parameter */ + def defaultAlphaI = 1f — End diff – Mahout convention is to write these as `1.0` rather than a float.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51414355

          @dlyubimov thanks for all the comments I'm going to try to get the changes in shortly.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51414355 @dlyubimov thanks for all the comments I'm going to try to get the changes in shortly.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51662654

          I made most of the changes from Dmitriy's comments. I've done some (hackish) work here just to get this in the right package, compiling and and testing. Changed the Array[DrmLike] to DrmLike for the sparse feature input (for now). This is very basic and assumes that each row correspnds to a unique label. The only real engine specific DRM work right now is done in:

          val weightsPerFeature = observationsPerLabel.colSums

          I've added a Spark test suite with a test for a skeleton NB model. Tests pass here on Spark. I've also added an h2o test suite on my MAHOUT-1493-1500 branch with relativly minimal effort (had to make a few dependency changes to the h20/pom.xml). h2o tests pass.

          Obviously there is still a lot of work to do here and it won't be ready to merge anytime soon, so I'll leave this PR open for a little while in case anybody's interested and then close it until i have some more work done on it as to not clog up the PR page.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51662654 I made most of the changes from Dmitriy's comments. I've done some (hackish) work here just to get this in the right package, compiling and and testing. Changed the Array [DrmLike] to DrmLike for the sparse feature input (for now). This is very basic and assumes that each row correspnds to a unique label. The only real engine specific DRM work right now is done in: val weightsPerFeature = observationsPerLabel.colSums I've added a Spark test suite with a test for a skeleton NB model. Tests pass here on Spark. I've also added an h2o test suite on my MAHOUT-1493 -1500 branch with relativly minimal effort (had to make a few dependency changes to the h20/pom.xml). h2o tests pass. Obviously there is still a lot of work to do here and it won't be ready to merge anytime soon, so I'll leave this PR open for a little while in case anybody's interested and then close it until i have some more work done on it as to not clog up the PR page.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r16019061

          — Diff: CHANGELOG —
          @@ -2,7 +2,15 @@ Mahout Change Log

          Release 1.0 - unreleased

          • MAHOUT-1583: cbind() operator for Scala DRMs
            + MAHOUT-1596: implement rbind() operator (Anand Avati and dlyubimov)
            +
            + MAHOUT-1597: A + 1.0 (element-wise scala operation) gives wrong result if rdd is missing rows, Spark side (dlyubimov)
            +
            + MAHOUT-1595: MatrixVectorView - implement a proper iterateNonZero() (Anand Avati via dlyubimov)
            +
            + MAHOUT-1529(e): Move dense/sparse matrix test in mapBlock into spark (Anand Avati via dlyubimov)
            +
            + MAHOUT-1583: cbind() operator for Scala DRMs (dlyubimov)

          — End diff –

          er...not sure why changelog shows these changes, please sync to latest master so no changes until squash-commit please.

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r16019061 — Diff: CHANGELOG — @@ -2,7 +2,15 @@ Mahout Change Log Release 1.0 - unreleased MAHOUT-1583 : cbind() operator for Scala DRMs + MAHOUT-1596 : implement rbind() operator (Anand Avati and dlyubimov) + + MAHOUT-1597 : A + 1.0 (element-wise scala operation) gives wrong result if rdd is missing rows, Spark side (dlyubimov) + + MAHOUT-1595 : MatrixVectorView - implement a proper iterateNonZero() (Anand Avati via dlyubimov) + + MAHOUT-1529 (e): Move dense/sparse matrix test in mapBlock into spark (Anand Avati via dlyubimov) + + MAHOUT-1583 : cbind() operator for Scala DRMs (dlyubimov) — End diff – er...not sure why changelog shows these changes, please sync to latest master so no changes until squash-commit please.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r16019136

          — Diff: math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NaiveBayes.scala —
          @@ -0,0 +1,72 @@
          +package org.apache.mahout.classifier.naivebayes
          +
          +import org.apache.mahout.math._
          +import org.apache.mahout.math.function.VectorFunction
          +import scalabindings._
          +import scalabindings.RLikeOps._
          +import drm.RLikeDrmOps._
          — End diff –

          weird. this should be just

          import RLikeDrmOps._

          (since math._ is already imported)

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r16019136 — Diff: math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NaiveBayes.scala — @@ -0,0 +1,72 @@ +package org.apache.mahout.classifier.naivebayes + +import org.apache.mahout.math._ +import org.apache.mahout.math.function.VectorFunction +import scalabindings._ +import scalabindings.RLikeOps._ +import drm.RLikeDrmOps._ — End diff – weird. this should be just import RLikeDrmOps._ (since math._ is already imported)
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r16019231

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/cf/CooccurrenceAnalysis.scala —
          @@ -15,7 +15,7 @@

          • limitations under the License.
            */

          -package org.apache.mahout.cf
          — End diff –

          hm. why any changes to Co-oc analysis? i think this needs merging with master with -Xtheirs

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r16019231 — Diff: math-scala/src/main/scala/org/apache/mahout/math/cf/CooccurrenceAnalysis.scala — @@ -15,7 +15,7 @@ limitations under the License. */ -package org.apache.mahout.cf — End diff – hm. why any changes to Co-oc analysis? i think this needs merging with master with -Xtheirs
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51663705

          I think the patch went out of sync with current master pretty badly. may require merge, optionally with `-Xtheirs` if conflicts become too ugly.

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51663705 I think the patch went out of sync with current master pretty badly. may require merge, optionally with `-Xtheirs` if conflicts become too ugly.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51664409

          Ugg- sorry about that- my local repo went out of wack when I was merging something else. Thanks.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51664409 Ugg- sorry about that- my local repo went out of wack when I was merging something else. Thanks.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51671230

          OK.. rebased on master. Not sure exactly what happened there. everything should be up to date.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51671230 OK.. rebased on master. Not sure exactly what happened there. everything should be up to date.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r16022926

          — Diff: spark/src/test/scala/org/apache/mahout/classifier/naivebayes/NBSparkTestSuite.scala —
          @@ -0,0 +1,11 @@
          +package org.apache.mahout.classifier.naivebayes
          +
          +import org.apache.mahout.math._
          +import org.apache.mahout.math.scalabindings.RLikeOps._
          +import org.apache.mahout.math.scalabindings._
          +import org.apache.mahout.sparkbindings.test.DistributedSparkSuite
          +import org.apache.mahout.test.MahoutSuite
          +import org.scalatest.FunSuite
          +
          +class NBSparkTestSuite extends FunSuite with MahoutSuite with DistributedSparkSuite with NBTestBase

          { +}

          — End diff –

          please remove `{ }` for classes with no body

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r16022926 — Diff: spark/src/test/scala/org/apache/mahout/classifier/naivebayes/NBSparkTestSuite.scala — @@ -0,0 +1,11 @@ +package org.apache.mahout.classifier.naivebayes + +import org.apache.mahout.math._ +import org.apache.mahout.math.scalabindings.RLikeOps._ +import org.apache.mahout.math.scalabindings._ +import org.apache.mahout.sparkbindings.test.DistributedSparkSuite +import org.apache.mahout.test.MahoutSuite +import org.scalatest.FunSuite + +class NBSparkTestSuite extends FunSuite with MahoutSuite with DistributedSparkSuite with NBTestBase { +} — End diff – please remove `{ }` for classes with no body
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r16022935

          — Diff: math-scala/pom.xml —
          @@ -167,6 +167,12 @@
          <artifactId>mahout-math</artifactId>
          </dependency>

          + <dependency>
          + <groupId>org.apache.mahout</groupId>
          + <artifactId>mahout-mrlegacy</artifactId>
          + </dependency>
          +
          — End diff –

          what is using mr-legacy here?

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r16022935 — Diff: math-scala/pom.xml — @@ -167,6 +167,12 @@ <artifactId>mahout-math</artifactId> </dependency> + <dependency> + <groupId>org.apache.mahout</groupId> + <artifactId>mahout-mrlegacy</artifactId> + </dependency> + — End diff – what is using mr-legacy here?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51671505

          Looks much -much cleaner, thank you Andrew!.

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51671505 Looks much -much cleaner, thank you Andrew!.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r16022971

          — Diff: math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NaiveBayes.scala —
          @@ -0,0 +1,72 @@
          +package org.apache.mahout.classifier.naivebayes
          +
          +import org.apache.mahout.math._
          +import org.apache.mahout.math.function.VectorFunction
          +import scalabindings._
          +import scalabindings.RLikeOps._
          +import drm.RLikeDrmOps._
          — End diff –

          Yeah I was having alot of trouble with the imports. I'm new to scala. I'd tried your import suggestions, but was having trouble with them. I was getting errors when I was using:

          import RLikeDrmOps._

          I put the package in:

          org.apache.mahout.classifier.naivebayes

          so it is a level down from `org.apache.mahout.math` so from what I was thinking it needed the `drm.RLikeDrmOps._`

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r16022971 — Diff: math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NaiveBayes.scala — @@ -0,0 +1,72 @@ +package org.apache.mahout.classifier.naivebayes + +import org.apache.mahout.math._ +import org.apache.mahout.math.function.VectorFunction +import scalabindings._ +import scalabindings.RLikeOps._ +import drm.RLikeDrmOps._ — End diff – Yeah I was having alot of trouble with the imports. I'm new to scala. I'd tried your import suggestions, but was having trouble with them. I was getting errors when I was using: import RLikeDrmOps._ I put the package in: org.apache.mahout.classifier.naivebayes so it is a level down from `org.apache.mahout.math` so from what I was thinking it needed the `drm.RLikeDrmOps._`
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r16022978

          — Diff: math-scala/pom.xml —
          @@ -167,6 +167,12 @@
          <artifactId>mahout-math</artifactId>
          </dependency>

          + <dependency>
          + <groupId>org.apache.mahout</groupId>
          + <artifactId>mahout-mrlegacy</artifactId>
          + </dependency>
          +
          — End diff –

          NaiveBayesModel is a java class in mr-legacy

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r16022978 — Diff: math-scala/pom.xml — @@ -167,6 +167,12 @@ <artifactId>mahout-math</artifactId> </dependency> + <dependency> + <groupId>org.apache.mahout</groupId> + <artifactId>mahout-mrlegacy</artifactId> + </dependency> + — End diff – NaiveBayesModel is a java class in mr-legacy
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51671783

          Thanks for the reviews!

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51671783 Thanks for the reviews!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r16023021

          — Diff: math-scala/pom.xml —
          @@ -167,6 +167,12 @@
          <artifactId>mahout-math</artifactId>
          </dependency>

          + <dependency>
          + <groupId>org.apache.mahout</groupId>
          + <artifactId>mahout-mrlegacy</artifactId>
          + </dependency>
          +
          — End diff –

          ok

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r16023021 — Diff: math-scala/pom.xml — @@ -167,6 +167,12 @@ <artifactId>mahout-math</artifactId> </dependency> + <dependency> + <groupId>org.apache.mahout</groupId> + <artifactId>mahout-mrlegacy</artifactId> + </dependency> + — End diff – ok
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51671975

          ok i suppose since the test passes we can merge that in now, and tweak any problems later, in order not to have to handle any more conflicts...

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51671975 ok i suppose since the test passes we can merge that in now, and tweak any problems later, in order not to have to handle any more conflicts...
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-51672766

          It still needs alot of work before its really a functioning Classifier. At the moment it assumes that each instance is a uniqe class- It probably needs to go back to accepting an array as Sebastian originally had it. It definitly needs some more tests. But i'm fine with merging this in as a base now and working on it from there. It definitly will make it easier to work with!

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51672766 It still needs alot of work before its really a functioning Classifier. At the moment it assumes that each instance is a uniqe class- It probably needs to go back to accepting an array as Sebastian originally had it. It definitly needs some more tests. But i'm fine with merging this in as a base now and working on it from there. It definitly will make it easier to work with!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo closed the pull request at:

          https://github.com/apache/mahout/pull/32

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo closed the pull request at: https://github.com/apache/mahout/pull/32
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-54686571

          is this abandoned?

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-54686571 is this abandoned?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-54692047

          Not at all- I was just closing it up to keep the PR list clean while i did some work on it (was hoping to soon). Should just merge it as we discussed and then work from there? Was going to do a little more work on it but I'm not sure how long that will take.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-54692047 Not at all- I was just closing it up to keep the PR list clean while i did some work on it (was hoping to soon). Should just merge it as we discussed and then work from there? Was going to do a little more work on it but I'm not sure how long that will take.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-54692207

          No, do what you think is right... just was checking.

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-54692207 No, do what you think is right... just was checking.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-54693386

          cool- thx. I plan on focusing on this for a while now (when I can). I have to look at it closely again.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-54693386 cool- thx. I plan on focusing on this for a while now (when I can). I have to look at it closely again.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user andrewpalumbo reopened a pull request:

          https://github.com/apache/mahout/pull/32

          MAHOUT-1493 Port Naive Bayes to Spark DSL (for comment)

          Created this PR to review the patch provided by Christoph Viebig and team, Students of TU, Berlin. It is based on Sebastian's code from before the DSL was abstracted away from Spark.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/mahout/pull/32.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #32


          commit fcde47a0ed2708e991c302f152fdb2afde37a44d
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-07-18T04:55:22Z

          applied MAHOUT-1493.patch by cviebig and team from TU Berlin

          commit 0b985daa62d756d4b34aaecc3a6b97cc75168b6a
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-07-18T04:59:24Z

          applied MAHOUT-1493a patch: edits to NaiveBayesModel call

          commit 8a769711cff7743be6a41f784edc335778d497c4
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-08-07T05:51:21Z

          Moved NaiveBayes.scala to math-scala module under o.a.m.classification. Made several modifications to get more Mahout DSL-like

          commit de202e9b1c367d68a3778ef41dae6554958459d2
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-08-07T07:10:41Z

          Removed from drm package. More DSl stuff

          commit e331ffd4059ddd914f78252eb2229c89a1ed6622
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-08-07T22:50:35Z

          Added mrlegacy dependency and imports

          commit 796b8e3bb129be55ef5873df062b2b22c2f2ac78
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-08-08T00:59:31Z

          Moved NaiveBayes.scala to package o.a.m.classifier.naivebayes

          commit ca08b5ae422010d73b801cee2a8d8392b9028301
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-08-08T02:38:30Z

          Fixed imports and other small things. Compiles now.

          commit 848e8ff8becfa9446a643a511194dd3ad3892163
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-08-08T08:25:03Z

          Changed constructor to accept DrmLike rather than an array of DrmLike. Added in a very basic test from. Test passes.Hacked a bunch of NaiveBayes.scala together so needs alot of cleanup. Basic skeleton of NB

          commit aaa81774c7617c8f436717c040d6d0d166091440
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-08-08T21:21:50Z

          Clean up a few things

          commit a55ecb5fb709e030e4964118fab58ec3708fbe1d
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-08-08T21:57:36Z

          Clean up a few more things

          commit 337ad5ffc3ac047590066432049db6d3c07d86ed
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-08-29T15:28:30Z

          Style cleanup

          commit bc462fd925004c33a04f1d2fea387978999cf25f
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-09-11T02:28:16Z

          First attempt at combiner for seq2sparse outputted vectors

          commit e8a40aabd47af024ab02b47dc1769fd18d2679a8
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-09-11T02:44:51Z

          A few fixes and some cleanup. Compiles.

          commit bbaf2f356c3b86badfd9cb8778475f52c80f24f8
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-09-11T02:55:33Z

          Fixed second loop in Comibiner, cleanup, comments

          commit a9bbcd5f97b45ad386df2f23d472e7501db8845d
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-09-11T03:08:39Z

          Remove unused imoport, add some comments

          commit 6be1b93f5e3b86c36a74ea42b68436dedfb1c8b7
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-09-11T14:05:32Z

          comments, javadoc, liscense

          commit 557ff0a4b1e048cd9cf1f3f03f2b7d5f23c610aa
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-09-11T14:20:45Z

          change blockB to blockA.like and setter to setQuick

          commit 0b5238faa513033c06512aad9941baa6feb5c355
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-09-11T14:33:53Z

          change trainNB to accept a DrmLike[Int]

          commit e006fc3d58fb6cd85f5093c5c0641b177d58f793
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-09-11T14:39:36Z

          comments

          commit fa8fc27f6572d2edca136f0c19f7bca857f68195
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-09-11T16:59:45Z

          Change extractLabelsAndAggregateObservations to take a DrmLike[K] rather than [String]

          commit f510242391a93f0afd61c822a5c579cb38f65123
          Author: Andrew Palumbo <ap.dev@outlook.com>
          Date: 2014-09-13T17:57:45Z

          Cleanup, comments


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user andrewpalumbo reopened a pull request: https://github.com/apache/mahout/pull/32 MAHOUT-1493 Port Naive Bayes to Spark DSL (for comment) Created this PR to review the patch provided by Christoph Viebig and team, Students of TU, Berlin. It is based on Sebastian's code from before the DSL was abstracted away from Spark. You can merge this pull request into a Git repository by running: $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/mahout/pull/32.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #32 commit fcde47a0ed2708e991c302f152fdb2afde37a44d Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-07-18T04:55:22Z applied MAHOUT-1493 .patch by cviebig and team from TU Berlin commit 0b985daa62d756d4b34aaecc3a6b97cc75168b6a Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-07-18T04:59:24Z applied MAHOUT-1493 a patch: edits to NaiveBayesModel call commit 8a769711cff7743be6a41f784edc335778d497c4 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-08-07T05:51:21Z Moved NaiveBayes.scala to math-scala module under o.a.m.classification. Made several modifications to get more Mahout DSL-like commit de202e9b1c367d68a3778ef41dae6554958459d2 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-08-07T07:10:41Z Removed from drm package. More DSl stuff commit e331ffd4059ddd914f78252eb2229c89a1ed6622 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-08-07T22:50:35Z Added mrlegacy dependency and imports commit 796b8e3bb129be55ef5873df062b2b22c2f2ac78 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-08-08T00:59:31Z Moved NaiveBayes.scala to package o.a.m.classifier.naivebayes commit ca08b5ae422010d73b801cee2a8d8392b9028301 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-08-08T02:38:30Z Fixed imports and other small things. Compiles now. commit 848e8ff8becfa9446a643a511194dd3ad3892163 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-08-08T08:25:03Z Changed constructor to accept DrmLike rather than an array of DrmLike. Added in a very basic test from. Test passes.Hacked a bunch of NaiveBayes.scala together so needs alot of cleanup. Basic skeleton of NB commit aaa81774c7617c8f436717c040d6d0d166091440 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-08-08T21:21:50Z Clean up a few things commit a55ecb5fb709e030e4964118fab58ec3708fbe1d Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-08-08T21:57:36Z Clean up a few more things commit 337ad5ffc3ac047590066432049db6d3c07d86ed Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-08-29T15:28:30Z Style cleanup commit bc462fd925004c33a04f1d2fea387978999cf25f Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-09-11T02:28:16Z First attempt at combiner for seq2sparse outputted vectors commit e8a40aabd47af024ab02b47dc1769fd18d2679a8 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-09-11T02:44:51Z A few fixes and some cleanup. Compiles. commit bbaf2f356c3b86badfd9cb8778475f52c80f24f8 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-09-11T02:55:33Z Fixed second loop in Comibiner, cleanup, comments commit a9bbcd5f97b45ad386df2f23d472e7501db8845d Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-09-11T03:08:39Z Remove unused imoport, add some comments commit 6be1b93f5e3b86c36a74ea42b68436dedfb1c8b7 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-09-11T14:05:32Z comments, javadoc, liscense commit 557ff0a4b1e048cd9cf1f3f03f2b7d5f23c610aa Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-09-11T14:20:45Z change blockB to blockA.like and setter to setQuick commit 0b5238faa513033c06512aad9941baa6feb5c355 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-09-11T14:33:53Z change trainNB to accept a DrmLike [Int] commit e006fc3d58fb6cd85f5093c5c0641b177d58f793 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-09-11T14:39:36Z comments commit fa8fc27f6572d2edca136f0c19f7bca857f68195 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-09-11T16:59:45Z Change extractLabelsAndAggregateObservations to take a DrmLike [K] rather than [String] commit f510242391a93f0afd61c822a5c579cb38f65123 Author: Andrew Palumbo <ap.dev@outlook.com> Date: 2014-09-13T17:57:45Z Cleanup, comments
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-55501933

          This is still very much "java in scala", and needs alot of work (not even sure if what im referring to as the "Combiner" even works . But conceptually training should work through to NaiveBayesModel on any engine. NaiveBayesModel is an MRLegacy object.

          My thinking is that extractLabelsAndAggregateObservations can be overridden and optimized for any engine.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-55501933 This is still very much "java in scala", and needs alot of work (not even sure if what im referring to as the "Combiner" even works . But conceptually training should work through to NaiveBayesModel on any engine. NaiveBayesModel is an MRLegacy object. My thinking is that extractLabelsAndAggregateObservations can be overridden and optimized for any engine.
          Hide
          Andrew_Palumbo Andrew Palumbo added a comment -

          Sebastian Schelter I've been doing some work here that is almost ready to commit- do you mind if I re-assign this to myself?

          Show
          Andrew_Palumbo Andrew Palumbo added a comment - Sebastian Schelter I've been doing some work here that is almost ready to commit- do you mind if I re-assign this to myself?
          Hide
          dlyubimov Dmitriy Lyubimov added a comment -

          I am guessing no answer means ok
          @Andrew please go ahead it makes sense to press on on this.

          Show
          dlyubimov Dmitriy Lyubimov added a comment - I am guessing no answer means ok @Andrew please go ahead it makes sense to press on on this.
          Hide
          Andrew_Palumbo Andrew Palumbo added a comment -

          Thanks Dmitriy, I'll get back to work on this shortly. Appreciate it!

          Show
          Andrew_Palumbo Andrew Palumbo added a comment - Thanks Dmitriy, I'll get back to work on this shortly. Appreciate it!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-65961642

          I’ve had some time to work on this recently and have Naïve Bayes implemented in the math-scala and spark modules with basic CLI drivers (for Spark) and working with the classify-20newsgroups.sh example. It still needs some refactoring and cleanup. In the math-scala implementation, there are some problems with the `NaiveBayes.extractLabelsAndAggregateObservations(…)` function in NaiveBayes.scala. I’ve put together a hack using `mapBlock(…)` which works, but very inefficiently (uses 2 transposes on a large set and ). Not very pretty. As well in order to convert from a String-Keyed to an Int-Keyed DRM it is necessary to collect the entire dataset up front. I’m going to investigate a .toIntKeyed() function for DrmLike though I’m still not all that familiar with the RDD backing. So currently I’m just overriding it in a SparkNaiveBayes object in the spark package.

          There are three (trivial) MRLegacy dependencies: `ComplementaryThetaTrainer`, `ResultAnalyzer` and ClassifierResult. I’ll probably port `ComplementaryThetaTrainer` into the Math-Scala module. Maybe `ComplementaryThetaTrainer` and `ResultAnalyzer` are candidates to be moved into Mahout-Math. I’m not sure if we’re trying to keep MRLegacy out of Math-Scala?
          A few issues that I still need to work out:

          1. A bug in the H20Drm –` H20Helper.java` uses a `water.fvec.Vec` vector to store label keys but `water.fvec.Vec` does not accept String values (throws an error). So we need a new distributed data structure to store String Row Labels for H20 DRMs. Otherwise this should be working for H20.

          2. Need a distributed method for conversion for the conversion from String to Int-Keyed DRMs (as mentioned above)

          3. As is, the `NBModel` is not fully serializable (several `RandomAccessSparseVectors` fields and a `Matrix` field in the class). I’m still working out a way to broadcast it to the mapBlock closure. Should be an relatively straightforward fix. So NaiveBayes.test(…) is running sequentially for now, and collecting the full test set up front. I haven’t looked at it too closely, but I’m wondering if we can add a `drmBroadcast(…)` method for arbitrary (serializable) Broadcast Objects?

          Next steps:

          1. Address above issues.

          2. Add in more CLI options.

          3. Add more tests.

          4. Implement NaiveBayes.classifyNew(…) method as outlined in MAHOUT-1564.

          If there are no objections, I’ll probably commit this after a bit more cleanup, and after some style fixes, and then work on the next steps from there.

          Any input is appreciated.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-65961642 I’ve had some time to work on this recently and have Naïve Bayes implemented in the math-scala and spark modules with basic CLI drivers (for Spark) and working with the classify-20newsgroups.sh example. It still needs some refactoring and cleanup. In the math-scala implementation, there are some problems with the `NaiveBayes.extractLabelsAndAggregateObservations(…)` function in NaiveBayes.scala. I’ve put together a hack using `mapBlock(…)` which works, but very inefficiently (uses 2 transposes on a large set and ). Not very pretty. As well in order to convert from a String-Keyed to an Int-Keyed DRM it is necessary to collect the entire dataset up front. I’m going to investigate a .toIntKeyed() function for DrmLike though I’m still not all that familiar with the RDD backing. So currently I’m just overriding it in a SparkNaiveBayes object in the spark package. There are three (trivial) MRLegacy dependencies: `ComplementaryThetaTrainer`, `ResultAnalyzer` and ClassifierResult. I’ll probably port `ComplementaryThetaTrainer` into the Math-Scala module. Maybe `ComplementaryThetaTrainer` and `ResultAnalyzer` are candidates to be moved into Mahout-Math. I’m not sure if we’re trying to keep MRLegacy out of Math-Scala? A few issues that I still need to work out: 1. A bug in the H20Drm –` H20Helper.java` uses a `water.fvec.Vec` vector to store label keys but `water.fvec.Vec` does not accept String values (throws an error). So we need a new distributed data structure to store String Row Labels for H20 DRMs. Otherwise this should be working for H20. 2. Need a distributed method for conversion for the conversion from String to Int-Keyed DRMs (as mentioned above) 3. As is, the `NBModel` is not fully serializable (several `RandomAccessSparseVectors` fields and a `Matrix` field in the class). I’m still working out a way to broadcast it to the mapBlock closure. Should be an relatively straightforward fix. So NaiveBayes.test(…) is running sequentially for now, and collecting the full test set up front. I haven’t looked at it too closely, but I’m wondering if we can add a `drmBroadcast(…)` method for arbitrary (serializable) Broadcast Objects? Next steps: 1. Address above issues. 2. Add in more CLI options. 3. Add more tests. 4. Implement NaiveBayes.classifyNew(…) method as outlined in MAHOUT-1564 . If there are no objections, I’ll probably commit this after a bit more cleanup, and after some style fixes, and then work on the next steps from there. Any input is appreciated.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user avati commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-65961981

          On Sun Dec 07 2014 at 3:43:50 PM Andrew Palumbo <notifications@github.com>
          wrote:

          >
          >
          > 1.
          >
          > A bug in the H20Drm –H20Helper.java uses a water.fvec.Vec vector to
          > store label keys but water.fvec.Vec does not accept String values
          > (throws an error). So we need a new distributed data structure to store
          > String Row Labels for H20 DRMs. Otherwise this should be working for H20.
          >
          >
          H2ODrm actually stores label keys as ValueString internally (by converting
          java String to/fro). Can you point to the exact line which is failing (or
          give the backtrace), I can have a quick look / fix it.

          Thanks

          Show
          githubbot ASF GitHub Bot added a comment - Github user avati commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-65961981 On Sun Dec 07 2014 at 3:43:50 PM Andrew Palumbo <notifications@github.com> wrote: > > > 1. > > A bug in the H20Drm –H20Helper.java uses a water.fvec.Vec vector to > store label keys but water.fvec.Vec does not accept String values > (throws an error). So we need a new distributed data structure to store > String Row Labels for H20 DRMs. Otherwise this should be working for H20. > > H2ODrm actually stores label keys as ValueString internally (by converting java String to/fro). Can you point to the exact line which is failing (or give the backtrace), I can have a quick look / fix it. Thanks
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r21430349

          — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java —
          @@ -327,8 +327,9 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) {
          labels = frame.anyVec().makeZero();
          Vec.Writer writer = labels.open();
          Map<Integer,String> rmap = reverseMap(map);
          -

          • for (long r = 0; r < m.rowSize(); r++) {
            + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values
            + // TODO: need a new distributed data structure for storing String keys.
            + for (int r = 0; r < m.rowSize(); r++) {
            writer.set(r, rmap.get(r));
              • End diff –

          Hey anand, getting an error here.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r21430349 — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java — @@ -327,8 +327,9 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) { labels = frame.anyVec().makeZero(); Vec.Writer writer = labels.open(); Map<Integer,String> rmap = reverseMap(map); - for (long r = 0; r < m.rowSize(); r++) { + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values + // TODO: need a new distributed data structure for storing String keys. + for (int r = 0; r < m.rowSize(); r++) { writer.set(r, rmap.get(r)); End diff – Hey anand, getting an error here.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user avati commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r21430551

          — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java —
          @@ -327,8 +327,9 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) {
          labels = frame.anyVec().makeZero();
          Vec.Writer writer = labels.open();
          Map<Integer,String> rmap = reverseMap(map);
          -

          • for (long r = 0; r < m.rowSize(); r++) {
            + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values
            + // TODO: need a new distributed data structure for storing String keys.
            + for (int r = 0; r < m.rowSize(); r++) {
            writer.set(r, rmap.get(r));
              • End diff –

          Ah, patch coming up..

          Show
          githubbot ASF GitHub Bot added a comment - Github user avati commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r21430551 — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java — @@ -327,8 +327,9 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) { labels = frame.anyVec().makeZero(); Vec.Writer writer = labels.open(); Map<Integer,String> rmap = reverseMap(map); - for (long r = 0; r < m.rowSize(); r++) { + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values + // TODO: need a new distributed data structure for storing String keys. + for (int r = 0; r < m.rowSize(); r++) { writer.set(r, rmap.get(r)); End diff – Ah, patch coming up..
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user avati commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r21430555

          — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java —
          @@ -327,8 +327,9 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) {
          labels = frame.anyVec().makeZero();
          Vec.Writer writer = labels.open();
          Map<Integer,String> rmap = reverseMap(map);
          -

          • for (long r = 0; r < m.rowSize(); r++) {
            + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values
            + // TODO: need a new distributed data structure for storing String keys.
            + for (int r = 0; r < m.rowSize(); r++) {
            writer.set(r, rmap.get(r));
              • End diff –

          Please try this patch:

          diff --git a/h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java b/h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java
          index 294ec7e..c3b1339 100644
          — a/h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java
          +++ b/h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java
          @@ -329,7 +329,7 @@ public class H2OHelper {
          Map<Integer,String> rmap = reverseMap(map);

          for (long r = 0; r < m.rowSize(); r++)

          { - writer.set(r, rmap.get(r)); + labels.chunkForRow(r).set(r, rmap.get(r)); }

          writer.close(closer);

          Show
          githubbot ASF GitHub Bot added a comment - Github user avati commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r21430555 — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java — @@ -327,8 +327,9 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) { labels = frame.anyVec().makeZero(); Vec.Writer writer = labels.open(); Map<Integer,String> rmap = reverseMap(map); - for (long r = 0; r < m.rowSize(); r++) { + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values + // TODO: need a new distributed data structure for storing String keys. + for (int r = 0; r < m.rowSize(); r++) { writer.set(r, rmap.get(r)); End diff – Please try this patch: diff --git a/h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java b/h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java index 294ec7e..c3b1339 100644 — a/h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java +++ b/h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java @@ -329,7 +329,7 @@ public class H2OHelper { Map<Integer,String> rmap = reverseMap(map); for (long r = 0; r < m.rowSize(); r++) { - writer.set(r, rmap.get(r)); + labels.chunkForRow(r).set(r, rmap.get(r)); } writer.close(closer);
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user avati commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r21430646

          — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java —
          @@ -327,8 +327,9 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) {
          labels = frame.anyVec().makeZero();
          Vec.Writer writer = labels.open();
          Map<Integer,String> rmap = reverseMap(map);
          -

          • for (long r = 0; r < m.rowSize(); r++) {
            + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values
            + // TODO: need a new distributed data structure for storing String keys.
            + for (int r = 0; r < m.rowSize(); r++) {
            writer.set(r, rmap.get(r));
              • End diff –

          Argh, the formatting got messed up. Please replace "writer.set(r, rmap.get(r));" with "labels.chunkForRow(r).set(r, rmap.get(r));"

          Show
          githubbot ASF GitHub Bot added a comment - Github user avati commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r21430646 — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java — @@ -327,8 +327,9 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) { labels = frame.anyVec().makeZero(); Vec.Writer writer = labels.open(); Map<Integer,String> rmap = reverseMap(map); - for (long r = 0; r < m.rowSize(); r++) { + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values + // TODO: need a new distributed data structure for storing String keys. + for (int r = 0; r < m.rowSize(); r++) { writer.set(r, rmap.get(r)); End diff – Argh, the formatting got messed up. Please replace "writer.set(r, rmap.get(r));" with "labels.chunkForRow(r).set(r, rmap.get(r));"
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user avati commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r21430707

          — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java —
          @@ -327,8 +327,9 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) {
          labels = frame.anyVec().makeZero();
          Vec.Writer writer = labels.open();
          Map<Integer,String> rmap = reverseMap(map);
          -

          • for (long r = 0; r < m.rowSize(); r++) {
            + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values
            + // TODO: need a new distributed data structure for storing String keys.
            + for (int r = 0; r < m.rowSize(); r++) {
            writer.set(r, rmap.get(r));
              • End diff –

          Please merge https://github.com/apache/mahout/pull/64 to fix this bug.

          Show
          githubbot ASF GitHub Bot added a comment - Github user avati commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r21430707 — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java — @@ -327,8 +327,9 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) { labels = frame.anyVec().makeZero(); Vec.Writer writer = labels.open(); Map<Integer,String> rmap = reverseMap(map); - for (long r = 0; r < m.rowSize(); r++) { + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values + // TODO: need a new distributed data structure for storing String keys. + for (int r = 0; r < m.rowSize(); r++) { writer.set(r, rmap.get(r)); End diff – Please merge https://github.com/apache/mahout/pull/64 to fix this bug.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r21431045

          — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java —
          @@ -327,9 +327,11 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) {
          labels = frame.anyVec().makeZero();
          Vec.Writer writer = labels.open();
          Map<Integer,String> rmap = reverseMap(map);
          -

          • for (long r = 0; r < m.rowSize(); r++) {
          • writer.set(r, rmap.get(r));
            + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values
            + // TODO: need a new distributed data structure for storing String keys.
            + for (int r = 0; r < m.rowSize(); r++) {
            + //writer.set(r, rmap.get(r));
            + labels.chunkForRow(r).set(r, rmap.get(r));
              • End diff –

          Thanks again for looking at this, Anand - I'm getting the following with patch applied and the h2o tests enabled. Appreciate it!

          • NB Aggregator *** FAILED ***
            java.lang.IllegalArgumentException: Not a String
            at water.fvec.Chunk.set_impl(Chunk.java:189)
            at water.fvec.Chunk.set0(Chunk.java:158)
            at water.fvec.Chunk.set(Chunk.java:105)
            at org.apache.mahout.h2obindings.H2OHelper.drmFromMatrix(H2OHelper.java:334)
            at org.apache.mahout.h2obindings.H2OEngine$.drmParallelizeWithRowLabels(H2OEngine.scala:83)
            at org.apache.mahout.math.drm.package$.drmParallelizeWithRowLabels(package.scala:67)
            at org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply$mcV$sp(NBTestBase.scala:90)
            at org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply(NBTestBase.scala:69)
            at org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply(NBTestBase.scala:69)
            at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
            ...
          • Model DFS Serialization *** FAILED ***
            java.lang.IllegalArgumentException: Not a String
            at water.fvec.Chunk.set_impl(Chunk.java:189)
            at water.fvec.Chunk.set0(Chunk.java:158)
            at water.fvec.Chunk.set(Chunk.java:105)
            at org.apache.mahout.h2obindings.H2OHelper.drmFromMatrix(H2OHelper.java:334)
            at org.apache.mahout.h2obindings.H2OEngine$.drmParallelizeWithRowLabels(H2OEngine.scala:83)
            at org.apache.mahout.math.drm.package$.drmParallelizeWithRowLabels(package.scala:67)
            at org.apache.mahout.classifier.naivebayes.NBModel.dfsWrite(NBModel.scala:144)
            at org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$3.apply$mcV$sp(NBTestBase.scala:142)
            at org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$3.apply(NBTestBase.scala:117)
            at org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$3.apply(NBTestBase.scala:117)
            ...
          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r21431045 — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java — @@ -327,9 +327,11 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) { labels = frame.anyVec().makeZero(); Vec.Writer writer = labels.open(); Map<Integer,String> rmap = reverseMap(map); - for (long r = 0; r < m.rowSize(); r++) { writer.set(r, rmap.get(r)); + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values + // TODO: need a new distributed data structure for storing String keys. + for (int r = 0; r < m.rowSize(); r++) { + //writer.set(r, rmap.get(r)); + labels.chunkForRow(r).set(r, rmap.get(r)); End diff – Thanks again for looking at this, Anand - I'm getting the following with patch applied and the h2o tests enabled. Appreciate it! NB Aggregator *** FAILED *** java.lang.IllegalArgumentException: Not a String at water.fvec.Chunk.set_impl(Chunk.java:189) at water.fvec.Chunk.set0(Chunk.java:158) at water.fvec.Chunk.set(Chunk.java:105) at org.apache.mahout.h2obindings.H2OHelper.drmFromMatrix(H2OHelper.java:334) at org.apache.mahout.h2obindings.H2OEngine$.drmParallelizeWithRowLabels(H2OEngine.scala:83) at org.apache.mahout.math.drm.package$.drmParallelizeWithRowLabels(package.scala:67) at org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply$mcV$sp(NBTestBase.scala:90) at org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply(NBTestBase.scala:69) at org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply(NBTestBase.scala:69) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) ... Model DFS Serialization *** FAILED *** java.lang.IllegalArgumentException: Not a String at water.fvec.Chunk.set_impl(Chunk.java:189) at water.fvec.Chunk.set0(Chunk.java:158) at water.fvec.Chunk.set(Chunk.java:105) at org.apache.mahout.h2obindings.H2OHelper.drmFromMatrix(H2OHelper.java:334) at org.apache.mahout.h2obindings.H2OEngine$.drmParallelizeWithRowLabels(H2OEngine.scala:83) at org.apache.mahout.math.drm.package$.drmParallelizeWithRowLabels(package.scala:67) at org.apache.mahout.classifier.naivebayes.NBModel.dfsWrite(NBModel.scala:144) at org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$3.apply$mcV$sp(NBTestBase.scala:142) at org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$3.apply(NBTestBase.scala:117) at org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$3.apply(NBTestBase.scala:117) ...
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-67242761

          I'm going to wait to push this until i figure out the best way to get all of the MRLegacy dependencies out.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-67242761 I'm going to wait to push this until i figure out the best way to get all of the MRLegacy dependencies out.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r22009635

          — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java —
          @@ -327,9 +327,11 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) {
          labels = frame.anyVec().makeZero();
          Vec.Writer writer = labels.open();
          Map<Integer,String> rmap = reverseMap(map);
          -

          • for (long r = 0; r < m.rowSize(); r++) {
          • writer.set(r, rmap.get(r));
            + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values
            + // TODO: need a new distributed data structure for storing String keys.
            + for (int r = 0; r < m.rowSize(); r++) {
            + //writer.set(r, rmap.get(r));
            + labels.chunkForRow(r).set(r, rmap.get(r));
              • End diff –

          looks like the exception is being thrown here:
          ```java
          water.fvec.Chunk.set_impl(Chunk.java:189)

          boolean set_impl (int idx, String str)

          { throw new IllegalArgumentException("Not a String"); }

          ```
          I've been looking through some of the more recent code and it seems that the `Vec.Writer.set( long i, String str)` signature has been phased out.

          We should probably address this as a separate issue since we have to update to a newer h2o-core artifact, and there may be some other issues in there too.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r22009635 — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java — @@ -327,9 +327,11 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) { labels = frame.anyVec().makeZero(); Vec.Writer writer = labels.open(); Map<Integer,String> rmap = reverseMap(map); - for (long r = 0; r < m.rowSize(); r++) { writer.set(r, rmap.get(r)); + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values + // TODO: need a new distributed data structure for storing String keys. + for (int r = 0; r < m.rowSize(); r++) { + //writer.set(r, rmap.get(r)); + labels.chunkForRow(r).set(r, rmap.get(r)); End diff – looks like the exception is being thrown here: ```java water.fvec.Chunk.set_impl(Chunk.java:189) boolean set_impl (int idx, String str) { throw new IllegalArgumentException("Not a String"); } ``` I've been looking through some of the more recent code and it seems that the `Vec.Writer.set( long i, String str)` signature has been phased out. We should probably address this as a separate issue since we have to update to a newer h2o-core artifact, and there may be some other issues in there too.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user avati commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r22009885

          — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java —
          @@ -327,9 +327,11 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) {
          labels = frame.anyVec().makeZero();
          Vec.Writer writer = labels.open();
          Map<Integer,String> rmap = reverseMap(map);
          -

          • for (long r = 0; r < m.rowSize(); r++) {
          • writer.set(r, rmap.get(r));
            + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values
            + // TODO: need a new distributed data structure for storing String keys.
            + for (int r = 0; r < m.rowSize(); r++) {
            + //writer.set(r, rmap.get(r));
            + labels.chunkForRow(r).set(r, rmap.get(r));
              • End diff –

          I am looking into this. Have been a bit busy, will get back on this soon.

          On Wed, Dec 17, 2014, 14:22 Andrew Palumbo <notifications@github.com> wrote:

          > In h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java
          > <https://github.com/apache/mahout/pull/32#discussion-diff-22009635>:
          >
          > > @@ -327,9 +327,11 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) {
          > > labels = frame.anyVec().makeZero();
          > > Vec.Writer writer = labels.open();
          > > Map<Integer,String> rmap = reverseMap(map);
          > > -
          > > - for (long r = 0; r < m.rowSize(); r++) {
          > > - writer.set(r, rmap.get(r));
          > > + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values
          > > + // TODO: need a new distributed data structure for storing String keys.
          > > + for (int r = 0; r < m.rowSize(); r++) {
          > > + //writer.set(r, rmap.get(r));
          > > + labels.chunkForRow(r).set(r, rmap.get(r));
          >
          > looks like the exception is being thrown here:
          >
          > water.fvec.Chunk.set_impl(Chunk.java:189)
          > boolean set_impl (int idx, String str)

          { throw new IllegalArgumentException("Not a String"); }

          >
          > I've been looking through some of the more recent code and it seems that
          > the Vec.Writer.set( long i, String str) signature has been phased out.
          >
          > We should probably address this as a separate issue since we have to
          > update to a newer h2o-core artifact, and there may be some other issues in
          > there too.
          >
          > —
          > Reply to this email directly or view it on GitHub
          > <https://github.com/apache/mahout/pull/32/files#r22009635>.
          >

          Show
          githubbot ASF GitHub Bot added a comment - Github user avati commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r22009885 — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java — @@ -327,9 +327,11 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) { labels = frame.anyVec().makeZero(); Vec.Writer writer = labels.open(); Map<Integer,String> rmap = reverseMap(map); - for (long r = 0; r < m.rowSize(); r++) { writer.set(r, rmap.get(r)); + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values + // TODO: need a new distributed data structure for storing String keys. + for (int r = 0; r < m.rowSize(); r++) { + //writer.set(r, rmap.get(r)); + labels.chunkForRow(r).set(r, rmap.get(r)); End diff – I am looking into this. Have been a bit busy, will get back on this soon. On Wed, Dec 17, 2014, 14:22 Andrew Palumbo <notifications@github.com> wrote: > In h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java > < https://github.com/apache/mahout/pull/32#discussion-diff-22009635 >: > > > @@ -327,9 +327,11 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) { > > labels = frame.anyVec().makeZero(); > > Vec.Writer writer = labels.open(); > > Map<Integer,String> rmap = reverseMap(map); > > - > > - for (long r = 0; r < m.rowSize(); r++) { > > - writer.set(r, rmap.get(r)); > > + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values > > + // TODO: need a new distributed data structure for storing String keys. > > + for (int r = 0; r < m.rowSize(); r++) { > > + //writer.set(r, rmap.get(r)); > > + labels.chunkForRow(r).set(r, rmap.get(r)); > > looks like the exception is being thrown here: > > water.fvec.Chunk.set_impl(Chunk.java:189) > boolean set_impl (int idx, String str) { throw new IllegalArgumentException("Not a String"); } > > I've been looking through some of the more recent code and it seems that > the Vec.Writer.set( long i, String str) signature has been phased out. > > We should probably address this as a separate issue since we have to > update to a newer h2o-core artifact, and there may be some other issues in > there too. > > — > Reply to this email directly or view it on GitHub > < https://github.com/apache/mahout/pull/32/files#r22009635 >. >
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/32#discussion_r22010435

          — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java —
          @@ -327,9 +327,11 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) {
          labels = frame.anyVec().makeZero();
          Vec.Writer writer = labels.open();
          Map<Integer,String> rmap = reverseMap(map);
          -

          • for (long r = 0; r < m.rowSize(); r++) {
          • writer.set(r, rmap.get(r));
            + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values
            + // TODO: need a new distributed data structure for storing String keys.
            + for (int r = 0; r < m.rowSize(); r++) {
            + //writer.set(r, rmap.get(r));
            + labels.chunkForRow(r).set(r, rmap.get(r));
              • End diff –

          Cool thanks.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r22010435 — Diff: h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java — @@ -327,9 +327,11 @@ public static H2ODrm drmFromMatrix(Matrix m, int minHint, int exactHint) { labels = frame.anyVec().makeZero(); Vec.Writer writer = labels.open(); Map<Integer,String> rmap = reverseMap(map); - for (long r = 0; r < m.rowSize(); r++) { writer.set(r, rmap.get(r)); + // TODO: fix BUG here... h20 water.fvec.Vec does not accept String values + // TODO: need a new distributed data structure for storing String keys. + for (int r = 0; r < m.rowSize(); r++) { + //writer.set(r, rmap.get(r)); + labels.chunkForRow(r).set(r, rmap.get(r)); End diff – Cool thanks.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/32#issuecomment-67587905

          All MRLegacy dependencies are now removed. I ported `ConfusionMatrix`,`ClassifierResult` and `ResultAnalyzer` over to a new package `o.a.m.classifier.stats`. They had some dependencies on Statistics classes in `o.a.m.cf.taste.impl.common` so i brought them over too. They're all living in the `o.a.m.classifier.stats` package for now- but it may be useful to move them in the future. They still need documentation. I'll probably do a couple more cleanup checks and then push this in the next day or two.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-67587905 All MRLegacy dependencies are now removed. I ported `ConfusionMatrix`,`ClassifierResult` and `ResultAnalyzer` over to a new package `o.a.m.classifier.stats`. They had some dependencies on Statistics classes in `o.a.m.cf.taste.impl.common` so i brought them over too. They're all living in the `o.a.m.classifier.stats` package for now- but it may be useful to move them in the future. They still need documentation. I'll probably do a couple more cleanup checks and then push this in the next day or two.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/mahout/pull/32

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/mahout/pull/32
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Mahout-Quality #2900 (See https://builds.apache.org/job/Mahout-Quality/2900/)
          MAHOUT-1493 Port Naive Bayes to Scala DSL (apalumbo) closes apache/mahout#32 (ap.dev: rev 310534319ae8df4cd95c3ff044afb6afdaee2605)

          • spark/src/main/scala/org/apache/mahout/drivers/TestNBDriver.scala
          • math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NBModel.scala
          • math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NBClassifier.scala
          • bin/mahout
          • math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NaiveBayes.scala
          • spark/src/main/scala/org/apache/mahout/classifier/naivebayes/SparkNaiveBayes.scala
          • math-scala/src/main/scala/org/apache/mahout/classifier/stats/ClassifierStats.scala
          • math-scala/src/test/scala/org/apache/mahout/classifier/stats/ClassifierStatsTestBase.scala
          • spark/src/test/scala/org/apache/mahout/classifier/naivebayes/NBSparkTestSuite.scala
          • math-scala/src/main/scala/org/apache/mahout/classifier/stats/ConfusionMatrix.scala
          • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/package.scala
          • h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java
          • CHANGELOG
          • math-scala/src/test/scala/org/apache/mahout/classifier/naivebayes/NBTestBase.scala
          • spark/src/test/scala/org/apache/mahout/classifier/stats/ClassifierStatsSparkTestSuite.scala
          • examples/bin/classify-20newsgroups.sh
          • spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/MahoutSparkILoop.scala
          • spark/src/main/scala/org/apache/mahout/drivers/TrainNBDriver.scala
          • mrlegacy/src/test/java/org/apache/mahout/classifier/ConfusionMatrixTest.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Mahout-Quality #2900 (See https://builds.apache.org/job/Mahout-Quality/2900/ ) MAHOUT-1493 Port Naive Bayes to Scala DSL (apalumbo) closes apache/mahout#32 (ap.dev: rev 310534319ae8df4cd95c3ff044afb6afdaee2605) spark/src/main/scala/org/apache/mahout/drivers/TestNBDriver.scala math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NBModel.scala math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NBClassifier.scala bin/mahout math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NaiveBayes.scala spark/src/main/scala/org/apache/mahout/classifier/naivebayes/SparkNaiveBayes.scala math-scala/src/main/scala/org/apache/mahout/classifier/stats/ClassifierStats.scala math-scala/src/test/scala/org/apache/mahout/classifier/stats/ClassifierStatsTestBase.scala spark/src/test/scala/org/apache/mahout/classifier/naivebayes/NBSparkTestSuite.scala math-scala/src/main/scala/org/apache/mahout/classifier/stats/ConfusionMatrix.scala spark/src/main/scala/org/apache/mahout/sparkbindings/drm/package.scala h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java CHANGELOG math-scala/src/test/scala/org/apache/mahout/classifier/naivebayes/NBTestBase.scala spark/src/test/scala/org/apache/mahout/classifier/stats/ClassifierStatsSparkTestSuite.scala examples/bin/classify-20newsgroups.sh spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/MahoutSparkILoop.scala spark/src/main/scala/org/apache/mahout/drivers/TrainNBDriver.scala mrlegacy/src/test/java/org/apache/mahout/classifier/ConfusionMatrixTest.java
          Hide
          Andrew_Palumbo Andrew Palumbo added a comment -

          I will still be adding tests, scaladocs and doing some refactoring. As well as more work on the CLI drivers.

          Show
          Andrew_Palumbo Andrew Palumbo added a comment - I will still be adding tests, scaladocs and doing some refactoring. As well as more work on the CLI drivers.
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Mahout-Quality #2901 (See https://builds.apache.org/job/Mahout-Quality/2901/)
          MAHOUT-1493 fix changelog and remove temporary hardcoded settings from spark-shell (ap.dev: rev 0f037cb03e77c09627c36962600e4fb00c4013c8)

          • spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/MahoutSparkILoop.scala
          • CHANGELOG
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Mahout-Quality #2901 (See https://builds.apache.org/job/Mahout-Quality/2901/ ) MAHOUT-1493 fix changelog and remove temporary hardcoded settings from spark-shell (ap.dev: rev 0f037cb03e77c09627c36962600e4fb00c4013c8) spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/MahoutSparkILoop.scala CHANGELOG
          Hide
          pferrel Pat Ferrel added a comment -

          I did some simplification of the drivers which made your driver.start override unnecessary. So i commented them out. They do no harm but would if the default config ever changed.

          Pardon my ignorance of the naive bayes driver but do we really want to keep using sequence files? Even for spark-rowsimilarity we use delimited text to encode a DRM, adding the ability to have user specific IDs. This has the benefit of hiding the Mahout IDs from a CLI user, which seems to be the source of a great number of mistakes and mailing list questions.

          The internal item and row similarity code is structured around rdd backed DRMs so the DSL is honored. But for the CLI text makes input human readable and language neutral.

          Since we don't save around intermediate files anymore do we have to live with the binary format going forward?

          Show
          pferrel Pat Ferrel added a comment - I did some simplification of the drivers which made your driver.start override unnecessary. So i commented them out. They do no harm but would if the default config ever changed. Pardon my ignorance of the naive bayes driver but do we really want to keep using sequence files? Even for spark-rowsimilarity we use delimited text to encode a DRM, adding the ability to have user specific IDs. This has the benefit of hiding the Mahout IDs from a CLI user, which seems to be the source of a great number of mistakes and mailing list questions. The internal item and row similarity code is structured around rdd backed DRMs so the DSL is honored. But for the CLI text makes input human readable and language neutral. Since we don't save around intermediate files anymore do we have to live with the binary format going forward?
          Hide
          Andrew_Palumbo Andrew Palumbo added a comment - - edited

          Thanks for looking at this Pat. My thinking up to now as far as using Sequence files is that the input to Naive Bayes (for now) would be the output from seq2sparse; a <Text,VectorWritable> Sequence file. This is really just to keep it simple for now and to, and the CLI drivers that I wrote are a rough cut that I basically copied and pasted from your Item-similarity drivers (very cool by the way).

          I'm all for taking other input formats. Especially since were hopefully moving on from MR seq2sparse soon. I need to read up on what you did with the Readers and Writers.

          The issue that alot of people are having on the list with MRLegacy NB (if I'm think of the issue same as you mentioned) is that each Vectorized Document has to contain a Category in its Key. Since the Mahout Process for text vectorization has been to seperate documents into directorys by their respective categories, then run `mahout seqdirectory` which converts each document into into a sequence file with \directory\doc_id as a key then seq2sparse, ... ect... So since the category extraction step in MRLegacy NB was hardcoded as a split on "\" and labeling the document as with the first token after the split (the directory id). I've kind of taken this as the convention.

          Internally the documents ids are discarded and the TF-IDF weights are aggregated by category and then individually discarded.

          I've tried to relax the Category extraction convention in the DSL Naive Bayes by allowing the user developer the ability to pass an arbitrary `String => String` function in the aggregation constructor. This way the Labels can be extracted by e.g. a regex pattern. I'd like to incorporate this as an option into the CLI driver but haven't really made it this far yet.

          So I think that with any format, there may be some confusion over the document labeling. But I agree that we should support other file formats as input.

          Show
          Andrew_Palumbo Andrew Palumbo added a comment - - edited Thanks for looking at this Pat. My thinking up to now as far as using Sequence files is that the input to Naive Bayes (for now) would be the output from seq2sparse; a <Text,VectorWritable> Sequence file. This is really just to keep it simple for now and to, and the CLI drivers that I wrote are a rough cut that I basically copied and pasted from your Item-similarity drivers (very cool by the way). I'm all for taking other input formats. Especially since were hopefully moving on from MR seq2sparse soon. I need to read up on what you did with the Readers and Writers. The issue that alot of people are having on the list with MRLegacy NB (if I'm think of the issue same as you mentioned) is that each Vectorized Document has to contain a Category in its Key. Since the Mahout Process for text vectorization has been to seperate documents into directorys by their respective categories, then run `mahout seqdirectory` which converts each document into into a sequence file with \directory\doc_id as a key then seq2sparse, ... ect... So since the category extraction step in MRLegacy NB was hardcoded as a split on "\" and labeling the document as with the first token after the split (the directory id). I've kind of taken this as the convention. Internally the documents ids are discarded and the TF-IDF weights are aggregated by category and then individually discarded. I've tried to relax the Category extraction convention in the DSL Naive Bayes by allowing the user developer the ability to pass an arbitrary `String => String` function in the aggregation constructor. This way the Labels can be extracted by e.g. a regex pattern. I'd like to incorporate this as an option into the CLI driver but haven't really made it this far yet. So I think that with any format, there may be some confusion over the document labeling. But I agree that we should support other file formats as input.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user andrewpalumbo opened a pull request:

          https://github.com/apache/mahout/pull/72

          MAHOUT-1493-h2o: Enable Naive Bayes Tests in h2o Module

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493-h2o

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/mahout/pull/72.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #72


          commit 0b204bbd3a6f8e532cf79f5d1a761e57c214b0d5
          Author: Andrew Palumbo <apalumbo@apache.org>
          Date: 2015-01-26T21:15:02Z

          Enable Naive Bayes Tests in h2o Module


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user andrewpalumbo opened a pull request: https://github.com/apache/mahout/pull/72 MAHOUT-1493 -h2o: Enable Naive Bayes Tests in h2o Module You can merge this pull request into a Git repository by running: $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493 -h2o Alternatively you can review and apply these changes as the patch at: https://github.com/apache/mahout/pull/72.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #72 commit 0b204bbd3a6f8e532cf79f5d1a761e57c214b0d5 Author: Andrew Palumbo <apalumbo@apache.org> Date: 2015-01-26T21:15:02Z Enable Naive Bayes Tests in h2o Module
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/72#issuecomment-71567408

          Will merge after MAHOUT-1626 is resolved.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/72#issuecomment-71567408 Will merge after MAHOUT-1626 is resolved.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo closed the pull request at:

          https://github.com/apache/mahout/pull/72

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo closed the pull request at: https://github.com/apache/mahout/pull/72
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user andrewpalumbo reopened a pull request:

          https://github.com/apache/mahout/pull/72

          MAHOUT-1493-h2o: Enable Naive Bayes Tests in h2o Module

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493-h2o

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/mahout/pull/72.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #72


          commit 0b204bbd3a6f8e532cf79f5d1a761e57c214b0d5
          Author: Andrew Palumbo <apalumbo@apache.org>
          Date: 2015-01-26T21:15:02Z

          Enable Naive Bayes Tests in h2o Module


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user andrewpalumbo reopened a pull request: https://github.com/apache/mahout/pull/72 MAHOUT-1493 -h2o: Enable Naive Bayes Tests in h2o Module You can merge this pull request into a Git repository by running: $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493 -h2o Alternatively you can review and apply these changes as the patch at: https://github.com/apache/mahout/pull/72.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #72 commit 0b204bbd3a6f8e532cf79f5d1a761e57c214b0d5 Author: Andrew Palumbo <apalumbo@apache.org> Date: 2015-01-26T21:15:02Z Enable Naive Bayes Tests in h2o Module
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/72#issuecomment-71569572

          Will merge after MAHOUT-1638 is resolved.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/72#issuecomment-71569572 Will merge after MAHOUT-1638 is resolved.
          Hide
          Andrew_Palumbo Andrew Palumbo added a comment -

          Still have a couple of things to do: General Cleanup, CLI changes doc comments, etc. Will do these under a new JIRA.

          Show
          Andrew_Palumbo Andrew Palumbo added a comment - Still have a couple of things to do: General Cleanup, CLI changes doc comments, etc. Will do these under a new JIRA.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the pull request:

          https://github.com/apache/mahout/pull/78#issuecomment-85733101

          BTW @andrewpalumbo could you please change the header to show the issue like this: MAHOUT-1493, then close and repopen it. Once re-opened, @asfgit bot will start synching jira to reviews and backwards.

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/78#issuecomment-85733101 BTW @andrewpalumbo could you please change the header to show the issue like this: MAHOUT-1493 , then close and repopen it. Once re-opened, @asfgit bot will start synching jira to reviews and backwards.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user andrewpalumbo reopened a pull request:

          https://github.com/apache/mahout/pull/78

          MAHOUT-1493 driver cleanup

          This removes parseIndexedDatasetFormatOptions() and some other option parsers not used by `spark-trainnb` and `spark-testnb`. There are still a couple of options that I need to add but first want to figure out how to handle the required `-o` option.

          Currently `spark-testnb` doesn't have any actual output. I was thinking about writing a confusion matrix as Text out to a file.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493-driver-cleanup

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/mahout/pull/78.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #78


          commit d0efc75b44b749081fbc8395f4520bc38e8babc3
          Author: Andrew Palumbo <apalumbo@apache.org>
          Date: 2015-03-06T23:23:33Z

          Remove unused parsers

          commit 7a9dbb2f29d775f07b65c427da0fa029e3a4a4b5
          Author: Andrew Palumbo <apalumbo@apache.org>
          Date: 2015-03-06T23:45:23Z

          remove requirement for -o option from MahoutOptionParser

          commit 02de3edb5e408aa221e6e60121ab8fca11c72133
          Author: Andrew Palumbo <apalumbo@apache.org>
          Date: 2015-03-07T00:07:17Z

          pass complementary option through to training. cleanup.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user andrewpalumbo reopened a pull request: https://github.com/apache/mahout/pull/78 MAHOUT-1493 driver cleanup This removes parseIndexedDatasetFormatOptions() and some other option parsers not used by `spark-trainnb` and `spark-testnb`. There are still a couple of options that I need to add but first want to figure out how to handle the required `-o` option. Currently `spark-testnb` doesn't have any actual output. I was thinking about writing a confusion matrix as Text out to a file. You can merge this pull request into a Git repository by running: $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493 -driver-cleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/mahout/pull/78.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #78 commit d0efc75b44b749081fbc8395f4520bc38e8babc3 Author: Andrew Palumbo <apalumbo@apache.org> Date: 2015-03-06T23:23:33Z Remove unused parsers commit 7a9dbb2f29d775f07b65c427da0fa029e3a4a4b5 Author: Andrew Palumbo <apalumbo@apache.org> Date: 2015-03-06T23:45:23Z remove requirement for -o option from MahoutOptionParser commit 02de3edb5e408aa221e6e60121ab8fca11c72133 Author: Andrew Palumbo <apalumbo@apache.org> Date: 2015-03-07T00:07:17Z pass complementary option through to training. cleanup.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the pull request:

          https://github.com/apache/mahout/pull/78#issuecomment-85755842

          well if you decide to get another issue, just edit the header again and re-open again. since current issue seems to be closed. or just re-open the current jira issue.

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/78#issuecomment-85755842 well if you decide to get another issue, just edit the header again and re-open again. since current issue seems to be closed. or just re-open the current jira issue.
          Hide
          Andrew_Palumbo Andrew Palumbo added a comment -

          Reopening to finish some work and clean-up here:

          • fix #78 based on Pat's comments and push
          • add CLI options for aplha_i and -overwrite (save Regex key parser option for 0.10.1)
          • Currently testNB brings all code into Memory up front and runs sequentially (would like to get this fixed for 0.10.0)
            • Work is done except Classifier classes need to be fully serializable to mapBlock closures
          • Clean up code, comments, etc.
          Show
          Andrew_Palumbo Andrew Palumbo added a comment - Reopening to finish some work and clean-up here: fix #78 based on Pat's comments and push add CLI options for aplha_i and -overwrite (save Regex key parser option for 0.10.1) Currently testNB brings all code into Memory up front and runs sequentially (would like to get this fixed for 0.10.0) Work is done except Classifier classes need to be fully serializable to mapBlock closures Clean up code, comments, etc.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user avati commented on the pull request:

          https://github.com/apache/mahout/pull/72#issuecomment-89169192

          Please find fix for MAHOUT-1638 in #99. I have tested this PR on top of that patch and NB tests work fine.

          Show
          githubbot ASF GitHub Bot added a comment - Github user avati commented on the pull request: https://github.com/apache/mahout/pull/72#issuecomment-89169192 Please find fix for MAHOUT-1638 in #99. I have tested this PR on top of that patch and NB tests work fine.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/mahout/pull/72

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/mahout/pull/72
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Mahout-Quality #3045 (See https://builds.apache.org/job/Mahout-Quality/3045/)
          MAHOUT-1638: H2O bindings fail at drmParallelizeWithRowLabels(...) closes apache/mahout#99 and MAHOUT-1493-h2o closes apache/mahout#72 (apalumbo: rev 5197ac9e8393763dd4ed58f5f1e9f167e75d3078)

          • CHANGELOG
          • h2o/src/main/java/org/apache/mahout/h2obindings/H2OHdfs.java
          • h2o/src/test/scala/org/apache/mahout/classifier/stats/ClassifierStatsH2OTestSuite.scala
          • h2o/src/main/scala/org/apache/mahout/h2obindings/ops/MapBlockHelper.scala
          • h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java
          • h2o/src/test/scala/org/apache/mahout/classifier/naivebayes/NBH2OTestSuite.scala
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Mahout-Quality #3045 (See https://builds.apache.org/job/Mahout-Quality/3045/ ) MAHOUT-1638 : H2O bindings fail at drmParallelizeWithRowLabels(...) closes apache/mahout#99 and MAHOUT-1493 -h2o closes apache/mahout#72 (apalumbo: rev 5197ac9e8393763dd4ed58f5f1e9f167e75d3078) CHANGELOG h2o/src/main/java/org/apache/mahout/h2obindings/H2OHdfs.java h2o/src/test/scala/org/apache/mahout/classifier/stats/ClassifierStatsH2OTestSuite.scala h2o/src/main/scala/org/apache/mahout/h2obindings/ops/MapBlockHelper.scala h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java h2o/src/test/scala/org/apache/mahout/classifier/naivebayes/NBH2OTestSuite.scala
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the pull request:

          https://github.com/apache/mahout/pull/78#issuecomment-89632105

          fixing in a different branch

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/78#issuecomment-89632105 fixing in a different branch
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo closed the pull request at:

          https://github.com/apache/mahout/pull/78

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo closed the pull request at: https://github.com/apache/mahout/pull/78
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Mahout-Quality #3057 (See https://builds.apache.org/job/Mahout-Quality/3057/)
          MAHOUT-1493: Naive Bayes CLI cleanup closes apache/mahout#102 (apalumbo: rev 1bcda32146713144eac4889c3065c448417c36f6)

          • spark/src/main/scala/org/apache/mahout/drivers/TrainNBDriver.scala
          • spark/src/main/scala/org/apache/mahout/drivers/TestNBDriver.scala
          • examples/bin/classify-20newsgroups.sh
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Mahout-Quality #3057 (See https://builds.apache.org/job/Mahout-Quality/3057/ ) MAHOUT-1493 : Naive Bayes CLI cleanup closes apache/mahout#102 (apalumbo: rev 1bcda32146713144eac4889c3065c448417c36f6) spark/src/main/scala/org/apache/mahout/drivers/TrainNBDriver.scala spark/src/main/scala/org/apache/mahout/drivers/TestNBDriver.scala examples/bin/classify-20newsgroups.sh
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user andrewpalumbo opened a pull request:

          https://github.com/apache/mahout/pull/104

          MAHOUT-1493 parallelize SparkNaiveBayes.test(...)

          Explicitly define math-scala NaiveBayes.test(...) as sequential and in memory. Extend test(..) into SparkNaiveBayes and distribute the classification process. Also some general cleanup.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493-serialize

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/mahout/pull/104.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #104


          commit 98fc94484a408ecc9e433babaee772258ba9bae9
          Author: Andrew Palumbo <apalumbo@apache.org>
          Date: 2015-04-05T01:27:47Z

          add a prefix directory for model.dfsWrite(..) to write components. added tests for full training and testing of seeded random toy TFIDF data

          commit 1049b461464ab27a56f742d96c5b5c73b040ec1b
          Author: Andrew Palumbo <apalumbo@apache.org>
          Date: 2015-04-05T01:33:23Z

          use SparkNaiveBayes rather than NaiveBayes in CLI drivers to avoid confusion

          commit 6c5dc8f3359255f85fbbbe2ffb24681fa489255d
          Author: Andrew Palumbo <apalumbo@apache.org>
          Date: 2015-04-05T02:40:44Z

          override NaiveBayes.test in Spark and broadcast the classifier to the closure. Now is no longer pulling everything into memory up frot

          commit 2522a029ecbf6389f09ab69f80f05a298bcd7ea0
          Author: Andrew Palumbo <apalumbo@apache.org>
          Date: 2015-04-05T02:55:08Z

          Make math-scala NaiveBayes.test(...) explictly sequential

          commit bad6d9a06e3e04cfdafcce7d1adc2066ecd991b7
          Author: Andrew Palumbo <apalumbo@apache.org>
          Date: 2015-04-05T03:29:36Z

          Cleanup


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user andrewpalumbo opened a pull request: https://github.com/apache/mahout/pull/104 MAHOUT-1493 parallelize SparkNaiveBayes.test(...) Explicitly define math-scala NaiveBayes.test(...) as sequential and in memory. Extend test(..) into SparkNaiveBayes and distribute the classification process. Also some general cleanup. You can merge this pull request into a Git repository by running: $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493 -serialize Alternatively you can review and apply these changes as the patch at: https://github.com/apache/mahout/pull/104.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #104 commit 98fc94484a408ecc9e433babaee772258ba9bae9 Author: Andrew Palumbo <apalumbo@apache.org> Date: 2015-04-05T01:27:47Z add a prefix directory for model.dfsWrite(..) to write components. added tests for full training and testing of seeded random toy TFIDF data commit 1049b461464ab27a56f742d96c5b5c73b040ec1b Author: Andrew Palumbo <apalumbo@apache.org> Date: 2015-04-05T01:33:23Z use SparkNaiveBayes rather than NaiveBayes in CLI drivers to avoid confusion commit 6c5dc8f3359255f85fbbbe2ffb24681fa489255d Author: Andrew Palumbo <apalumbo@apache.org> Date: 2015-04-05T02:40:44Z override NaiveBayes.test in Spark and broadcast the classifier to the closure. Now is no longer pulling everything into memory up frot commit 2522a029ecbf6389f09ab69f80f05a298bcd7ea0 Author: Andrew Palumbo <apalumbo@apache.org> Date: 2015-04-05T02:55:08Z Make math-scala NaiveBayes.test(...) explictly sequential commit bad6d9a06e3e04cfdafcce7d1adc2066ecd991b7 Author: Andrew Palumbo <apalumbo@apache.org> Date: 2015-04-05T03:29:36Z Cleanup
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/mahout/pull/104

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/mahout/pull/104
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Mahout-Quality #3064 (See https://builds.apache.org/job/Mahout-Quality/3064/)
          MAHOUT-1493: parallelize SparkNaiveBayes.test(...) closes apache/mahout#104 (apalumbo: rev 1be5a00c809fe0c7d26fcad24536428b5f0b564e)

          • spark/src/main/scala/org/apache/mahout/drivers/TestNBDriver.scala
          • math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NaiveBayes.scala
          • spark/src/main/scala/org/apache/mahout/classifier/naivebayes/SparkNaiveBayes.scala
          • math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NBModel.scala
          • spark/src/main/scala/org/apache/mahout/sparkbindings/io/MahoutKryoRegistrator.scala
          • spark/src/test/scala/org/apache/mahout/classifier/naivebayes/NBSparkTestSuite.scala
          • spark/src/main/scala/org/apache/mahout/drivers/TrainNBDriver.scala
          • math-scala/src/test/scala/org/apache/mahout/classifier/naivebayes/NBTestBase.scala
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Mahout-Quality #3064 (See https://builds.apache.org/job/Mahout-Quality/3064/ ) MAHOUT-1493 : parallelize SparkNaiveBayes.test(...) closes apache/mahout#104 (apalumbo: rev 1be5a00c809fe0c7d26fcad24536428b5f0b564e) spark/src/main/scala/org/apache/mahout/drivers/TestNBDriver.scala math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NaiveBayes.scala spark/src/main/scala/org/apache/mahout/classifier/naivebayes/SparkNaiveBayes.scala math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NBModel.scala spark/src/main/scala/org/apache/mahout/sparkbindings/io/MahoutKryoRegistrator.scala spark/src/test/scala/org/apache/mahout/classifier/naivebayes/NBSparkTestSuite.scala spark/src/main/scala/org/apache/mahout/drivers/TrainNBDriver.scala math-scala/src/test/scala/org/apache/mahout/classifier/naivebayes/NBTestBase.scala
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user andrewpalumbo opened a pull request:

          https://github.com/apache/mahout/pull/111

          MAHOUT-1493: Add CLI options for --overwrite and --alphaI to NB Drivers

          Presently `mahout spark-trainnb` will not complete if a model already exists in the output directory. These last options add in an `--overwrite` option to overwrite a model in the given output directory.

          as well:

          • add `.par(auto = true)` to the input Drm
          • ads a `delete(...)` method to `Hadppo1HDFSUtils` which does not handle any IO exceptions
          • adds an almost trivial `--alphaI` option to set the Laplace smoothing factor from the CLI

          This patch will complete the full port of the old MapReduce Naive Bayes to the `math-scala` and `spark` modules.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493g

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/mahout/pull/111.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #111


          commit 37f40e1ba32871227cc025effe87b135b5c28f31
          Author: Andrew Palumbo <apalumbo@apache.org>
          Date: 2015-04-05T20:56:48Z

          set .par(auto = true) on input data in CLI Drivers

          commit 149b0d0ef4ee5ea6e161b669f09d4fa2eeb2fff3
          Author: Andrew Palumbo <apalumbo@apache.org>
          Date: 2015-04-05T22:12:27Z

          add CLI driver options for -ow and -alphaI. added a delete(...) method in hHadoop1HDFSUtil.

          commit 59ca8c1b2e960b5bfc10de39d5cd8e2bb0042c12
          Author: Andrew Palumbo <apalumbo@apache.org>
          Date: 2015-04-05T22:14:58Z

          Adjust Example accordingly


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user andrewpalumbo opened a pull request: https://github.com/apache/mahout/pull/111 MAHOUT-1493 : Add CLI options for --overwrite and --alphaI to NB Drivers Presently `mahout spark-trainnb` will not complete if a model already exists in the output directory. These last options add in an `--overwrite` option to overwrite a model in the given output directory. as well: add `.par(auto = true)` to the input Drm ads a `delete(...)` method to `Hadppo1HDFSUtils` which does not handle any IO exceptions adds an almost trivial `--alphaI` option to set the Laplace smoothing factor from the CLI This patch will complete the full port of the old MapReduce Naive Bayes to the `math-scala` and `spark` modules. You can merge this pull request into a Git repository by running: $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1493 g Alternatively you can review and apply these changes as the patch at: https://github.com/apache/mahout/pull/111.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #111 commit 37f40e1ba32871227cc025effe87b135b5c28f31 Author: Andrew Palumbo <apalumbo@apache.org> Date: 2015-04-05T20:56:48Z set .par(auto = true) on input data in CLI Drivers commit 149b0d0ef4ee5ea6e161b669f09d4fa2eeb2fff3 Author: Andrew Palumbo <apalumbo@apache.org> Date: 2015-04-05T22:12:27Z add CLI driver options for -ow and -alphaI. added a delete(...) method in hHadoop1HDFSUtil. commit 59ca8c1b2e960b5bfc10de39d5cd8e2bb0042c12 Author: Andrew Palumbo <apalumbo@apache.org> Date: 2015-04-05T22:14:58Z Adjust Example accordingly
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/111#discussion_r27781100

          — Diff: spark/src/main/scala/org/apache/mahout/drivers/TestNBDriver.scala —
          @@ -76,7 +77,7 @@ object TestNBDriver extends MahoutSparkDriver {
          /** Read the test set from inputPath/part-x-00000 sequence file of form <Text,VectorWritable> */
          private def readTestSet: DrmLike[_] = {
          val inputPath = parser.opts("input").asInstanceOf[String]

          • val trainingSet = drm.drmDfsRead(inputPath)
            + val trainingSet = drm.drmDfsRead(inputPath).par(auto = true)
              • End diff –

          @dlyubimov, @pferrel -does adding `.par(auto = true)` here make sense? Otherwise there are no explicit partitioning instructions for the input data.

          Show
          githubbot ASF GitHub Bot added a comment - Github user andrewpalumbo commented on a diff in the pull request: https://github.com/apache/mahout/pull/111#discussion_r27781100 — Diff: spark/src/main/scala/org/apache/mahout/drivers/TestNBDriver.scala — @@ -76,7 +77,7 @@ object TestNBDriver extends MahoutSparkDriver { /** Read the test set from inputPath/part-x-00000 sequence file of form <Text,VectorWritable> */ private def readTestSet: DrmLike [_] = { val inputPath = parser.opts("input").asInstanceOf [String] val trainingSet = drm.drmDfsRead(inputPath) + val trainingSet = drm.drmDfsRead(inputPath).par(auto = true) End diff – @dlyubimov, @pferrel -does adding `.par(auto = true)` here make sense? Otherwise there are no explicit partitioning instructions for the input data.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/mahout/pull/111

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/mahout/pull/111
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Mahout-Quality #3069 (See https://builds.apache.org/job/Mahout-Quality/3069/)
          MAHOUT-1493: Add CLI options for --overwrite and --alphaI to NB Drivers. closes apache/mahout#111 (apalumbo: rev 30fe6cc83f83578c922d8c01742ae7a6939280e8)

          • math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NBModel.scala
          • spark/src/main/scala/org/apache/mahout/drivers/TestNBDriver.scala
          • examples/bin/classify-20newsgroups.sh
          • spark/src/main/scala/org/apache/mahout/drivers/TrainNBDriver.scala
          • spark/src/main/scala/org/apache/mahout/common/Hadoop1HDFSUtil.scala
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Mahout-Quality #3069 (See https://builds.apache.org/job/Mahout-Quality/3069/ ) MAHOUT-1493 : Add CLI options for --overwrite and --alphaI to NB Drivers. closes apache/mahout#111 (apalumbo: rev 30fe6cc83f83578c922d8c01742ae7a6939280e8) math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NBModel.scala spark/src/main/scala/org/apache/mahout/drivers/TestNBDriver.scala examples/bin/classify-20newsgroups.sh spark/src/main/scala/org/apache/mahout/drivers/TrainNBDriver.scala spark/src/main/scala/org/apache/mahout/common/Hadoop1HDFSUtil.scala
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Mahout-Quality #3070 (See https://builds.apache.org/job/Mahout-Quality/3070/)
          MAHOUT-1493: Fixed incorrect alphaI in NBModel (apalumbo: rev ebb39fe4ae917188b7a1ac235fd8d04e30c92cba)

          • math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NBModel.scala
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Mahout-Quality #3070 (See https://builds.apache.org/job/Mahout-Quality/3070/ ) MAHOUT-1493 : Fixed incorrect alphaI in NBModel (apalumbo: rev ebb39fe4ae917188b7a1ac235fd8d04e30c92cba) math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NBModel.scala
          Hide
          sslavic Stevo Slavic added a comment -

          Bulk closing all 0.10.0 resolved issues

          Show
          sslavic Stevo Slavic added a comment - Bulk closing all 0.10.0 resolved issues

            People

            • Assignee:
              Andrew_Palumbo Andrew Palumbo
              Reporter:
              ssc Sebastian Schelter
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development