Mahout / MAHOUT-228

Need sequential logistic regression implementation using SGD techniques

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4
    • Component/s: Classification
    • Labels: None

      Description

      Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).

      I often need to have a logistic regression in Java as well, so that is a reasonable place to start.
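
      For context, below is a minimal sketch of the kind of sequential SGD update for binary logistic regression this issue asks for. It is purely illustrative; the class and method names are invented here and do not come from any attached patch.

      // Minimal, self-contained sketch of sequential SGD for binary logistic regression.
      // Illustrative only; this is not the code from the attached patches.
      public class SgdLogisticSketch {
        private final double[] beta;        // one coefficient per feature
        private final double learningRate;

        public SgdLogisticSketch(int numFeatures, double learningRate) {
          this.beta = new double[numFeatures];
          this.learningRate = learningRate;
        }

        /** Probability that the target is 1 for a dense feature vector x. */
        public double classify(double[] x) {
          double dot = 0;
          for (int i = 0; i < beta.length; i++) {
            dot += beta[i] * x[i];
          }
          return 1.0 / (1.0 + Math.exp(-dot));   // logistic link
        }

        /** One online update using the gradient of the log likelihood for a single example. */
        public void train(double[] x, int y) {
          double p = classify(x);
          for (int i = 0; i < beta.length; i++) {
            beta[i] += learningRate * (y - p) * x[i];
          }
        }

        public static void main(String[] args) {
          SgdLogisticSketch learner = new SgdLogisticSketch(3, 0.1);
          double[][] xs = {{1, 0, 1}, {1, 1, 0}, {1, 0, 0}, {1, 1, 1}};
          int[] ys = {0, 1, 0, 1};
          for (int pass = 0; pass < 100; pass++) {
            for (int i = 0; i < xs.length; i++) {
              learner.train(xs[i], ys[i]);
            }
          }
          System.out.println(learner.classify(new double[] {1, 1, 1}));
        }
      }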

      Attachments

      1. MAHOUT-228-interfaces.patch (9 kB) - Ted Dunning
      2. TrainLogisticTest.patch (3 kB) - Drew Farris
      3. MAHOUT-228.patch (172 kB) - Ted Dunning
      4. MAHOUT-228.patch (167 kB) - Ted Dunning
      5. MAHOUT-228.patch (78 kB) - Ted Dunning
      6. MAHOUT-228.patch (58 kB) - Jake Mannix
      7. MAHOUT-228-3.patch (61 kB) - Ted Dunning
      8. sgd-derivation.pdf (73 kB) - Ted Dunning
      9. sgd-derivation.tex (5 kB) - Ted Dunning
      10. r.csv (0.5 kB) - Ted Dunning
      11. logP.csv (0.1 kB) - Ted Dunning
      12. sgd.csv (8 kB) - Ted Dunning


          Activity

          Ted Dunning added a comment -

          Here is an early implementation. The learning has been implemented, but not tested. Most other aspects are reasonably well tested.

          Ted Dunning added a comment -

          Here is the actual patch file.

          Ted Dunning added a comment -


          This implementation is purely logistic regression. Changing to other supervised learning algorithms shouldn't be difficult, and I have made the regularization pluggable, but I would just as soon get this working as-is before adding too much generality. In particular, I have relied heavily on the presumption that I can do sparse updates and lazy regularization. I don't know how well that carries over to other problems.
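
          To make that concrete, here is a rough sketch of what I mean by lazy L1 regularization with sparse updates. The names and structure are invented for illustration and are not the code in the attached patch.

          // Rough, self-contained sketch of lazy L1 regularization with sparse SGD updates.
          // Illustrative only; the real patch organizes this quite differently.
          import java.util.Map;

          public class LazyL1Sketch {
            private final double[] beta;          // coefficients
            private final int[] lastUpdated;      // step at which each coefficient was last regularized
            private final double lambda;          // per-step L1 penalty
            private final double learningRate;
            private int step = 0;

            public LazyL1Sketch(int numFeatures, double lambda, double learningRate) {
              this.beta = new double[numFeatures];
              this.lastUpdated = new int[numFeatures];
              this.lambda = lambda;
              this.learningRate = learningRate;
            }

            /** Apply the regularization that coefficient i has "missed" since it was last touched. */
            private void regularize(int i) {
              int missed = step - lastUpdated[i];
              if (missed > 0) {
                double shrink = missed * learningRate * lambda;
                // soft-threshold toward zero; this is what makes L1 sparsity-inducing
                beta[i] = Math.signum(beta[i]) * Math.max(0, Math.abs(beta[i]) - shrink);
                lastUpdated[i] = step;
              }
            }

            /** One online update for a sparse example given as featureIndex -> value, with label y in {0, 1}. */
            public void train(Map<Integer, Double> x, int y) {
              double dot = 0;
              for (Map.Entry<Integer, Double> e : x.entrySet()) {
                regularize(e.getKey());                  // catch up only the features this example touches
                dot += beta[e.getKey()] * e.getValue();
              }
              double p = 1.0 / (1.0 + Math.exp(-dot));
              for (Map.Entry<Integer, Double> e : x.entrySet()) {
                beta[e.getKey()] += learningRate * (y - p) * e.getValue();
              }
              step++;                                    // untouched features accumulate pending shrinkage
            }
          }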

          Jake Mannix added a comment -

          Ted, how do we get google-guava for this? Maven doesn't find it anywhere... I can download it myself to try it out for now, I suppose.

          Ted Dunning added a comment -

          Ted, how do we get google-guava for this? Maven doesn't find it anywhere... I can download it myself to try it out for now, I suppose.

          Hmm... I bet somebody published it to our company-internal repository (we use guava and collections in several systems). Then I bet it wound up in my local repository and the mahout build picked it up from there.

          Let me go back and remove the use of guava for now. It is very nice to be able to read all the lines in a resource in one line, but not that important.

          Ted Dunning added a comment -

          Updated to avoid Google's Guava libraries.

          Ted Dunning added a comment -

          I have been doing some testing on the training algorithm and there seems to be a glitch in it. The problem is that the prior gradient is strong enough that for any lambda larger than a really small value, the regularization zeros out all of the coefficients on every iteration. Not good.

          I will attach some sample data that I have been using for these experiments. The reference for these experiments was an optimization I did in R, where I explicitly optimized a simple example and got very plausible results.

          For the R example, I used the following definition of the function to optimize:

          f <- function(beta) {
              p = w(rowSums(x %*% matrix(beta, ncol=1)));
              r1 = -sum(y*log(p+(p==0))+(1-y)*log(1-p+(p==1))); 
              r2=lambda*sum(abs(beta)); 
              (r1+r2)
          }
          
          w <- function(x) {
              return(1/(1+exp(-x)))
          }
          

          Here beta is the coefficient vector, lambda sets the amount of regularization, x holds the input vectors (one observation per row), y holds the known categories for the rows of x, f is the combined log likelihood (r1) and log prior (r2), and w is the logistic function. I used an unsimplified form of the overall logistic likelihood for simplicity. Normally a simpler form, -sum(y - p), is used, but I wanted to keep things straightforward.

          The attached file sgd.csv contains the value of x. The value of y is simply 30 0's followed by 30 1's.

          Optimization was done using this:

          lambda <- 0.1
          beta.01 <- optim(beta,f, method="CG", control=list(maxit=10000))
          lambda <- 1
          beta.1 <- optim(beta,f, method="CG", control=list(maxit=10000))
          lambda <- 10
          beta.10 <- optim(beta,f, method="CG", control=list(maxit=10000))
          

          The values obtained for beta are contained in the file r.csv, and the log-MAP likelihoods are in logP.csv.

          I will shortly add a patch that has my initial test in it. This patch will contain these test data files. I will be working on this problem off and on over the next few days, but any hints that anybody has are welcome. My expectation is that there is a silly oversight in my Java code.

          Ted Dunning added a comment -

          Here are the derivations of the formulae used.

          Ted Dunning added a comment -

          Here is the patch with test files and a description of the derivation of the formulae.

          Ted Dunning added a comment -

          The original code was very nearly correct, as it turns out. The problem is that lambda in the batch learning is used to weight the prior against all of the training examples. In the on-line algorithm the prior gradient is applied for each update.

          In the example I used, this caused an effective increase in the value of lambda by a factor of 60 (the number of training examples).
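
          To spell that out roughly: with N training examples, applying the prior gradient on every one of the N per-example updates acts like a batch penalty of about N * lambda per pass, so to match a batch setting the on-line value should be scaled down accordingly:

              lambda_online ≈ lambda_batch / N
              e.g. N = 60, lambda_batch = 6  ->  lambda_online ≈ 6 / 60 = 0.1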

          After adjusting the value of lambda, I get values from the on-line algorithm very similar to those obtained by the batch algorithm (after lots of iterations).

          I will post a new patch shortly for review.

          Steve Umfleet added a comment -

          Hi Ted. Watching your progress on SGD was instructive. Thanks for the "template" of how to submit and proceed with an issue.

          At what point in the process are decisions about packages resolved? For example, MurmurHash at first glance, and based on its own documentation, seems like it might be broadly useful outside of org.apache.mahout.classifier.

          Ted Dunning added a comment -

          This is the time. The MurmurHash and Randomizer classes both seem ripe for promotion to other packages.

          What I will do is file some additional JIRAs that include just those classes (one JIRA for Murmur, one for Randomizer/Vectorizer). Those patches will probably make it in before this one does because they are simpler. At that point, I will rework the patch on this JIRA to not include those classes.

          Where would you recommend these others go?

          Jake Mannix added a comment -

          Where would you recommend these others go?

          Somewhere in the math module, package name, I don't know.

          Robin Anil added a comment -

          I say let the hash functions be in math.

          The text Randomizers can go in util.vectors.

          vectors.lucene, vectors.arff, etc. are there currently. Or do we move all of these to core along with the Randomizers and DictionaryBased?

          Jake Mannix added a comment -

          I think I just drove myself nearly insane: I was creating a patch for MAHOUT-206, but I had already merged in Ted's patch here, and then when trying to test-apply the patch against a fresh trunk checkout, it couldn't find these classes, so I went hunting throughout all of SVN history trying to find them, but they had "vanished". They were there just fine in my local git repo, but somehow there was no log of them anywhere, even when I started digging through older revisions on svn.apache.org... gone!

          Heh. Good side-effect: I have a patch which updates this patch. Of course, it's not useful until this is committed. What more is needed on this, Ted?

          Ted Dunning added a comment -

          We need a few things:

          • a few functions should be separated out for more general utility
          • the random vectorizer should be generalized a bit
          • we need some real-world testing. 20 newsgroups would be a good test, as would rcv1. Cloning the new svm package's tests would probably be the best short-term answer.

          I, unfortunately, won't have time to follow up for a week or two.

          As such, perhaps the best step is to commit this now. It won't break anything.

          Olivier Grisel added a comment -

          For the record: I am working on adding more tests and debugging in the following branch (kept in sync with trunk) hosted on GitHub:

          http://github.com/ogrisel/mahout/commits/MAHOUT-228

          Fixed so far:

          • convergence issues (inconsistency in the index of the 'missing' beta row)
          • make sure that L1 is sparsity-inducing by applying eager post-update regularization

          Still TODO (independently of Ted's TODOs) - might be split into specific JIRA issues:

          • test that a highly redundant dataset can lead to very sparse models with an L1 prior
          • a Hadoop driver to do parallel extraction of vector features from documents using the Randomizer classes
          • a Hadoop driver to do parallel cross-validation and confusion matrix evaluation (along with confidence intervals)
          • a Hadoop driver to perform hyperparameter grid search (lambda, priorfunc, learning rate, ...)
          • a sample Hadoop driver to categorize Wikipedia articles by country
          • profile it a bit
          Ted Dunning added a comment -

          make sure that L1 is sparsity-inducing by applying eager post-update regularization

          Are you sure that this is correct? The lazy regularization update should be applied before any coefficient is used for prediction or for update. Is eager regularization after the update necessary?

          Olivier Grisel added a comment - - edited

          Are you sure that this is correct? The lazy regularization update should be applied before any coefficient is used for prediction or for update. Is eager regularization after the update necessary?

          I made it eager only for the coefficients that have just been updated by the current train step; regularization of the remaining coefficients is still delayed until the next classify(instance) call affecting those coefficients.

          If we do not do this (or find a somewhat equivalent workaround), the coefficients are only regularized upon the classify(instance) call and hence are marked as regularized for the current step value, while at the same time the training update makes the coefficients for the current step non-null, inducing a completely dense parameter set.

          While this is not a big deal as long as beta uses a DenseMatrix representation, it prevents us from actually measuring the real impact of the lambda value by measuring the sparsity of the parameters. On problems leading to very sparse models, using a SparseRowMatrix of some kind may be decisive performance-wise, and in that case the sparsity-inducing ability of L1 should be ensured.

          Maybe lazy regularization could also be implemented in a simpler / more readable way by doing full regularization of beta every "regularizationSkip" training steps (IIRC, this is what Leon Bottou's SvmSgd2 does, but it adds yet another hyperparameter to fiddle with).

          There might also be a way to mostly keep the lazy regularization as it is and rethink the updateSteps update to avoid breaking the sparsity of L1. Maybe this is just a matter of moving the step++; call after the classify(instance); call. I don't remember if I tried that in the first place...

          Olivier Grisel added a comment -

          Indeed, just moving the step++ call after the update makes the sparsification work as expected while keeping the code natural (no forceOne flag hack).
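
          For clarity, here is roughly what that ordering change looks like; the helper methods below are placeholders rather than the actual implementation.

          // Sketch of the step-counter ordering discussed above (hypothetical names, not the patch code).
          public class StepOrderingSketch {
            private int step = 0;

            // placeholder standing in for the real classify(), which catches the touched
            // coefficients up to 'step' (lazy regularization) before predicting
            private double classify(double[] instance) {
              return 0.5;
            }

            // placeholder standing in for the real sparse gradient update of the touched coefficients
            private void gradientUpdate(double[] instance, double p) {
            }

            /** Old ordering: step advances first, so the coefficients touched by the update are
                already marked as regularized for the new step and the L1 shrinkage skips them. */
            public void trainOld(double[] instance) {
              step++;
              gradientUpdate(instance, classify(instance));
            }

            /** New ordering: step advances only after the update, so the next classify() call still
                applies the pending shrinkage to the freshly updated coefficients. */
            public void trainNew(double[] instance) {
              gradientUpdate(instance, classify(instance));
              step++;
            }
          }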

          Robin Anil added a comment -

          Hi Ted, is there a new patch with the separated randomizer?

          I see lots of code checked in on Olivier's git branch. Can you update the same as a patch here?

          Jake Mannix added a comment -

          Pushing out to 0.4 based on Olivier's comments on mahout-dev.

          Jake Mannix added a comment -

          bump

          I think this is now the third time I've brought this patch up to date. It compiles, but the internal tests don't pass. Not sure why, as I haven't dug into them too deeply.

          Ted, or anyone else with a desire to get Vowpal-Wabbit-style awesomeness in Mahout, want to take this patch for a spin and see what is up with it?

          Or if you, Ted, don't have time to finish it yourself, could you at least check this patch out, and document a little about what the rest of us need to do to get this up running (and verified as working)?

          Ted Dunning added a comment -

          Or if you, Ted, don't have time to finish it yourself, could you at least check this patch out, and document a little about what the rest of us need to do to get this up running (and verified as working)?

          That only sounds fair given what you have done so far.

          Let me dig in tomorrow.

          Jake Mannix added a comment -

          Excellent. The only thing I did to make it compile was update SparseVector to RandomAccessSparseVector and replace the old Functions.exp with the merged Colt/Mahout Functions.exp.

          So it should basically be the way you left it. Not sure why the TermRandomizerTest doesn't pass.

          Ted Dunning added a comment -

          Updated patch

          Ted Dunning added a comment -

          Now has a working and almost useful version of the TrainLogistic command line.

          This command line will solve a simple example case that I am working out for the Mahout in Action book:

          java -cp<mumble> org.apache.mahout.classifier.sgd.TrainLogistic \
          --passes 100 --rate 50 --lambda 0.001 \
          --input donut.csv --features 21 --output foo \
          --target color --categories 2 \
          --predictors x y xx xy yy a b c --types n n

          I still need to

          • output the model
          • change prints into log statements
          • build the book-end TestLogistic function
          • integrate into the mahout command line driver framework and
          • build a DumpResourceData program.

          Otherwise, this is beginning to coalesce.

          Hudson added a comment -

          Integrated in Mahout-Quality #70 (See http://hudson.zones.apache.org/hudson/job/Mahout-Quality/70/)
          MAHOUT-228 - test case for recent bug

          Ted Dunning added a comment -

          Updated patch.

          This patch includes:

          • ability to run and test logistic models from the mahout command line interface
          • AUC computation
          • algorithmic improvements
          • ability to save and restore logistic regression models and input reading parameters
          • includes small sample data as resource for go/no-go testing of the compile process and quickstart with classification.

          Defects include:

          • many copyright notices missing
          • limited real-life testing
          • missing several of Olivier's improvements
          • no numerical or speed optimizations yet
          • stuff

          Near and medium-term plans include:

          • test on some more realistic data
          • throw away some defunct code
          • first commit
          • wiki page for quick-start
          • magic knob tuning for learning parameters via evolutionary algorithms

          Overall, this is getting close to useful for friendly users on non-critical data.

          Drew Farris added a comment -

          Played with this a bit tonight to see how it worked. I was able to get the donut example working fine. I had the idea of using the text in ClassifierData.DATA as test input to TrainLogistic, à la the BayesClassifierSelfTest. Attached is a patch including the simple test.

          This input has 2 columns, 'label' and 'text', which get assigned to the target and predictors arguments respectively. 'text' is processed by the TextValueEncoder.

          I had to modify TextValueEncoder to override setTraceDictionary to pass the dictionary reference to the wordEncoder.

          Once I did this I could train, but I ran into a problem producing the final output. Near line 85 in TrainLogistic the predictorWeight method is called with the original column name 'text', not the predictor names generated by TextValueEncoder. Did you have any thoughts as to the best way to modify the code so that the proper predictor names are used?

          Once that's fixed, predictorWeight will need to be modified to properly extract the weight for a predictor generated by WordValueEncoder from the lr's beta matrix. I can tell that the traceDictionary's entry points to the positions in the vector where the word's weight is stored, but I'm not sure where to go from there.

          Hudson added a comment -

          Integrated in Mahout-Quality #156 (See http://hudson.zones.apache.org/hudson/job/Mahout-Quality/156/)
          MAHOUT-228
          Ted Dunning added a comment -

          I have created a git repo with this bug's source to make it easier for folks to play with and change this code.

          Send me email at tdunning@a.o if you want committer rights to the git.

          The main URL is git@github.com:tdunning/MAHOUT-228.git

          Read-only access can be had at git://github.com/tdunning/MAHOUT-228.git

          I am new to github and don't entirely understand the permission scheme yet so let me know if you have problems.

          Ted Dunning added a comment -

          I am going to start committing this in stages. The first step will be the interfaces for classifiers in general. This will include an interface for online vector classifier learning and an interface for vector classification.

          A patch for these will come shortly, with a commit shortly after that. The intent of the patch is to simplify review.

          Ted Dunning added a comment -

          I will commit this patch pretty much right now. This is for reference.

          Hudson added a comment -

          Integrated in Mahout-Quality #195 (See https://hudson.apache.org/hudson/job/Mahout-Quality/195/)
          MAHOUT-228 Interfaces for on-line classifiers

          Ted Dunning added a comment -

          Just committed a bunch of classifier stuff.

          Hudson added a comment -

          Integrated in Mahout-Quality #198 (See https://hudson.apache.org/hudson/job/Mahout-Quality/198/)
          MAHOUT-228 - Hudson build failed, but I can't reproduce it. Changed OnlineAucTest to inject a pre-seeded PRNG. Should make test deterministic.

          Ted Dunning added a comment -

          OK.

          I think that was the final commit of the basic M-228 functionality. The results should appear in https://hudson.apache.org/hudson/job/Mahout-Quality/204/

          We now have a reasonable selection of alternatives for online vector learning. This includes versions that do online cross-validation with AUC and a version that does hyper-parameter selection and annealing using evolutionary techniques on top of the on-line cross-validation version.

          I will be adding test cases over the next few weeks, but after the dust clears here, I will be closing M-228. Jake should be proud.

          Hudson added a comment -

          Integrated in Mahout-Quality #207 (See https://hudson.apache.org/hudson/job/Mahout-Quality/207/)
          MAHOUT-228 All tests in math and core now succeed.

          Hudson added a comment -

          Integrated in Mahout-Quality #210 (See https://hudson.apache.org/hudson/job/Mahout-Quality/210/)
          MAHOUT-228 Cleans up initialization of ALR's

          Ted Dunning added a comment -

          There are small cleanups still needed, but the big AdaptiveLogisticRegression is in place.

          Hudson added a comment -

          Integrated in Mahout-Quality #222 (See https://hudson.apache.org/hudson/job/Mahout-Quality/222/)
          MAHOUT-228 - Needed configuration for command line tools


            People

            • Assignee: Ted Dunning
            • Reporter: Ted Dunning
            • Votes: 2
            • Watchers: 5
