Mahout / MAHOUT-232

Implementation of sequential SVM solver based on Pegasos

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: 0.4
    • Fix Version/s: None
    • Component/s: Classification
    • Labels: None

      Description

      After discussing with people in this community, I decided to re-implement a sequential SVM solver based on Pegasos for the Mahout platform (Mahout command-line style, SparseMatrix and SparseVector, etc.). Eventually, it will support HDFS.

      Sequential SVM based on Pegasos.
      Maxim zhao (zhaozhendong at gmail dot com)

      -------------------------------------------------------------------------------------------
      Currently, this package provides the following features:
      -------------------------------------------------------------------------------------------

      1. Sequential SVM linear solver, including training and testing.

      2. Supports both the general (local) file system and HDFS.

      3. Supports large-scale data set training.
      Because Pegasos only needs to sample a certain number of examples, this package can pre-fetch
      just that many samples (e.g., the maximum iteration count) into memory.
      For example, if the data set has 100,000,000 samples and the default maximum number of iterations
      is 10,000, the package randomly loads only 10,000 samples into memory (see the sketch after this feature list).

      4. Sequential data set testing, so the package supports large-scale data sets for both training and testing.

      5. Supports parallel classification (testing phase only) based on the MapReduce framework.

      6. Supports multi-class classification based on the MapReduce framework (fully parallelized version).

      7. Supports regression.
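
      For reference, here is a minimal sketch of the core Pegasos update loop that this package builds on. This is an illustration only, written against the org.apache.mahout.math Vector API as an assumption; it is not the patch's actual class.

          import java.util.List;
          import java.util.Random;

          import org.apache.mahout.math.RandomAccessSparseVector;
          import org.apache.mahout.math.Vector;

          /** Pegasos primal solver sketch: one random example per iteration, step size 1/(lambda*t). */
          public final class PegasosSketch {

            private PegasosSketch() { }

            public static Vector train(List<Vector> examples, List<Double> labels,
                                       double lambda, int maxIterations, int dimension) {
              Vector w = new RandomAccessSparseVector(dimension);
              Random rnd = new Random();
              for (int t = 1; t <= maxIterations; t++) {
                int i = rnd.nextInt(examples.size());      // sample one training example
                Vector x = examples.get(i);
                double y = labels.get(i);
                double eta = 1.0 / (lambda * t);           // decaying step size
                w = w.times(1.0 - eta * lambda);           // shrink by the regularizer
                if (y * w.dot(x) < 1.0) {                  // hinge loss is active for (x, y)
                  w = w.plus(x.times(eta * y));            // gradient step on the violator
                }
                double norm = Math.sqrt(w.getLengthSquared());
                double radius = 1.0 / Math.sqrt(lambda);   // optional projection onto the L2 ball
                if (norm > radius) {
                  w = w.times(radius / norm);
                }
              }
              return w;
            }
          }

      Since at most maxIterations examples are ever touched, pre-fetching only that many randomly chosen samples into memory (as feature 3 describes) is sufficient.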

      -------------------------------------------------------------------------------------------
      TODO:
      -------------------------------------------------------------------------------------------
      1. Multi-classification Probability Prediction
      2. Performance Testing

      -------------------------------------------------------------------------------------------
      Usage:
      -------------------------------------------------------------------------------------------
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Classification:
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      @@ Training: @@
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      SVMPegasosTraining.java
      The default argument is:

      -tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model

      ~~~~~~~~~~~~~~~~~~~~~~
      @ For the case that the training data set is on HDFS: @
      ~~~~~~~~~~~~~~~~~~~~~~

      1. Ensure that your training data set has been uploaded to HDFS:
      hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset

      2. Revise the arguments:
      -tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009

      ~~~~~~~~~~~~~~~~~~~~~~
      @ Multi-class Training [Based on MapReduce Framework]:@
      ~~~~~~~~~~~~~~~~~~~~~~
      bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassifierTrainDriver -if /user/maximzhao/dataset/protein -of /user/maximzhao/protein -m /user/maximzhao/proteinmodel -s 1000000 -c 3 -nor 3 -ms 923179 -mhs -Xmx1000M -ttt 1080

      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      @@ Testing: @@
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

      SVMPegasosTesting.java
      The arguments are hard-coded in this file; if you want to customize the arguments yourself, please uncomment the first line in the main function.
      The default argument is:
      -te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model

      ~~~~~~~~~~~~~~~~~~~~~~
      @ Parallel Testing (Classification): @
      ~~~~~~~~~~~~~~~~~~~~~~
      ParallelClassifierDriver.java
      bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelClassifierDriver -if /user/maximzhao/dataset/rcv1_test.binary -of /user/maximzhao/rcv.result -m /user/maximzhao/rcv1.model -nor 1 -ms 241572968 -mhs -Xmx500M -ttt 1080

      ~~~~~~~~~~~~~~~~~~~~~~
      @ Parallel multi-classification: @
      ~~~~~~~~~~~~~~~~~~~~~~
      bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassPredictionDriver -if /user/maximzhao/dataset/protein.t -of /user/maximzhao/proteinpredictionResult -m /user/maximzhao/proteinmodel -c 3 -nor 1 -ms 2226917 -mhs -Xmx1000M -ttt 1080

      Note: the parameter -ms 241572968 is obtained from the equation ms = input file size / number of mappers. For example, the rcv1_test.binary input above is 241,572,968 bytes and -nor is 1, so -ms is 241572968.

      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Regression:
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      SVMPegasosTraining.java
      -tr ../examples/src/test/resources/svmdataset/abalone_scale -m ../examples/src/test/resources/svmdataset/SVMregression.model -s 1

      -------------------------------------------------------------------------------------------
      Experimental Results:
      -------------------------------------------------------------------------------------------
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Classification:
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Data set:
      name | source | type | class | training size | testing size | feature
      -----------------------------------------------------------------------------------------------
      rcv1.binary | [DL04b] | classification | 2 | 20,242 | 677,399 | 47,236
      covtype.binary | UCI | classification | 2 | 581,012 | | 54
      a9a | UCI | classification | 2 | 32,561 | 16,281 | 123
      w8a | [JP98a] | classification | 2 | 49,749 | 14,951 | 300

      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Data set | Accuracy | Training Time | Testing Time |
      rcv1.binary | 94.67% | 19 Sec | 2 min 25 Sec |
      covtype.binary | | 19 Sec | |
      a9a | 84.72% | 14 Sec | 12 Sec |
      w8a | 89.8 % | 14 Sec | 8 Sec |

      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Parallel Classification (Testing)
      Data set | Accuracy | Training Time | Testing Time |
      rcv1.binary | 94.98% | 19 Sec | 3 min 29 Sec (one node)|

      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Parallel Multi-classification Based on MapReduce Framework:
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Data set:
      name | source | type | class | training size | testing size | feature
      -----------------------------------------------------------------------------------------------
      poker | UCI | classification | 10 | 25,010 | 1,000,000 | 10
      protein | [JYW02a] | classification | 3 | 17,766 | 6,621 | 357

      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Data set | Accuracy vs. (Libsvm with linear kernel)
      poker | 50.14 % vs. ( 49.952% ) |
      protein | 68.14% vs. ( 64.93% ) |

      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Regression:
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Data set:
      name | source | type | class | training size | testing size | feature
      -----------------------------------------------------------------------------------------------
      abalone | UCI | regression | | 4,177 | | 8
      triazines | UCI | regression | | 186 | | 60
      cadata | StatLib | regression | | 20,640 | | 8
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Data set | Mean Squared error vs. (Libsvm with linear kernel) | Training Time | Test Time |
      abalone | 6.01 vs. (5.25) | 13 Sec |
      triazines | 0.031 vs. (0.0276) | 14 Sec |
      cadata | 5.61e+10 vs. (1.40e+10) | 20 Sec |

      Attachments

      1. SVMonMahout0.5.patch (249 kB) - zhao zhendong
      2. SVMonMahout0.5.1.patch (256 kB) - zhao zhendong
      3. SVMDataset.patch (2.93 MB) - zhao zhendong
      4. SequentialSVM_0.4.patch (3.30 MB) - zhao zhendong
      5. SequentialSVM_0.3.patch (3.30 MB) - zhao zhendong
      6. SequentialSVM_0.2.2.patch (2.80 MB) - zhao zhendong
      7. SequentialSVM_0.1.patch (2.79 MB) - zhao zhendong
      8. Mahout-232-0.8.patch (258 kB) - zhao zhendong
      9. a2a.mvc (138 kB) - zhao zhendong
      10. 0004-A-script-for-svm-classification-on-20news-dataset.patch (3 kB) - Viktor Gal
      11. 0003-Change-MultiClassifierDrivers-type-to-AbstractJob.patch (34 kB) - Viktor Gal
      12. 0002-Renamed-HADOOP_MODLE_PATH-to-HADOOP_MODEL_PATH.patch (6 kB) - Viktor Gal
      13. 0001-Rename-DatastoreSequenenceFile-class.patch (16 kB) - Viktor Gal


          Activity

          Reza Fathzadeh added a comment - edited

          I thought some of you might be interested in Snabler, a project to implement Parallel ML for M/R: https://github.com/atbrox/Snabler

          Ted Dunning added a comment -

          The ASF is applying. I haven't heard anything from our committers, but I
          would imagine that somebody would be interested in mentoring 1-4 students.
          More news anon.

          Viktor Gal added a comment -

          ok, i've done a little bit of research on the current state of parallelized SVMs in the journals and i've found the following two interesting papers:

          1. A Distributed SVM for Image Annotation (http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5569084). From the paper:
          "We have implemented distributed SMO using Hadoop
          and the WEKA package [12]. The basic idea of our distributed
          implementation is similar to Kun et al [11] idea. Caching
          scheme is used to error cache the current predicted output and
          update each successful step also Kernel cache is used for
          Kernel value between 2 points which is used during error
          cache updates. The algorithm partitioning the entire training
          data set into smaller subsets m partition and allocating each of
          the partitioned is allocated to a single Map task; number of
          Map tasks is equal to the number data partitions. Each Map
          task optimizes a partition in parallel. The output of each Map
          task is the alpha array for local partition and the value of b.
          Reducer simply joins the partial alpha arrays to produce the
          global alpha array. The reducer has to deal with the value of b
          because this value is different for each partition therefore the
          reducer takes the average value of b for all partitions to obtain
          global b."

          unfortunately their code is not available anywhere, which is weird as they claim to use open-source tools to implement it. Anyhow the paper they are referring to (Kun et al) has some comparison results of various SVMs, see it here: http://www.libsou.com/pdf/01650257.pdf

          2. The other possibility is PSVM: http://code.google.com/p/psvm/
          it's fully implemented using MPI.

          i'm still trying to make up my mind which direction i should take. anyhow i think i'll open a new issue for this improvement. any plans to apply for GSoC this year? could this maybe be a GSoC project?

          Ted Dunning added a comment -

          Viktor,

          I am really sorry if I was demotivating. Please let me withdraw that.

          Right now, you are the one with the ball on this. If you think it needs to be re-implemented in a cleaner fashion, then take the current code as a learning experience and move forward with what you think is necessary. Sean's comments are just right. You are the one making progress and that puts you in an important position.

          Go ahead and file a new JIRA while we decide whether to close this one. Take bits of the current code if you like or the approach or nothing as you see fit.

          Sean Owen added a comment -

          There seems to be little support to commit this patch, in original or cleaned-up form, into Mahout now. Ted, Robin, Viktor, and Zhao (in absentia) don't seem to be pushing for it. So let's perhaps not try to cram this in?

          Instead, I see that Viktor is doing some great work to understand and get this functionality into Mahout the "right way" as he sees it. He would be more keen to try starting from scratch.

          I personally think that's fine. Suggestion: close this one, and open a new issue as and when the brand-new take on it is ready? I'd rather let Viktor run with it as he likes than try to maintain some old code that's not even in the project.

          Viktor Gal added a comment - edited

          Patches against Mahout-232-0.8.patch to make it work at all.

          There's still heaps to change in the original patch, as there are soooo many things hardwired into it.
          These little patches just fix a small part of the whole patch, i.e. enough to run multi-label classification.

          in my opinion the whole code needs refactoring, as i really don't understand or agree with some of the design decisions that have been made while writing this patch.

          of course your input for my patches would be really valuable.

          although, i must be honest that the more i try to fix up this patch, the more i think it should be rewritten from scratch. and Ted's comment was not so 'motivating' either, i.e. i'm trying to fix up this patch, but for obvious reasons it might not ever end up in the trunk.

          Ted Dunning added a comment -

          @Viktor,

          Sorry, didn't mean to sow confusion.

          I just didn't want you to sign up for parallelizing currently sequential algorithms. Did we ever see results from the multi-classifier? Is there lift in training? Or is there simply speedup in evaluation?

          Robin Anil added a comment -

          Since this is a major piece of code, I would consider having it checked into some experimental/contrib folder so that further work can be done on it. I remember an earlier conversation about keeping first-class-citizen algorithms surfaced via the mahout script, and the "almost there" work somewhere in the trunk. Something similar to what Lucene does with contrib.

          Viktor Gal added a comment -

          @Ted the multi-classification is 'parallel'... not even that? shame as i've spent some time already to fix up some stuff with the patch and get it into a working shape.

          Ted Dunning added a comment -

          This is a sequential implementation. Putting it on Mahout is not of interest without major algorithmic changes.

          Sean Owen added a comment -

          Yes, clearly the patch is out of date with respect to the code base, and/or has errors (like the class name misspelling there). Part of the work is to fix that up.

          Viktor Gal added a comment -

          the stack trace:

          -------------------------------------------------------------------------------
          Test set: org.apache.mahout.classifier.svm.datastore.DataSetHandlerTest
          -------------------------------------------------------------------------------
          Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.052 sec <<< FAILURE!
          testGetDatafromSequenceFile(org.apache.mahout.classifier.svm.datastore.DataSetHandlerTest) Time elapsed: 0.046 sec <<< ERROR!
          java.lang.NullPointerException
          at org.apache.mahout.classifier.svm.datastore.DatastoreSequenenceFile.loadLabels(DatastoreSequenenceFile.java:179)
          at org.apache.mahout.classifier.svm.datastore.DatastoreSequenenceFile.getDatafromSequenceFile(DatastoreSequenenceFile.java:91)
          at org.apache.mahout.classifier.svm.datastore.DatastoreSequenenceFile.getData(DatastoreSequenenceFile.java:54)
          at org.apache.mahout.classifier.svm.datastore.DataStoreFactory.getDataStore(DataStoreFactory.java:31)
          at org.apache.mahout.classifier.svm.datastore.DataSetHandlerTest.testGetDatafromSequenceFile(DataSetHandlerTest.java:18)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
          at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
          at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
          at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
          at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
          at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
          at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
          at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
          at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
          at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
          at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
          at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
          at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:59)
          at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:115)
          at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:102)
          at org.apache.maven.surefire.Surefire.run(Surefire.java:180)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:350)
          at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1021)

          Robin Anil added a comment -

          Let me list some tasks that may be necessary to whip the patch into a committable state:

          • There were some spelling mistakes in the function/method names; maybe you can fix any apparent ones when you are reading through the code.
          • Try to make the code run within a Hadoop job; I am sure there are some fixes necessary for that.
          • Remove any hardcoded paths or ports that you see.
          • Finally, get an end-to-end example running, maybe using 20newsgroups.

          Ted Dunning added a comment -

          Can you post details on this error? (stack trace and such)

          Viktor Gal added a comment -

          @Ted well, currently apart from using linear SVM for classification, i'm doing l1-norm minimization tasks as well and afaik SGD could work there for me, but i'll have to look into it.

          Viktor Gal added a comment -

          alrighty, then i'll start by looking up deprecated function calls, but yeah, when i tried to compile mahout with the patch i got an error:

          Tests in error:
          testGetDatafromSequenceFile(org.apache.mahout.classifier.svm.datastore.DataSetHandlerTest)

          so i suppose this needs fixing as well.

          anyhow if anybody has more insights on this task please share!

          Ted Dunning added a comment -

          Great to see somebody picking this up.

          The place where I expect to see problems is any use of the legacy math classes. Look for deprecations.

          Also, can you check to see if your use case is better served by SGD models rather than SVMs? That question has come up several times and I can only speak for the SGD side of the house.

          Viktor Gal added a comment -

          although i'm just at the stage of applying and compiling the last patch (Mahout-232-0.8.patch) against the HEAD of the svn trunk, i was wondering what sort of fixes are still needed to merge the patch into svn?

          I'll look into the patch myself, but if Zhao or anybody else can give me--just--pointers on what needs fixing in the patch, i'd really appreciate it, as the work could be done faster.

          Sean Owen added a comment -

          I think this is at best a "Later" now – gone stale and the patch is almost surely not still relevant. I'd love it if Zhao can revive this (or anyone else).

          Ted Dunning added a comment -

          Sounds like this can't move before we release. I moved it to 0.5 as a result.

          zhao zhendong added a comment -

          Still some work to do. I don't know whether I will be free this month;
          it seems I'm too busy with my new job.



          Sean Owen added a comment -

          Has this gone stale? not sure of the status.

          zhao zhendong added a comment -

          New updated revision:
          1) Tracks the current SVN trunk Vector version.

          2) Supports sequence file input.

          Note: Please download a2a.mvc and save it to ~/examples/src/test/resources/svmdataset/ after applying the new patch. Then you can build the Mahout project; otherwise, it will fail when Maven tests the SVM package.

          zhao zhendong added a comment -

          MapReduce/MapReduceUtil.java
          should have been mapreduce/MapReduceUtil.java
          the folders are NOT in camel case. I still see camel casing everywhere.
          >> Done. Changed MapReduce -> mapreduce, ParallelAlgorithms -> parallelalgorithms and SequentialAlgorithms -> sequentialalgorithms.

          + public static final String DEFAULT_HDFS_SERVER = "hdfs://localhost:12009";
          + // For HBASE
          + public static final String DEFAULT_HBASE_SERVER = "localhost:60000";
          These are read from the hadoop conf and hbase configuration files. Mahout shouldn't be doing any sort of configuration internally.
          >> Hard-coded Hadoop and HBase configuration has been removed. The default HDFS and HBase settings in SVMParameters are only runtime defaults for MapReduce applications.

          No System.out.println; use the Logger log instead.
          >> Done.

          HDFSConfig.java, HDFSReader.java - do away with any hdfs configuration in the code. As I said, opening a FileSystem using the Configuration object would in turn decide between local fs or hdfs based on the execution context.
          >> Yes, the sequential algorithms use the principle you mentioned: they choose the file system according to whether the "hdfs" parameter is given in the training and prediction procedures. HDFSReader only serves the sequential algorithms, not the parallel algorithms based on the Map/Reduce framework.

          Robin Anil added a comment -

          Name change suggestion
          ParallelClassifierJobber => ParallelClassifierDriver | ParallelClassifierJob

          Robin Anil added a comment -

          Patch is looking great. A couple more comments:

          MapReduce/MapReduceUtil.java
          should have been mapreduce/MapReduceUtil.java
          the folders are NOT in camel case. I still see camel casing everywhere.

          +  public static final String DEFAULT_HDFS_SERVER = "hdfs://localhost:12009";
          +  // For HBASE
          +  public static final String DEFAULT_HBASE_SERVER = "localhost:60000";
          

          These are read from the hadoop conf and hbase configuration files. Mahout shouldn't be doing any sort of configuration internally.

          No System.out.println; use the Logger log instead.

          HDFSConfig.java, HDFSReader.java - do away with any hdfs configuration in the code. As I said, opening a FileSystem using the Configuration object would in turn decide between local fs or hdfs based on the execution context.

          I haven't looked into the code much, but otherwise it looks OK, except for the changes above.
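
          For illustration, here is a minimal sketch of the FileSystem pattern described above. This is standard Hadoop API usage written as an assumption for this issue; it is not code from the patch:

              import java.io.IOException;

              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.fs.FSDataInputStream;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;

              public final class FsOpenExample {

                private FsOpenExample() { }

                public static FSDataInputStream open(String file) throws IOException {
                  Configuration conf = new Configuration();   // picks up core-site.xml etc. from the classpath
                  Path path = new Path(file);
                  FileSystem fs = path.getFileSystem(conf);   // resolves to local fs or HDFS by execution context
                  return fs.open(path);                       // no hardcoded namenode host or port
                }
              }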

          zhao zhendong added a comment -

          Try using Mahout collections OpenIntDoubleHashMap etc. I have seen super memory savings using them as compared to java collections. WeightVector memory footprint would halve.
          >> Changed to Mahout collections.

          Package names are not camel case, I saw import org.apache.mahout.classifier.svm.MapReduce.Testing.TestRawKeyValueIterator; should have been org.apache.mahout.classifier.svm.mapreduce.TestRawKeyValueIterator in the test directory not main
          >> Done. Removed all useless test files and main functions.

          No author tags. See any class in Mahout for reference
          >> Done. Removed all author tags. NetBeans adds them by default; I have switched to Eclipse.

          Your test classes could be re-used, we already have a dummy output collector and Dummy status reporter in common. How about moving testing classes there or reusing them. Feel free to modify them or add functionality.
          >> Still working on it.

          Organize imports. Are you using the Mahout (Lucene-based) code formatter? It's here: https://issues.apache.org/jira/browse/MAHOUT-233
          Is there a need for a parameter parser? Check out common.parameter.*; you could reuse the parameter classes there. See KMeansMapper for usage.
          >> Done. I checked the code style using the Lucene checkstyle package.

          In the HDFS writer I see hardcoded paths ("/user/maximzhao/test.t"). Should make them configurable.
          >> Done. Removed it.

          I don't think using the HDFSWriter class is the best way of writing to HDFS. The FileSystem object would select the appropriate filesystem based on the Hadoop Configuration. This enforces that your classes read and write to HDFS via the namenode, making the code unusable for local execution. Plus, this really shouldn't be used when running a Map/Reduce job; the underlying FileSystem object is already pointing to HDFS. Creating socket connections is not a good thing when Map/Reducing.
          >> Done. Removed the HDFS writer. The Map/Reduce jobs only use the normal way for input and output.

          LibSVMFormatParser could be moved to the utils package, not core. Like the ARFF format reader, we can have a libsvm format reader.
          >> Still working on it. Currently, LibSVMFormatParser is still in this package.

          Move readme to Package.html so that javadoc generates the package summary.
          >> Done.

          Also, if you can, separate out the dataset from the patch and upload two separate files. I think others might have issues (read: legal) with including Reuters data in the Mahout trunk.
          >> Done. Changed to two distinct patches.

          zhao zhendong added a comment -

          Hi Sean,

          For Mahout-232, I expect to finish the code-style checking *by the end of this
          week (revised based on Robin's comments)*.

          I don't know whether it can be pushed into 0.3, but I just want you guys to
          know the progress of this issue.

          Cheers,
          Zhendong



          Sean Owen added a comment -

          This is evidently linked to MAHOUT-227 and so pushes to 0.4 too, I assume. No need to rush this.

          Robin Anil added a comment -

          Some Comments

          • Try using Mahout collections, OpenIntDoubleHashMap etc. I have seen super memory savings using them as compared to Java collections. WeightVector's memory footprint would halve (see the sketch after this list).
          • Package names are not camel case, I saw import org.apache.mahout.classifier.svm.MapReduce.Testing.TestRawKeyValueIterator; should have been org.apache.mahout.classifier.svm.mapreduce.TestRawKeyValueIterator in the test directory not main
          • Move all test classes to test directory
          • No author tags. See any class in Mahout for reference
          • Your test classes could be re-used, we already have a dummy output collector and Dummy status reporter in common. How about moving testing classes there or reusing them. Feel free to modify them or add functionality.
          • Organize imports. Are you using the Mahout (Lucene-based) code formatter? It's here: https://issues.apache.org/jira/browse/MAHOUT-233
          • Is there a need for a parameter parser? Check out common.parameter.*; you could reuse the parameter classes there. See KMeansMapper for usage.
          • In the HDFS writer I see hardcoded paths ("/user/maximzhao/test.t"). Should make them configurable.
          • I don't think using the HDFSWriter class is the best way of writing to HDFS. The FileSystem object would select the appropriate filesystem based on the Hadoop Configuration. This enforces that your classes read and write to HDFS via the namenode, making the code unusable for local execution. Plus, this really shouldn't be used when running a Map/Reduce job; the underlying FileSystem object is already pointing to HDFS. Creating socket connections is not a good thing when Map/Reducing.
          • LibSVMFormatParser could be moved to the utils package, not core. Like the ARFF format reader, we can have a libsvm format reader.
          • Move readme to Package.html so that javadoc generates the package summary.
          • Also, if you can, separate out the dataset from the patch and upload two separate files. I think others might have issues (read: legal) with including Reuters data in the Mahout trunk.
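
          As an illustration of the first point, here is a minimal sketch of a sparse weight store backed by OpenIntDoubleHashMap. It assumes the Mahout collections class of that name in org.apache.mahout.math.map and is not code from the patch:

              import org.apache.mahout.math.map.OpenIntDoubleHashMap;

              /** Sparse weights on a primitive-keyed map: no Integer/Double boxing, far smaller than a HashMap. */
              public final class SparseWeights {

                private final OpenIntDoubleHashMap weights = new OpenIntDoubleHashMap();

                public void add(int feature, double delta) {
                  weights.put(feature, weights.get(feature) + delta);  // get() returns 0.0 for absent keys
                }

                public double weight(int feature) {
                  return weights.get(feature);
                }
              }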
          zhao zhendong added a comment -

          I have changed the class directory of parallel algorithms.

          zhao zhendong added a comment -

          1) Supports sequential multi-classification (both one-vs.-one and one-vs.-others approaches).

          2) Refactoring and code cleaning.

          3) Switched to SequentialAccessSparseVector and RandomAccessSparseVector.

          zhao zhendong added a comment -

          Hi Ted,

          Got it.

          By the way, you may call me Zhendong for short, Zhao is my family name.

          Cheers,
          Zhendong



          Ted Dunning added a comment -

          zhaozhendong,

          Nice results so far.

          I would recommend not editing the original description, but adding new data as separate comments. Otherwise, it is difficult for readers like me to understand what is new and what has changed.

          zhao zhendong added a comment -

          Thanks for Ted's comments; I will revise the code.



          Ted Dunning added a comment -

          I had only a few minutes just now to look at this code and have a few stylistic comments:

          • don't use author tags, do use the Lucene standard indentation
          • don't use abbreviations ESPECIALLY if you are not a native speaker. For example: examPerIter should probably be examplesPerIteration, but the abbreviation as given could mean examinationsPerIteration.
          • test code should be in unit tests. Examples are OK as main methods, but if you want to test something now, it should probably be a unit test so that it stays true.
          zhao zhendong added a comment -

          Oops, I forgot to add these files to SVN.



          Ted Dunning added a comment -

          The 0.1 patch compiles for me, but the 0.2 patch produces this problem:

          /Users/tdunning/Apache/mahout-trunk/core/src/main/java/org/apache/mahout/classifier/svm/DataSetHandler.java:[195,8] cannot find symbol
          symbol  : variable HDFSConfig
          location: class org.apache.mahout.classifier.svm.DataSetHandler
          
          /Users/tdunning/Apache/mahout-trunk/core/src/main/java/org/apache/mahout/classifier/svm/DataSetHandler.java:[244,8] cannot find symbol
          symbol  : variable HDFSConfig
          location: class org.apache.mahout.classifier.svm.DataSetHandler
          

          It seems that something has been dropped from the patch.

          Jake Mannix added a comment -

          Use the constructor which allows you to specify both the initial size and cardinality:

          new SparseVector(Integer.MAX_VALUE, 10);
          

This, for example, gives an "infinite"-dimensional vector with an initial map capacity of 10.
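
For readers unfamiliar with this constructor, here is a minimal usage sketch. The package name follows the mahout.math naming used elsewhere in this issue; treat it as an assumption and verify against your checkout.

    // Sketch only: an "infinite"-dimensional sparse vector with a small
    // initial map, per Jake's suggestion above.
    import org.apache.mahout.math.SparseVector;

    public class SparseVectorSketch {
      public static void main(String[] args) {
        SparseVector v = new SparseVector(Integer.MAX_VALUE, 10);
        v.set(5, 1.5);          // indices may be anywhere below Integer.MAX_VALUE
        v.set(1000000, -2.0);   // only non-zero entries consume memory
        System.out.println(v.get(1000000) + ", "
            + v.getNumNondefaultElements() + " stored entries");
      }
    }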

zhao zhendong added a comment (edited) -

Great idea, but I checked the code, and the actually allocated size is 1/8 of the cardinality. Is that correct?

If so, it will occupy a lot of memory, which is unnecessary for small data sets.


          Jake Mannix added a comment -

"I must assign the cardinality of matrix while create them."

You can just construct your matrices and vectors with all dimensions set to Integer.MAX_VALUE as a workaround. At some point we will add the ability to have unbounded or undetermined-dimension vector spaces, but this works in practice.
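
A sketch of that workaround follows. The SparseMatrix(int[] cardinality) constructor and the setQuick/getQuick calls are assumptions based on the Mahout matrix API of this period; check the actual signatures before relying on it.

    import org.apache.mahout.math.SparseMatrix;

    public class UnboundedMatrixSketch {
      public static void main(String[] args) {
        // Assumed constructor: SparseMatrix(int[] cardinality) -- verify locally.
        SparseMatrix data =
            new SparseMatrix(new int[] {Integer.MAX_VALUE, Integer.MAX_VALUE});
        data.setQuick(0, 42, 3.14);               // only stored entries cost memory
        System.out.println(data.getQuick(0, 42)); // prints 3.14
      }
    }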

          Jake Mannix added a comment -

          Just dropping a note to say that I've tried out this patch, getting output as:

          281.17521793362766 = Norm of solution
          0.0577053057402274 = avg Loss of solution
          0.006500000000000002 = avg zero-one error of solution
          0.19829291470704108 = primal objective of solution
          0.17299428453573593 = avg Loss over test
          0.025 = avg zero-one error over test
          97.5% = Testing Accuracy

Looks great to me so far (although I haven't dug deeply into the code yet). Awesome work!

          zhao zhendong added a comment -

          Sequential SVM based on Pegasos.
          -------------------------------------------------------------------------------------------
          Currently, this package provides (Features):
          -------------------------------------------------------------------------------------------

1. Sequential SVM linear solver, including training and testing.

2. It supports the general file system right now; HDFS support is near-future work.

3. Supports large-scale data sets (you need to set the "trainSampleNum" argument).
Because Pegasos only needs to sample a certain number of examples, this package can
pre-fetch a bounded number of samples (e.g., the maximum iteration count) into memory.
For example, if the data set has 100,000,000 samples and the default maximum number of
iterations is 10,000, the package randomly loads only 10,000 samples into memory
(see the sketch after this list).
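
To make the pre-fetch idea concrete, here is a minimal, self-contained sketch in plain Java. It is illustrative only, not the patch's code; all names are hypothetical, and examples are reduced to dense double[] features for brevity. It reservoir-samples up to maxIterations lines from an arbitrarily large file, and shows the Pegasos update applied to one sampled example.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class PegasosSketch {
      // Reservoir-sample up to k lines so a 100,000,000-line file never
      // has to fit in memory (hypothetical helper, not from the patch).
      static List<String> sampleLines(String path, int k, Random rnd)
          throws IOException {
        List<String> reservoir = new ArrayList<String>(k);
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        long seen = 0;
        while ((line = in.readLine()) != null) {
          seen++;
          if (reservoir.size() < k) {
            reservoir.add(line);
          } else {
            long j = (long) (rnd.nextDouble() * seen); // uniform in [0, seen)
            if (j < k) {
              reservoir.set((int) j, line);
            }
          }
        }
        in.close();
        return reservoir;
      }

      // One Pegasos step on example (x, y), y in {-1, +1}:
      //   eta_t = 1 / (lambda * t); shrink w by (1 - eta*lambda), and add
      //   eta * y * x when the margin y * (w . x) is below 1.
      static void pegasosStep(double[] w, double[] x, double y,
                              double lambda, int t) {
        double eta = 1.0 / (lambda * t);
        double dot = 0;
        for (int i = 0; i < w.length; i++) {
          dot += w[i] * x[i];
        }
        double shrink = 1.0 - eta * lambda;
        boolean marginViolated = y * dot < 1;
        for (int i = 0; i < w.length; i++) {
          w[i] *= shrink;
          if (marginViolated) {
            w[i] += eta * y * x[i];
          }
        }
      }
    }

Training then draws one sampled example per step t = 1..maxIterations and calls pegasosStep; the full file never has to fit in memory.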

          -------------------------------------------------------------------------------------------
          TODO:
          -------------------------------------------------------------------------------------------
1. Support for HDFS.

2. Because this code adopts mahout.math.SparseMatrix and mahout.math.SparseVectorUnsafe,
I must assign the cardinality of a matrix when creating it. That makes it hard to read
data sets in the SVM-light or libsvm formats, which are very popular in the machine-learning
community, because such data sets do not store the number of samples or the dimensionality.
Currently I use a crude method: read the data into a map<> first, then dump it into the
SparseMatrix. Does anyone know a smarter method, or another matrix type, that supports this
operation? (A sketch of a two-pass alternative follows.)
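
One common alternative to buffering everything in a map<> is a two-pass read. This is a sketch under assumptions, not the patch's method: the first pass scans only for the row count and the largest feature index, which together give the cardinality needed to pre-size the SparseMatrix; a second pass then parses each "label index:value index:value ..." line straight into the pre-sized matrix. The first pass might look like this (hypothetical class name, plain Java I/O):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class LibsvmDimensionScan {
      // Pass 1: count rows and find the largest 1-based feature index in an
      // SVM-light/libsvm file, so a SparseMatrix can be sized before loading.
      public static int[] scan(String path) throws IOException {
        int rows = 0;
        int maxIndex = 0;
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        while ((line = in.readLine()) != null) {
          if (line.trim().isEmpty()) {
            continue;
          }
          rows++;
          String[] tokens = line.trim().split("\\s+");
          // tokens[0] is the label; the rest are "index:value" pairs
          for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            int index = Integer.parseInt(tokens[i].substring(0, colon));
            if (index > maxIndex) {
              maxIndex = index;
            }
          }
        }
        in.close();
        return new int[] {rows, maxIndex};
      }
    }

The cost is reading the file twice, but no intermediate map<> is needed, and memory stays proportional to the non-zero entries actually stored.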

          -------------------------------------------------------------------------------------------
          Usage:
          -------------------------------------------------------------------------------------------
          Training:
          SVMPegasosTraining.java
I have hard-coded the arguments in this file; if you want to customize the arguments yourself, please uncomment the first line of the main function.
          The default argument is:
          -tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model

          Testing:
          SVMPegasosTesting.java
I have hard-coded the arguments in this file; if you want to customize the arguments yourself, please uncomment the first line of the main function.
          The default argument is:
          -te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model

zhao zhendong added a comment (edited) -

I am still working on it. I can probably attach a patch tomorrow or the day after.

          I will check the code of MAHOUT-228.


          Ted Dunning added a comment -

          Can you post a patch containing your code so far?

          How does your implementation relate to MAHOUT-228? Is there potential for shared code?


People

• Assignee: Ted Dunning
• Reporter: zhao zhendong
• Votes: 1
• Watchers: 10
