Uploaded image for project: 'Mahout'
  1. Mahout
  2. MAHOUT-232

Implementation of sequential SVM solver based on Pegasos

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Closed
    • Major
    • Resolution: Later
    • 0.4
    • None
    • None
    • None

    Description

      After discussed with guys in this community, I decided to re-implement a Sequential SVM solver based on Pegasos for Mahout platform (mahout command line style, SparseMatrix and SparseVector etc.) , Eventually, it will support HDFS.

      Sequential SVM based on Pegasos.
      Maxim zhao (zhaozhendong at gmail dot com)

      -------------------------------------------------------------------------------------------
      Currently, this package provides (Features):
      -------------------------------------------------------------------------------------------

      1. Sequential SVM linear solver, include training and testing.

      2. Support general file system and HDFS right now.

      3. Supporting large-scale data set training.
      Because of the Pegasos only need to sample certain samples, this package supports to pre-fetch
      the certain size (e.g. max iteration) of samples to memory.
      For example: if the size of data set has 100,000,000 samples, due to the default maximum iteration is 10,000,
      as the result, this package only random load 10,000 samples to memory.

      4. Sequential Data set testing, then the package can support large-scale data set both on training and testing.

      5. Supporting parallel classification (only testing phrase) based on Map-Reduce framework.

      6. Supoorting Multi-classfication based on Map-Reduce framework (whole parallelized version).

      7. Supporting Regression.

      -------------------------------------------------------------------------------------------
      TODO:
      -------------------------------------------------------------------------------------------
      1. Multi-classification Probability Prediction
      2. Performance Testing

      -------------------------------------------------------------------------------------------
      Usage:
      -------------------------------------------------------------------------------------------
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Classification:
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      @@ Training: @@
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      SVMPegasosTraining.java
      The default argument is:

      -tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model

      ~~~~~~~~~~~~~~~~~~~~~~
      @ For the case that training data set on HDFS:@
      ~~~~~~~~~~~~~~~~~~~~~~

      1 Assure that your training data set has been submitted to hdfs
      hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset

      2 revise the argument:
      -tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009

      ~~~~~~~~~~~~~~~~~~~~~~
      @ Multi-class Training [Based on MapReduce Framework]:@
      ~~~~~~~~~~~~~~~~~~~~~~
      bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassifierTrainDriver -if /user/maximzhao/dataset/protein -of /user/maximzhao/protein -m /user/maximzhao/proteinmodel -s 1000000 -c 3 -nor 3 -ms 923179 -mhs -Xmx1000M -ttt 1080

      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      @@ Testing: @@
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

      SVMPegasosTesting.java
      I have hard coded the arguments in this file, if you want to custom the arguments by youself, please uncomment the first line in main function.
      The default argument is:
      -te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model

      ~~~~~~~~~~~~~~~~~~~~~~
      @ Parallel Testing (Classification): @
      ~~~~~~~~~~~~~~~~~~~~~~
      ParallelClassifierDriver.java
      bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelClassifierDriver -if /user/maximzhao/dataset/rcv1_test.binary -of /user/maximzhao/rcv.result -m /user/maximzhao/rcv1.model -nor 1 -ms 241572968 -mhs -Xmx500M -ttt 1080

      ~~~~~~~~~~~~~~~~~~~~~~
      @ Parallel multi-classification: @
      ~~~~~~~~~~~~~~~~~~~~~~
      bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassPredictionDriver -if /user/maximzhao/dataset/protein.t -of /user/maximzhao/proteinpredictionResult -m /user/maximzhao/proteinmodel -c 3 -nor 1 -ms 2226917 -mhs -Xmx1000M -ttt 1080

      Note: the parameter -ms 241572968 is obtained by equation : ms = input files size / number of mapper.

      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Regression:
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      SVMPegasosTraining.java
      -tr ../examples/src/test/resources/svmdataset/abalone_scale -m ../examples/src/test/resources/svmdataset/SVMregression.model -s 1

      -------------------------------------------------------------------------------------------
      Experimental Results:
      -------------------------------------------------------------------------------------------
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Classsification:
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Data set:
      name source type class training size testing size feature
      -----------------------------------------------------------------------------------------------
      rcv1.binary [DL04b] classification 2 20,242 677,399 47,236
      covtype.binary UCI classification 2 581,012 54
      a9a UCI classification 2 32,561 16,281 123
      w8a [JP98a] classification 2 49,749 14,951 300

      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Data set | Accuracy | Training Time | Testing Time |
      rcv1.binary | 94.67% | 19 Sec | 2 min 25 Sec |
      covtype.binary | | 19 Sec | |
      a9a | 84.72% | 14 Sec | 12 Sec |
      w8a | 89.8 % | 14 Sec | 8 Sec |

      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Parallel Classification (Testing)
      Data set | Accuracy | Training Time | Testing Time |
      rcv1.binary | 94.98% | 19 Sec | 3 min 29 Sec (one node)|

      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Parallel Multi-classification Based on MapReduce Framework:
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Data set:
      name | source | type | class | training size | testing size | feature
      -----------------------------------------------------------------------------------------------
      poker | UCI | classification | 10 | 25,010 | 1,000,000 | 10
      protein | [JYW02a] | classification | 3 | 17,766 | 6,621 | 357

      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Data set | Accuracy vs. (Libsvm with linear kernel)
      poker | 50.14 % vs. ( 49.952% ) |
      protein | 68.14% vs. ( 64.93% ) |

      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Regression:
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Data set:
      name | source | type | class | training size | testing size | feature
      -----------------------------------------------------------------------------------------------
      abalone | UCI | regression | 4,177 | | 8
      triazines | UCI | regression | 186 | | 60
      cadata | StatLib | regression | 20,640 | | 8
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
      Data set | Mean Squared error vs. (Libsvm with linear kernel) | Training Time | Test Time |
      abalone | 6.01 vs. (5.25) | 13 Sec |
      triazines | 0.031 vs. (0.0276) | 14 Sec |
      cadata | 5.61 e +10 vs. (1.40 e+10) | 20 Sec |

      Attachments

        1. 0001-Rename-DatastoreSequenenceFile-class.patch
          16 kB
          Viktor Gal
        2. 0002-Renamed-HADOOP_MODLE_PATH-to-HADOOP_MODEL_PATH.patch
          6 kB
          Viktor Gal
        3. 0003-Change-MultiClassifierDrivers-type-to-AbstractJob.patch
          34 kB
          Viktor Gal
        4. 0004-A-script-for-svm-classification-on-20news-dataset.patch
          3 kB
          Viktor Gal
        5. a2a.mvc
          138 kB
          zhao zhendong
        6. Mahout-232-0.8.patch
          258 kB
          zhao zhendong
        7. SequentialSVM_0.1.patch
          2.79 MB
          zhao zhendong
        8. SequentialSVM_0.2.2.patch
          2.80 MB
          zhao zhendong
        9. SequentialSVM_0.3.patch
          3.30 MB
          zhao zhendong
        10. SequentialSVM_0.4.patch
          3.30 MB
          zhao zhendong
        11. SVMDataset.patch
          2.93 MB
          zhao zhendong
        12. SVMonMahout0.5.1.patch
          256 kB
          zhao zhendong
        13. SVMonMahout0.5.patch
          249 kB
          zhao zhendong

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            tdunning Ted Dunning
            maximzhao zhao zhendong
            Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment