Mahout / MAHOUT-334

Proposal for GSoC 2010 (Linear SVM for Mahout)

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.4
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout

      Student: Zhen-Dong Zhao

      Student e-mail: zhaozd@comp.nus.edu.sg

      Student Major: Multimedia Information Retrieval / Computer Science

      Student Degree: Master student
      Graduation: NUS '10
      Organization: Hadoop

      0 Abstract
      The linear Support Vector Machine (SVM) is very useful in applications with large-scale datasets or datasets with high-dimensional features. This proposal will port one of the best-known linear SVM solvers, LIBLINEAR [1], to Mahout with a unified interface shared with Pegasos [2] on Mahout, another linear SVM solver that I have almost finished. The two distinct contributions would be: 1) introducing LIBLINEAR to Mahout; 2) unified interfaces for linear SVM classifiers.

      1 Motivation
      As one of the top 10 algorithms in the data mining community [3], the Support Vector Machine is a very powerful machine learning tool, widely adopted in the data mining, pattern recognition, and information retrieval domains.

      The SVM training procedure is quite slow, however, especially on large-scale datasets. Several recent works propose SVM solvers with a linear kernel that can handle large-scale learning problems, for instance LIBLINEAR [1] and Pegasos [2]. I have implemented a prototype linear SVM classifier based on Pegasos [2] for Mahout (issue MAHOUT-232). Nevertheless, as the winner of the ICML 2008 large-scale learning challenge, linear SVM track (http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] should be incorporated into Mahout as well. Currently, the LIBLINEAR package supports:

      (1) L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)
      (2) L1-regularized classifiers: L2-loss linear SVM and logistic regression (LR)
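
      For reference, these formulations minimize the following standard objectives over training pairs (x_i, y_i), y_i in {-1, +1}, with penalty parameter C, as described in [1]:

      \min_{w}\ \tfrac{1}{2} w^{\top} w + C \sum_{i} \max(0,\, 1 - y_i w^{\top} x_i)              % L2-regularized L1-loss SVM
      \min_{w}\ \tfrac{1}{2} w^{\top} w + C \sum_{i} \max(0,\, 1 - y_i w^{\top} x_i)^2            % L2-regularized L2-loss SVM
      \min_{w}\ \tfrac{1}{2} w^{\top} w + C \sum_{i} \log\bigl(1 + e^{-y_i w^{\top} x_i}\bigr)    % L2-regularized logistic regression
      \min_{w}\ \|w\|_1 + C \sum_{i} \max(0,\, 1 - y_i w^{\top} x_i)^2                            % L1-regularized L2-loss SVM
      \min_{w}\ \|w\|_1 + C \sum_{i} \log\bigl(1 + e^{-y_i w^{\top} x_i}\bigr)                    % L1-regularized logistic regression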

      The main features of LIBLINEAR are the following:
      (1) Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
      (2) Cross validation for model selection
      (3) Probability estimates (logistic regression only)
      (4) Weights for unbalanced data

      All of these functionalities are supposed to be implemented except probability estimates and weights for unbalanced data (if time permits, I would like to implement those as well).

      2 Unified Interfaces
      The linear SVM classifier based on the Pegasos package on Mahout (http://issues.apache.org/jira/browse/MAHOUT-232) already provides the following functionality:

      (1) Sequential Binary Classification (two-class classification), including sequential training and prediction;
      (2) Sequential Regression;
      (3) Parallel & Sequential Multi-Classification, including One-vs.-One and One-vs.-Others schemes.

      The functionality of the Pegasos package on Mahout and that of LIBLINEAR are clearly quite similar. As mentioned above, in this section I will introduce a unified interface for linear SVM classifiers on Mahout, which will incorporate both Pegasos and LIBLINEAR.
      The unified interface has two main parts: 1) the dataset loader; 2) the algorithms. I will introduce them separately.
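
      As an illustration only (the names UnifiedSvmSketch, LabeledSample, DatasetLoader, and LinearSvmTrainer are hypothetical and do not correspond to existing Mahout classes), such a unified interface might look roughly like this:

      import java.io.IOException;

      // Hypothetical sketch of a unified linear SVM interface; none of these
      // type names correspond to existing Mahout classes.
      public final class UnifiedSvmSketch {

        /** One labeled training sample in sparse form. */
        public static final class LabeledSample {
          public final int[] indices;   // feature indices
          public final double[] values; // feature values
          public final double label;    // +1 or -1 for binary classification

          public LabeledSample(int[] indices, double[] values, double label) {
            this.indices = indices;
            this.values = values;
            this.label = label;
          }
        }

        /** Dataset loader abstraction: random access (in memory) or sequential (from disk/HDFS). */
        public interface DatasetLoader {
          /** Returns the next sample, or null when the current pass over the data is finished. */
          LabeledSample next() throws IOException;

          /** Restarts the pass, optionally in a new random order. */
          void reset(boolean shuffle) throws IOException;
        }

        /** Unified trainer interface that a Pegasos or LIBLINEAR solver would implement. */
        public interface LinearSvmTrainer {
          /** Trains a model and returns its weight vector (indexed 0..numFeatures-1). */
          double[] train(DatasetLoader data, int numFeatures) throws IOException;
        }

        /** Prediction with a trained weight vector: the sign of the dot product. */
        public static double predict(double[] weights, LabeledSample x) {
          double score = 0.0;
          for (int k = 0; k < x.indices.length; k++) {
            score += weights[x.indices[k]] * x.values[k];
          }
          return score >= 0 ? +1.0 : -1.0;
        }
      }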

      2.1 Data Handler
      The dataset can be stored on a personal computer or on a Hadoop cluster. This framework provides a high-performance Random Loader and a Sequential Loader for accessing large-scale data.

      2.2 Sequential Algorithms
      The sequential algorithms will include binary classification and regression based on Pegasos and LIBLINEAR, behind the unified interface.

      2.3 Parallel Algorithms
      It is widely accepted that parallelizing a binary SVM classifier is hard. For multi-class classification, however, a coarse-grained scheme (e.g. each Mapper or Reducer runs one independent binary SVM classifier) can achieve a large improvement much more easily. Besides, cross validation for model selection can also take advantage of such coarse-grained parallelism. I will introduce a unified interface for all of them.

      3 Biography
      I am a graduating master's student in multimedia information retrieval at the National University of Singapore. My research involves large-scale SVM classifiers.

      I have worked with Hadoop and MapReduce for about a year, and I have dedicated a lot of my spare time to the sequential SVM (Pegasos) for Mahout (http://issues.apache.org/jira/browse/MAHOUT-232). I have taken part in setting up and maintaining a Hadoop cluster of around 70 nodes in our group.

      4 Timeline:
      Weeks 1-4 (May 24 ~ June 18): Implement the binary classifier

      Weeks 5-7 (June 21 ~ July 12): Implement parallel multi-class classification and cross validation for model selection

      Week 8 (July 12 ~ July 16): Submit the mid-term evaluation

      Weeks 9-11 (July 16 ~ August 9): Interface refactoring and performance tuning

      Weeks 11-12 (August 9 ~ August 16): Code cleanup, documentation, and testing

      5 References
      [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. J. Mach. Learn. Res., 9:1871-1874, 2008.

      [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for svm. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 807-814, New York, NY, USA, 2007. ACM.

      [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst., 14(1):1-37, 2007.

      1. Utils_LibsvmFormat_Convertor.patch
        19 kB
        zhao zhendong
      2. Mahout-issue334-0.5.patch
        199 kB
        zhao zhendong
      3. Mahout-issue334-0.3.patch
        171 kB
        zhao zhendong
      4. Mahout-issue334-0.2.patch
        164 kB
        zhao zhendong
      5. Mahout-issue334.patch
        248 kB
        zhao zhendong
      6. Mahout-issue334.patch
        245 kB
        zhao zhendong

        Activity

        zhao zhendong added a comment -

        Proposal Title: Linear SVM Package (LIBLINEAR) for Mahout

        Student Name: Zhendong Zhao

        Student E-mail: zhaozd@comp.nus.edu.sg

        Organization/Project:
        Assigned Mentor:

        Proposal Abstract:
        The linear Support Vector Machine (SVM) is very useful in plenty of applications with large-scale datasets or datasets with high-dimensional features. This proposal will port one of the best-known linear SVM solvers, LIBLINEAR [1], to Mahout with a unified interface shared with Pegasos [2] on Mahout, another linear SVM solver that I have almost finished (MAHOUT-232). The two distinct contributions would be: 1) introducing LIBLINEAR to Mahout; 2) unified interfaces for linear SVM classifiers on Mahout.

        Detailed Description:

        1 Motivation
        As one of the top 10 algorithms in the data mining community [3], the Support Vector Machine is a very powerful machine learning tool, widely adopted in the data mining, pattern recognition, and information retrieval domains.

        The SVM training procedure is quite slow, however, especially on large-scale datasets. Several recent works propose SVM solvers with a linear kernel that can handle large-scale learning problems, for instance LIBLINEAR [1] and Pegasos [2]. I have already implemented a prototype linear SVM classifier based on Pegasos [2] for Mahout (issue MAHOUT-232, http://issues.apache.org/jira/browse/MAHOUT-232). Nevertheless, as the winner of the ICML 2008 large-scale learning challenge, linear SVM track (http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] should be incorporated into Mahout as well.

        2 Functionalities to Be Implemented
        Currently, LIBLINEAR package supports:
        — L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)
        — L1-regularized classifiers: L2-loss linear SVM and logistic regression (LR)

        The main features of LIBLINEAR are the following:
        — Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
        — Cross validation for model selection
        — Probability estimates (logistic regression only)
        — Weights for unbalanced data

        All of these functionalities are supposed to be implemented except probability estimates and weights for unbalanced data (if time permits, I would like to implement those as well); the Crammer & Singer scheme for multi-class classification will serve as the replacement for the one-vs.-one method. Thus the L1- and L2-regularized/loss linear binary SVM solvers, LR, multi-class classification, and cross validation for model selection will be included in my proposal.

        3 Implementation Details
        The linear SVM classifier based on the Pegasos package on Mahout (http://issues.apache.org/jira/browse/MAHOUT-232) already provides the following functionality:
        — Sequential Binary Classification (two-class classification), including sequential training and prediction;
        — Sequential Regression;
        — Parallel & Sequential Multi-Classification, including One-vs.-One and One-vs.-Others schemes.

        The functionality of the Pegasos package on Mahout and that of LIBLINEAR are clearly quite similar. As mentioned above, in this section I will introduce a unified interface for linear SVM classifiers on Mahout, which will incorporate both Pegasos and LIBLINEAR. The whole picture of the interfaces is illustrated in Figure 1:

        http://sites.google.com/site/zhaozhendong2/home/project/framework-gsoc-small.png
        Figure 1: The framework of linear SVM on Mahout

        The unified interface has two main parts: 1) the dataset loader, 2) the algorithms. I will introduce them separately.

        3.1 Data Handler
        Since all the packages in Mahout Core are based on the high-performance Vector classes (DenseVector or SparseVector), the data handler is also based on them. The dataset (vectors) can be stored on a personal computer or on a Hadoop cluster. This framework provides a high-performance Random Loader and a Sequential Loader for accessing large-scale data.
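
        A minimal sketch of the two loading styles, under the assumption that samples are plain double arrays (a real implementation would use Mahout Vector types and HDFS input streams); the class names are illustrative only:

        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.Iterator;
        import java.util.List;
        import java.util.Random;

        // Illustrative only: RandomLoader keeps all samples in memory and hands them
        // out in a shuffled order; SequentialLoader streams one pass in storage order.
        public class LoaderSketch {

          /** In-memory loader supporting random (shuffled) access for solvers that need it. */
          public static class RandomLoader {
            private final List<double[]> samples;
            private final Random rng = new Random();

            public RandomLoader(List<double[]> samples) {
              this.samples = new ArrayList<>(samples);
            }

            /** Returns an iterator over all samples in a fresh random order. */
            public Iterator<double[]> shuffledPass() {
              List<double[]> copy = new ArrayList<>(samples);
              Collections.shuffle(copy, rng);
              return copy.iterator();
            }
          }

          /** Streaming loader: yields samples one by one, never holding the whole set in memory. */
          public static class SequentialLoader {
            private final Iterator<double[]> source;

            public SequentialLoader(Iterator<double[]> source) {
              this.source = source;
            }

            /** Returns the next sample, or null when the pass is finished. */
            public double[] next() {
              return source.hasNext() ? source.next() : null;
            }
          }
        }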

        3.2 Sequential Algorithms
        The sequential algorithms will include binary classification and regression based on Pegasos and LIBLINEAR, behind the unified interface.

        3.3 Parallel Algorithms
        It is widely accepted that parallelizing a binary SVM classifier is hard. For multi-class classification, however, a coarse-grained scheme (e.g. each Mapper or Reducer runs one independent binary SVM classifier) can achieve an improvement much more easily. Besides, cross validation for model selection can also take advantage of such coarse-grained parallelism. I will introduce a unified interface for all of them.

        3.3.1 Multi-class Classification Based on the MapReduce Framework
        In SVM, multi-class classification can be decomposed into a set of binary classifiers, and these classifiers are independent of each other; in this sense, multi-class classification can take advantage of the MapReduce framework.

        http://sites.google.com/site/zhaozhendong2/home/project/multi-classification-small.jpg
        Figure 2. Parallel Multi-class Classifier

        This package will include two distinct schemes: (1) one versus one (One-vs.-One); (2) one versus others (One-vs.-Others). We will explain the latter scheme, since it is a bit easier to understand.

        An intuitive example comes first. As you can see in Figure 2, the Mappers act as an emit controller: each sample is emitted N times, once per category label, where N is the number of categories in the dataset.

        The class label is emitted as the key of the Mappers' output. After sorting, all samples with the same class label sink into the same Reducer. The samples in one Reducer now carry "+1" or "-1" labels, where "+1" denotes samples within the category in question and "-1" represents all samples belonging to the remaining categories.

        The Reducer then calls a binary SVM classifier to train a model for that category and emits the model as the Reducer's output.
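
        As a minimal sketch only (NUM_CATEGORIES, the assumed input line format, and the trainBinarySvm stub are placeholders for illustration, not part of Mahout or LIBLINEAR), the one-vs.-others scheme above could be expressed with the Hadoop MapReduce API roughly as:

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        // Hypothetical sketch of the one-vs.-others scheme described above.
        public class OneVsOthersSketch {

          /** Emits every input sample once per category, re-labelled as +1 or -1. */
          public static class EmitControllerMapper
              extends Mapper<LongWritable, Text, IntWritable, Text> {
            private static final int NUM_CATEGORIES = 20; // placeholder

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
              // Assumed input line format: "<categoryLabel>\t<featureVector>"
              String[] parts = line.toString().split("\t", 2);
              int category = Integer.parseInt(parts[0]);
              String features = parts[1];
              for (int c = 0; c < NUM_CATEGORIES; c++) {
                String binaryLabel = (c == category) ? "+1" : "-1";
                // Key = the category whose binary model this sample contributes to.
                context.write(new IntWritable(c), new Text(binaryLabel + "\t" + features));
              }
            }
          }

          /** Each Reducer sees all re-labelled samples for one category and trains one binary model. */
          public static class BinaryTrainerReducer
              extends Reducer<IntWritable, Text, IntWritable, Text> {
            @Override
            protected void reduce(IntWritable category, Iterable<Text> samples, Context context)
                throws IOException, InterruptedException {
              String model = trainBinarySvm(samples); // placeholder for a LIBLINEAR/Pegasos call
              context.write(category, new Text(model));
            }

            private String trainBinarySvm(Iterable<Text> samples) {
              // A real implementation would run a binary linear SVM solver here and
              // serialize its weight vector; this stub only counts the samples.
              int n = 0;
              for (Text ignored : samples) {
                n++;
              }
              return "model-trained-on-" + n + "-samples";
            }
          }
        }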

        3.3.2 Parallel Model Selection
        Similar to multi-class classification, SVM model selection is composed of a set of independent binary classifier runs. Thus, we may leverage the MapReduce framework to accelerate the process of model selection.

        4 Biography
        I am a graduating master's student in multimedia information retrieval at the National University of Singapore. My research involves large-scale SVM classifiers.

        I have worked with Hadoop and MapReduce for about a year, and I have dedicated a lot of my spare time to the sequential SVM (Pegasos) for Mahout (http://issues.apache.org/jira/browse/MAHOUT-232). I have taken part in setting up and maintaining a Hadoop cluster of around 75 nodes in our group.

        5 Timeline
        Weeks 1-4 (May 24 ~ June 18): Implement the binary classifier

        Weeks 5-7 (June 21 ~ July 12): Implement parallel multi-class classification and cross validation for model selection

        Week 8 (July 12 ~ July 16): Submit the mid-term evaluation

        Weeks 9-11 (July 16 ~ August 9): Interface refactoring and performance tuning

        Weeks 11-12 (August 9 ~ August 16): Code cleanup, documentation, and testing

        6 References
        [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. J. Mach. Learn. Res., 9:1871-1874, 2008.

        [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for svm. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 807-814, New York, NY, USA, 2007. ACM.

        [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst., 14(1):1-37, 2007.

        Additional Information:
        PDF edition of this Proposal:
        http://sites.google.com/site/zhaozhendong2/home/project/GSoC2010-SVMonMahout.pdf?attredirects=0&d=1

        zhao zhendong added a comment -

        Is there any suggestion or comment on my proposal?

        Ted Dunning added a comment -

        There is little to say. This is an excellent proposal. You clearly have the background and the capability to succeed.

        A key limitation will be whether we have enough mentors.

        zhao zhendong added a comment -

        LIBSVM format convertor.

        LIBSVM format file -> either a SequenceFile of Vectors or a file using JSON format.

        Labels in LIBSVM datasets will be separated into files with the suffix ".labels".
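
        A rough sketch of this split, assuming the input is standard LIBSVM text ("<label> <index>:<value> ...") and using a JSON-like text file as a stand-in for the real SequenceFile/JSON writers in the patch; class and file names are illustrative only:

        import java.io.BufferedReader;
        import java.io.BufferedWriter;
        import java.io.FileReader;
        import java.io.FileWriter;
        import java.io.IOException;

        // Illustrative sketch: feature vectors go to "<input>.vectors" as JSON-like
        // text, labels go to "<input>.labels", one line per sample in both files.
        public class LibsvmSplitSketch {
          public static void main(String[] args) throws IOException {
            String input = args[0];
            try (BufferedReader in = new BufferedReader(new FileReader(input));
                 BufferedWriter vectors = new BufferedWriter(new FileWriter(input + ".vectors"));
                 BufferedWriter labels = new BufferedWriter(new FileWriter(input + ".labels"))) {
              String line;
              while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                  continue; // skip blank lines
                }
                // LIBSVM line format: "<label> <index>:<value> <index>:<value> ..."
                String[] tokens = line.trim().split("\\s+");
                labels.write(tokens[0]);
                labels.newLine();
                StringBuilder json = new StringBuilder("{");
                for (int i = 1; i < tokens.length; i++) {
                  String[] kv = tokens[i].split(":");
                  if (i > 1) {
                    json.append(',');
                  }
                  json.append('"').append(kv[0]).append("\":").append(kv[1]);
                }
                json.append('}');
                vectors.write(json.toString());
                vectors.newLine();
              }
            }
          }
        }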

        zhao zhendong added a comment -

        I just finished porting the sequential binary and multi-class classifiers (Crammer and Singer) based on the C++ and Java implementations of LIBLINEAR:

        1: 1) Saving & loading the model
        2 (31 May ~ 4 June): 1) L1-regularized L2-loss SVM primal; 2) L2-regularized L1-loss and L2-loss SVM dual problems
        3 (7 June ~ 11 June): 1) L1- and L2-regularized logistic regression problems
        4 (14 June ~ 18 June): 1) Prediction; 2) Training main entry; 3) L1-regularized L2-loss problem
        5 (21 June ~ 25 June): Multi-class SVM by Crammer and Singer

        It still needs a lot of work to clean up and refactor the code.

        zhao zhendong added a comment -

        Effectiveness testing:

        Data set: a2a, EPS 0.1, 5-fold cross validation
        Method | LIBLINEAR C++ (accuracy) | LIBLINEAR on Mahout (accuracy)
        L2-regularized logistic regression | 81.6777% | 83.5320%
        L2-regularized L2-loss classification (dual) | 81.7219% | 84.1501%
        L2-regularized L2-loss classification (primal) | 81.766% | 83.1347%
        L2-regularized L1-loss classification (dual) | 81.6336% | 84.1501%
        L1-regularized L2-loss classification | 81.766% | 83.8411%
        L1-regularized logistic regression | 81.8543% | 83.3996%

        Data set: News20.binary, EPS 0.1, 5-fold cross validation
        Method | LIBLINEAR C++ (accuracy) | LIBLINEAR on Mahout (accuracy)
        L2-regularized logistic regression | 93.4837% | 97.1994%
        L2-regularized L2-loss classification (dual) | 96.6243% | 99.875%
        L2-regularized L2-loss classification (primal) | 96.7443% | 99.87%
        L2-regularized L1-loss classification (dual) | 96.7744% | 99.5499%
        L1-regularized L2-loss classification | 94.5389% | 97.5195%
        L1-regularized logistic regression | 91.1082% | 92.8786%

        Data set: Rcv1_train.binary, EPS 0.1, 5-fold cross validation
        Method | LIBLINEAR C++ (accuracy) | LIBLINEAR on Mahout (accuracy)
        L2-regularized logistic regression | 96.2751% | 97.683%
        L2-regularized L2-loss classification (dual) | 97.0754% | 99.6690%
        L2-regularized L2-loss classification (primal) | 97.0161% | 99.3331%
        L2-regularized L1-loss classification (dual) | 96.9865% | 98.9675%
        L1-regularized L2-loss classification | 96.6703% | 98.8489%
        L1-regularized logistic regression | 95.7267% | 96.8531%

        Ted Dunning added a comment -

        These latest accuracy values on 20 newsgroups seem waaay too high. In particular, Rennie quoted an accuracy of ~ 86% for SVM on this problem in his naive bayes paper, but you are showing 99+% accuracy. In my experience getting 99.9% right on any real problem has always implied that I have introduced a target leak. The RCV1 numbers are similarly surprising.

        Can you say a little bit more about where these data sets came from and how they were pre-processed? Which 20newsgroup dataset? Which header fields were suppressed? Was it the binary version of the 20 newsgroup task described by Keerthi and DeCoste (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.85.4422)?

        Similarly, which RCV1 dataset? How was it processed?

        Can you attach a link to the exact datasets you used with a description of the format and what the problem really is? Are they from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ ?

        zhao zhendong added a comment -

        Sorry for late reply.

        Yes, it is a bit surprising. At the same time, we tested both datasets with the standard LIBLINEAR and also got pretty good accuracy, around 97+%.

        All datasets are downloaded from libsvm/datasets.
        http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

        For News20, we use news20.binary, which can be found at the link. The details are:

        Source: [SSK05a] (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ref.html#SSK05a)
        Preprocessing: Each instance has unit length.
        # of classes: 2
        # of data: 19,996
        # of features: 1,355,191
        Files: news20.binary.bz2 (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/news20.binary.bz2)

        For Rcv1_train.binary, we first scaled it using libsvm/train-scale; the details are:

        Source: [DL04b] (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ref.html#DL04b)
        Preprocessing: positive: CCAT, ECAT; negative: GCAT, MCAT; instances in both positive and negative classes are removed.
        # of classes: 2
        # of data: 20,242 / 677,399 (testing)
        # of features: 47,236
        Files: rcv1_train.binary.bz2 (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2)


        zhao zhendong added a comment -

        Three major revisions:
        1) Fixed a bug; the previous accuracies were incorrect. We now get almost the same accuracy as the original implementation.

        2) Performance improvement. For example, on news20 the original C/C++ implementation takes 9.4 s while our method takes 33.1 s using the L2-regularized L2-loss SVM. Further improvement will be made in the coming weeks.

        3) This package now supports reading data from disk instead of loading all samples into memory.

        Revised effectiveness testing results:

        Data set: a2a, EPS 0.1, 5-fold cross validation
        Method | LIBLINEAR C++ (accuracy) | LIBLINEAR on Mahout (accuracy)
        L2-regularized logistic regression | 81.6777% | 81.9426%
        L2-regularized L2-loss classification (dual) | 81.7219% | 82.3841%
        L2-regularized L2-loss classification (primal) | 81.766% | 81.9867%
        L2-regularized L1-loss classification (dual) | 81.6336% | 82.2958%
        Multi-class by Crammer and Singer | 81.8102% | 81.8543%

        Data set: News20.binary, EPS 0.1, 5-fold cross validation
        Method | LIBLINEAR C++ (accuracy) | LIBLINEAR on Mahout (accuracy)
        L2-regularized logistic regression | 93.4837% | 93.5987%
        L2-regularized L2-loss classification (dual) | 96.6243% | 96.6593%
        L2-regularized L2-loss classification (primal) | 96.7443% | 96.7143%
        L2-regularized L1-loss classification (dual) | 96.7744% | 96.8043%
        Multi-class by Crammer and Singer | 96.8844% | 96.9593%

        Data set: Rcv1_train.binary, EPS 0.1, 5-fold cross validation
        Method | LIBLINEAR C++ (accuracy) | LIBLINEAR on Mahout (accuracy)
        L2-regularized logistic regression | 96.2751% | 96.2355%
        L2-regularized L2-loss classification (dual) | 97.0754% | 97.0111%
        L2-regularized L2-loss classification (primal) | 97.0161% | 96.90247%
        L2-regularized L1-loss classification (dual) | 96.9865% | 96.9321%
        Multi-class by Crammer and Singer | 96.8844% | 96.95188%

        zhao zhendong added a comment - edited

        Could we load the dataset into HBase for further random access?

        Here is the overall data loading behavior of LIBLINEAR:

        for i = 1 : maximum_iterations
            shuffle the samples (note: only the index permutation changes)
            for j = 1 : all samples in the dataset, in the shuffled order
                solve the optimization subproblem for sample j
            end
        end

        Normally, it takes 3 ~ 4 iterations.
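
        As an illustration only (solveSubproblemFor is a placeholder, not a LIBLINEAR or Mahout API), the shuffle-then-iterate pattern above looks roughly like this in Java:

        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.List;
        import java.util.Random;

        // Illustrative sketch of the access pattern above: the data stays in place and
        // only an index permutation is shuffled before each outer iteration.
        public class ShuffledPassSketch {
          public static void run(int numSamples, int maxIterations) {
            List<Integer> order = new ArrayList<>();
            for (int i = 0; i < numSamples; i++) {
              order.add(i);
            }
            Random rng = new Random(42);
            for (int iter = 0; iter < maxIterations; iter++) {
              Collections.shuffle(order, rng);     // permute indices only
              for (int sampleIndex : order) {
                solveSubproblemFor(sampleIndex);   // one per-sample optimization step
              }
            }
          }

          private static void solveSubproblemFor(int sampleIndex) {
            // Placeholder for the per-sample optimization step.
          }
        }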

        By the way, one interesting paper was just published at KDD 2010:
        http://www.csie.ntu.edu.tw/~cjlin/papers/kdd_disk_decomposition.pdf
        They split the dataset into sequential chunks and load one block of data into memory at a time.

        zhao zhendong added a comment -

        Added a parallel multi-class classifier to this package.

        So far, almost all functionality is done except parallel parameter selection.

        In the case of LIBLINEAR, which needs all training samples in each training process, it is difficult to leverage the MapReduce framework to improve performance, including for multi-class classification and parameter selection.

        Although I have written a parallel classifier for the multi-class problem using the same framework as Pegasos (MAHOUT-232), I believe a parallel LIBLINEAR could be of little use within such a framework.

        The next scope of LIBLINEAR on Mahout is supposed to focus on sequential learning with large-scale datasets. One interesting paper has been published by the LIBSVM (LIBLINEAR) group, as listed in my last post.

        zhao zhendong added a comment -

        Documentation and code cleanup.

        Conclusion

        The implementation of LIBLINEAR on Mahout teaches us a lesson.

        1) The current parallel framework is not suitable for LIBLINEAR:
        Unlike the implementation of Pegasos (MAHOUT-232), LIBLINEAR requires the whole dataset to optimize the primal or dual equation. The pros and cons are listed as follows:
        Pros:
        All classifiers in LIBLINEAR are quite stable. Whenever we train on a dataset with a given classifier, we always get exactly the same objective value and accuracy.

        Cons:
        The requirement for the whole dataset could limit the usability of LIBLINEAR, especially when we need to train a classifier on an extremely large-scale dataset.

        In the case of LIBLINEAR, which needs all training samples in each training process, it is difficult to leverage the MapReduce framework to improve performance, including for multi-class classification and parameter selection.

        Although I have written a parallel classifier for the multi-class problem using the same framework as Pegasos (MAHOUT-232), I believe a parallel LIBLINEAR could be of little use within such a framework.

        2) The L1-regularized classifiers (L2-loss linear SVM and logistic regression) are not suitable for large-scale datasets:
        Both involve a transpose operation on the whole dataset matrix (columns denote features, one row per sample). In this sense, they can hardly be applied to large-scale data.

        3) Future work:
        The next scope of LIBLINEAR on Mahout is to focus on sequential learning with large-scale datasets. One interesting paper was just published at KDD 2010:
        http://www.csie.ntu.edu.tw/~cjlin/papers/kdd_disk_decomposition.pdf

        Sean Owen added a comment -

        Was this committed / is it OK to commit now? Just cleaning house for 0.4.

        Sean Owen added a comment -

        Robin, Zhao, is the final patch the one attached? I'm not going to commit it without someone nodding – would instead move to 0.5 for more thinking later.

        Sean Owen added a comment -

        Guys, is this commitable? This has been sitting around for over 4 months. If it's good to go it should be in for sure.

        Ted Dunning added a comment -

        It didn't sound real ready to me. Having a non-scalable SVM just replicates capabilities elsewhere like in R.

        I do think that we should lift the evaluation metrics and apply them to SGD even if this code doesn't get committed as is.

        Sean Owen added a comment -

        Looks like there has been no activity on this for 6 months, since GSoC finished, from the student or sponsor. Ted suggests it's not really committable or needed.

        That's not a good outcome. We should be clear that the end product is indeed going to be used, and then do the work to put it in. Worth thinking about as GSoC 2011 opens up.

        Michael Lugassy added a comment -

        Very interested to see this committed, any news?

        Robin Anil added a comment -

        I will take a look at this over the weekend to clean it up and propose a patch. I will separate the evaluation metrics from the algorithm. Then we can decide its fate.


          People

          • Assignee: Robin Anil
          • Reporter: zhao zhendong
          • Votes: 0
          • Watchers: 4

            Dates

            • Created:
              Updated:
              Resolved:

              Development