[MAHOUT-334] Proposal for GSoC2010 (Linear SVM for Mahout) - ASF JIRA

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 0.4
Fix Version/s: None
Component/s: None
Labels:
- gsoc
- gsoc2010

Description

Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout

Student: Zhen-Dong Zhao

Student e-mail: zhaozd@comp.nus.edu.sg

Student Major: Multimedia Information Retrieval / Computer Science

Student Degree: Master

Student Graduation: NUS'10

Organization: Hadoop

0 Abstract
Linear Support Vector Machines (SVMs) are useful for applications with large-scale datasets or high-dimensional features. This proposal will port one of the best-known linear SVM solvers, LIBLINEAR [1], to Mahout, with the same unified interface as Pegasos [2] on Mahout, another linear SVM solver that I have almost finished implementing. The two distinct contributions would be: 1) introducing LIBLINEAR to Mahout; 2) unified interfaces for linear SVM classifiers.

1 Motivation
As one of the top 10 algorithms in the data mining community [3], the Support Vector Machine is a very powerful machine learning tool, widely adopted in the data mining, pattern recognition, and information retrieval domains.

SVM training is slow, however, especially on large-scale datasets. Several recent papers propose linear-kernel SVM solvers that can handle large-scale learning problems, for instance LIBLINEAR [1] and Pegasos [2]. I have implemented a prototype linear SVM classifier based on Pegasos [2] for Mahout (issue: MAHOUT-232). As the winner of the linear SVM track of the ICML 2008 large-scale learning challenge (http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] should be incorporated into Mahout too. Currently, the LIBLINEAR package supports:

(1) L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)
(2) L1-regularized classifiers: L2-loss linear SVM and logistic regression (LR)
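
As a rough illustration of the L2-regularized objectives in (1), the sketch below evaluates 0.5 * ||w||^2 + C * (sum of per-example losses) for the three loss functions. It is plain Python written for this note, not LIBLINEAR code; the function names and the default `C=1.0` are assumptions made for the example.

```python
import math

def dot(w, x):
    """Inner product of two dense vectors of equal length."""
    return sum(wi * xi for wi, xi in zip(w, x))

def l1_loss(margin):
    """Hinge loss max(0, 1 - y * w.x), where margin = y * w.x."""
    return max(0.0, 1.0 - margin)

def l2_loss(margin):
    """Squared hinge loss max(0, 1 - y * w.x)^2."""
    return max(0.0, 1.0 - margin) ** 2

def logistic_loss(margin):
    """Logistic loss log(1 + exp(-y * w.x))."""
    return math.log1p(math.exp(-margin))

def l2_regularized_objective(w, data, loss, C=1.0):
    """0.5 * ||w||^2 + C * sum of losses; data holds (x, y) with y in {-1, +1}."""
    return 0.5 * dot(w, w) + C * sum(loss(y * dot(w, x)) for x, y in data)
```

Passing `l1_loss`, `l2_loss`, or `logistic_loss` as the `loss` argument switches between the three classifiers while the regularization term stays the same.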

The main features of LIBLINEAR are the following:
(1) Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
(2) Cross validation for model selection
(3) Probability estimates (logistic regression only)
(4) Weights for unbalanced data

All of these functionalities are to be implemented except probability estimates and weights for unbalanced data (time permitting, I would like to implement those as well).
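
Cross validation for model selection can be sketched as below. This is a minimal illustration, not the planned Mahout code: `train_fn`, `accuracy_fn`, and `make_train_fn` are hypothetical callables supplied by the caller, and the folds are contiguous, so the data is assumed to be shuffled beforehand.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_examples:
        raise ValueError("k must be between 2 and n_examples")
    base, extra = divmod(n_examples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional example.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_examples))
        yield train, test
        start += size

def cross_validate(data, k, train_fn, accuracy_fn):
    """Average held-out accuracy of train_fn over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(data), k):
        model = train_fn([data[i] for i in train_idx])
        scores.append(accuracy_fn(model, [data[i] for i in test_idx]))
    return sum(scores) / len(scores)

def select_parameter(data, k, candidates, make_train_fn, accuracy_fn):
    """Return the candidate value with the highest cross-validated accuracy."""
    return max(
        candidates,
        key=lambda c: cross_validate(data, k, make_train_fn(c), accuracy_fn),
    )
```

Each fold (and each candidate parameter value) is evaluated independently of the others, which is what makes this step a natural fit for coarse-grained parallelism.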

2 Unified Interfaces
The linear SVM classifier based on the Pegasos package on Mahout already provides the following functionalities (http://issues.apache.org/jira/browse/MAHOUT-232):

(1) Sequential binary classification (two-class classification), including sequential training and prediction;
(2) Sequential regression;
(3) Parallel and sequential multi-class classification, including One-vs.-One and One-vs.-Others schemes.

The functionalities of the Pegasos package on Mahout and of LIBLINEAR are evidently quite similar. In this section I introduce a unified interface for linear SVM classifiers on Mahout, which will incorporate both Pegasos and LIBLINEAR.
The unified interface has two main parts: 1) dataset loader; 2) algorithms. I introduce them separately below.

2.1 Data Handler
The dataset can be stored on a personal computer or on a Hadoop cluster. The framework provides a high-performance Random Loader and Sequential Loader for accessing large-scale data.
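
A sequential loader might look like the sketch below. It assumes the input is in the LIBSVM sparse text format (`label index:value ...`) that LIBLINEAR reads; the proposal does not specify the on-disk format, and the function names are hypothetical. The loader is a generator, so examples are parsed one at a time rather than held in memory.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features

def sequential_loader(lines):
    """Lazily yield parsed examples from an iterable of lines, skipping blanks."""
    for line in lines:
        line = line.strip()
        if line:
            yield parse_libsvm_line(line)
```

Because an open file object is itself an iterable of lines, the loader can be used as `for label, features in sequential_loader(open(path)):` on a local file.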

2.2 Sequential Algorithms
Sequential algorithms will include binary classification and regression based on Pegasos and LIBLINEAR, behind a unified interface.
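
To make the idea of a unified interface concrete, here is a minimal sketch in which a Pegasos-style trainer sits behind a common abstract class. The class and method names (`LinearTrainer`, `PegasosTrainer`, `train`, `predict`) are hypothetical and are not the Mahout API. The update follows the basic Pegasos step from [2], with step size 1/(lambda * t) and without the optional projection step.

```python
import random
from abc import ABC, abstractmethod

class LinearTrainer(ABC):
    """Hypothetical common interface for linear solvers (not a Mahout API)."""

    @abstractmethod
    def train(self, data, n_features):
        """Return a dense weight vector for (x, y) pairs with y in {-1, +1}."""

class PegasosTrainer(LinearTrainer):
    """Stochastic sub-gradient solver for the L2-regularized hinge loss."""

    def __init__(self, lam=0.1, iterations=1000, seed=0):
        self.lam = lam
        self.iterations = iterations
        self.rng = random.Random(seed)

    def train(self, data, n_features):
        w = [0.0] * n_features
        for t in range(1, self.iterations + 1):
            x, y = self.rng.choice(data)      # sample one training example
            eta = 1.0 / (self.lam * t)        # decreasing step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            shrink = 1.0 - 1.0 / t            # equals 1 - eta * lam
            w = [shrink * wi for wi in w]
            if margin < 1.0:                  # hinge loss is active
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
        return w

def predict(w, x):
    """Sign of the decision value w.x, returned as +1 or -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
```

A LIBLINEAR-based solver would be a second subclass of the same base class, so calling code could switch solvers without other changes.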

2.3 Parallel Algorithms
It is widely accepted that parallelizing a binary SVM classifier is hard. For multi-class classification, however, a coarse-grained scheme (e.g. each Mapper or Reducer runs one independent binary SVM classifier) can achieve a large speedup more easily. Cross validation for model selection can take advantage of the same coarse-grained parallelism. I will introduce a unified interface for all of them.
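
The coarse-grained scheme can be illustrated with a one-vs-rest sketch in which each class gets its own independent training task. This is an assumption-laden stand-in: a local thread pool plays the role of the Mappers, the binary trainer is a simple deterministic Pegasos-style pass, and none of the names come from Mahout or Hadoop. Python threads do not give real CPU parallelism here; the point is only that the tasks share no state.

```python
from concurrent.futures import ThreadPoolExecutor

def dot(w, x):
    """Inner product of two dense vectors of equal length."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_binary(data, n_features, lam=0.1, epochs=20):
    """Pegasos-style sub-gradient passes over (x, y) pairs, y in {-1, +1}."""
    w = [0.0] * n_features
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)
            violated = y * dot(w, x) < 1.0
            w = [(1.0 - 1.0 / t) * wi for wi in w]
            if violated:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def train_one_vs_rest(data, n_features, classes, max_workers=None):
    """Train one independent binary model per class (one task per class)."""
    def task(cls):
        # Relabel: the target class becomes +1, every other class -1.
        relabeled = [(x, 1 if label == cls else -1) for x, label in data]
        return cls, train_binary(relabeled, n_features)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, classes))

def predict_one_vs_rest(models, x):
    """Return the class whose binary model gives the largest decision value."""
    return max(models, key=lambda cls: dot(models[cls], x))
```

On a Hadoop cluster, the body of `task` is the kind of work that would run inside a single Mapper or Reducer, with the class label as the key.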

3 Biography:
I am a graduating master's student in Multimedia Information Retrieval Systems at the National University of Singapore. My research has involved large-scale SVM classifiers.

I have worked with Hadoop and MapReduce for the past year, and I have dedicated much of my spare time to the sequential SVM (Pegasos) for Mahout (http://issues.apache.org/jira/browse/MAHOUT-232). I have taken part in setting up and maintaining a Hadoop cluster of around 70 nodes in our group.

4 Timeline:
Weeks 1-4 (May 24 ~ June 18): Implement binary classifier

Weeks 5-7 (June 21 ~ July 12): Implement parallel multi-class classification and cross validation for model selection.

Week 8 (July 12 ~ July 16): Submit mid-term evaluation

Weeks 9-11 (July 16 ~ August 9): Interface refactoring and performance tuning

Weeks 11-12 (August 9 ~ August 16): Code cleanup, documentation, and testing.

5 References
[1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871-1874, 2008.

[2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 807-814, New York, NY, USA, 2007. ACM.

[3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst., 14(1):1-37, 2007.

Attachments

- Mahout-issue334.patch (14/Aug/10 08:58, 245 kB, zhao zhendong)
- Mahout-issue334.patch (09/Aug/10 08:38, 248 kB, zhao zhendong)
- Mahout-issue334-0.2.patch (21/Jun/10 17:19, 164 kB, zhao zhendong)
- Mahout-issue334-0.3.patch (14/Jul/10 11:25, 171 kB, zhao zhendong)
- Mahout-issue334-0.5.patch (29/Jul/10 15:53, 199 kB, zhao zhendong)
- Utils_LibsvmFormat_Convertor.patch (30/May/10 17:19, 19 kB, zhao zhendong)