Description
When running Spark's multilabel classifier (LogisticRegressionWithLBFGS) on the RCV1-v2 dataset (about 0.5 GB, 100 labels), the classifier runs out of memory. The number of tasks per executor doesn't seem to matter: it happens even with a single task on a 120 GB executor. The dataset is the concatenation of the test files from the "rcv1v2 (topics; full sets)" group here:
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html
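For reference, the single-task-per-executor case can be set up with a configuration along these lines (a sketch only; the app name and exact launch settings below are illustrative, not recorded from the actual runs):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative executor sizing: one 120 GB executor heap with a single
// task slot, matching the worst case described above.
val conf = new SparkConf()
  .setAppName("RCV1-LBFGS-OOM-repro")
  .set("spark.executor.memory", "120g") // large executor heap
  .set("spark.executor.cores", "1")     // at most one concurrent task per executor
val sc = new SparkContext(conf)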
Here's the code:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.optimization.L1Updater
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
import scala.compat.Platform._
val nnodes = 8
val t0 = currentTime
// Load training and test data in LIBSVM format
// (multiclass = true, numFeatures = 276544, minPartitions = nnodes).
val train = MLUtils.loadLibSVMFile(sc, "s3n://bidmach/RCV1train.libsvm", true, 276544, nnodes)
val test = MLUtils.loadLibSVMFile(sc, "s3n://bidmach/RCV1test.libsvm", true, 276544, nnodes)
val t1 = currentTime
val lrAlg = new LogisticRegressionWithLBFGS()
lrAlg.setNumClasses(100).optimizer.
  setNumIterations(10).
  setRegParam(1e-10).
  setUpdater(new L1Updater)
// Run training algorithm to build the model
val model = lrAlg.run(train)
val t2 = currentTime
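For scale, the dense vectors involved in training are large even though the input is sparse. A back-of-envelope sketch (my own arithmetic, assuming MLlib's dense (numClasses - 1) x numFeatures multinomial weight layout):

// Rough size of one dense weight/gradient vector for this problem
// (assumes the (numClasses - 1) * numFeatures multinomial layout).
val numClasses = 100
val numFeatures = 276544
val doublesPerVector = (numClasses - 1).toLong * numFeatures // ~27.4M doubles
val bytesPerVector = doublesPerVector * 8L                   // ~219 MB

Each such vector is roughly 220 MB; L-BFGS keeps a history of several of them on the driver, and each aggregation task materializes a dense gradient of the same size, so the optimizer's working set runs to several GB on top of the cached input.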