Spark / SPARK-6864

Spark's Multilabel Classifier runs out of memory on small datasets

    Details

    • Type: Test
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.2.1
    • Fix Version/s: 1.2.1
    • Component/s: MLlib
    • Labels:
      None
    • Environment:

      EC2 with 8-96 instances up to r3.4xlarge
      The test fails on every configuration

    • Target Version/s:

      Description

      When running Spark's MultiLabel classifier (LogisticRegressionWithLBFGS) on the RCV1 V2 dataset (about 0.5 GB, 100 labels), the classifier runs out of memory. The number of tasks per executor does not seem to matter: it happens even with a single task per 120 GB executor. The dataset is the concatenation of the test files from the "rcv1v2 (topics; full sets)" group here:
      http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html

      Here's the code:

      import org.apache.spark.SparkContext
      import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
      import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
      import org.apache.spark.mllib.optimization.L1Updater
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.util.MLUtils
      import scala.compat.Platform._

      // `sc` is the SparkContext provided by spark-shell.
      val nnodes = 8

      val t0 = currentTime
      // Load training and test data in LIBSVM format
      // (multiclass = true, 276544 features, at least nnodes partitions).
      val train = MLUtils.loadLibSVMFile(sc, "s3n://bidmach/RCV1train.libsvm", true, 276544, nnodes)
      val test = MLUtils.loadLibSVMFile(sc, "s3n://bidmach/RCV1test.libsvm", true, 276544, nnodes)

      val t1 = currentTime

      // Logistic regression with 100 classes, trained with L-BFGS.
      val lrAlg = new LogisticRegressionWithLBFGS()

      // Trailing dots keep the chained call valid when pasted into spark-shell.
      lrAlg.setNumClasses(100).optimizer.
        setNumIterations(10).
        setRegParam(1e-10).
        setUpdater(new L1Updater)

      // Run training algorithm to build the model
      val model = lrAlg.run(train)

      val t2 = currentTime
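
      The report does not include it, but a rough size estimate helps put the out-of-memory failure in context. The sketch below is plain Scala (no Spark needed) and rests on assumptions that are not confirmed by the report: that the model stores one dense weight per feature for each of (numClasses - 1) linear predictors, with no intercept, and that L-BFGS keeps a history of 10 corrections at two vectors each. The object name and field names are illustrative only.

```scala
// Back-of-envelope estimate of dense weight-vector memory.
// All names are hypothetical; the class and feature counts come from the report.
object WeightMemoryEstimate {
  val numClasses = 100      // from setNumClasses(100)
  val numFeatures = 276544  // from the loadLibSVMFile calls
  val bytesPerDouble = 8L

  // Assumption: (numClasses - 1) linear predictors, one weight per feature, no intercept.
  val numWeights: Long = (numClasses - 1).toLong * numFeatures

  // Size of a single dense vector of weights (or of one gradient of the same shape).
  val bytesPerVector: Long = numWeights * bytesPerDouble

  // Assumption: L-BFGS history of 10 corrections, two dense vectors per correction.
  val numCorrections = 10
  val historyBytes: Long = 2L * numCorrections * bytesPerVector

  def main(args: Array[String]): Unit = {
    println(s"weights per vector:  $numWeights")
    println(s"bytes per vector:    $bytesPerVector")
    println(s"L-BFGS history size: $historyBytes bytes")
  }
}
```

      Under these assumptions a single dense vector is about 219 MB and the optimizer history alone is about 4.4 GB, before counting any per-task gradient buffers. That is far larger than the 0.5 GB input, so the dataset size by itself may not be a good guide to memory needs here.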

            People

            • Assignee: Unassigned
            • Reporter: John Canny (jcanny)