Spark / SPARK-21643

LR dataset that worked in Spark 1.6.3 and 2.0.2 stopped working in 2.1.0 onward


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.1.0, 2.1.1, 2.2.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None
    • Environment: CentOS 7 VM with 256 GB of memory and 52 CPUs

    Description

      This dataset converges on Spark 1.6.x and 2.0.x, but it does not converge on 2.1+.

      a) Download the dataset (https://s3.amazonaws.com/manage-partners/pipeline/di873-train.json.gz) and uncompress it; I placed it at /tmp/di873-train.json
      b) Download the Spark package to /usr/lib/spark/spark-*
      c) cd sbin
      d) Run start-master.sh
      e) Run start-slave.sh <master-url>
      f) cd ../bin
      g) Run spark-shell <master-url>
      h) Paste in the following Scala code:

      import org.apache.spark.sql.types._
      val VT = org.apache.spark.ml.linalg.SQLDataTypes.VectorType
      val schema = StructType(Array(
        StructField("features", VT, true),
        StructField("label", DoubleType, true)))

      val df = spark.read.schema(schema).json("file:///tmp/di873-train.json")
      val trainer = new org.apache.spark.ml.classification.LogisticRegression()
        .setMaxIter(500)
        .setElasticNetParam(1.0)
        .setRegParam(0.00001)
        .setTol(0.00001)
        .setFitIntercept(true)
      val model = trainer.fit(df)

      i) Then I monitored progress in the Spark UI under the Jobs tab.
      With Spark 1.6.1 and 2.0.2, the training (the treeAggregate jobs) finished after around 25-30 jobs. But with 2.1+, the training did not converge and finished only because it hit the maximum number of iterations (i.e. 500).
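      Beyond counting jobs in the UI, convergence can be checked programmatically from the fitted model's training summary (a sketch using the standard spark.ml summary API; run it in the same spark-shell session after `trainer.fit(df)` completes):

      ```scala
      // objectiveHistory gives the loss value at each iteration;
      // totalIterations tells you whether the solver stopped early
      // (met the tolerance) or ran all the way to the setMaxIter(500) cap.
      val summary = model.summary
      println(s"iterations run: ${summary.totalIterations}")
      if (summary.totalIterations >= 500)
        println("Solver hit maxIter without meeting the tolerance: not converged.")
      // Print the objective trajectory to see whether it flattens out.
      summary.objectiveHistory.zipWithIndex.foreach { case (loss, i) =>
        println(f"iter $i%3d  objective $loss%.8f")
      }
      ```

      A converged run should stop well short of the cap, so comparing `totalIterations` across Spark versions makes the regression visible without watching the Jobs tab.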

      Attachments

      Activity

      People

        Assignee: Unassigned
        Reporter: Thomas Kwan (thomaskwan)
        Votes: 0
        Watchers: 1

      Dates

        Created:
        Updated:
        Resolved: