Spark / SPARK-21643

LR dataset that worked in Spark 1.6.3 and 2.0.2 stopped working in 2.1.0 onward


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.1.0, 2.1.1, 2.2.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None
    • Environment: CentOS 7 VM with 256 GB of memory and 52 CPUs

    Description

      This dataset converges on Spark 1.6.x and 2.0.x, but it does not converge on 2.1+.

      a) Download the dataset (https://s3.amazonaws.com/manage-partners/pipeline/di873-train.json.gz) and uncompress it; I placed it at /tmp/di873-train.json
      b) Download the Spark package to /usr/lib/spark/spark-*
      c) cd sbin
      d) Run start-master.sh
      e) Run start-slave.sh <master-url>
      f) cd ../bin
      g) Run spark-shell <master-url>
      h) Paste in the following Scala code:

      import org.apache.spark.sql.types._
      val VT = org.apache.spark.ml.linalg.SQLDataTypes.VectorType
      val schema = StructType(Array(
        StructField("features", VT, true),
        StructField("label", DoubleType, true)))

      val df = spark.read.schema(schema).json("file:///tmp/di873-train.json")
      val trainer = new org.apache.spark.ml.classification.LogisticRegression()
        .setMaxIter(500)
        .setElasticNetParam(1.0)
        .setRegParam(0.00001)
        .setTol(0.00001)
        .setFitIntercept(true)
      val model = trainer.fit(df)

      i) Then I monitored progress in the Spark UI under the Jobs tab.
      With Spark 1.6.1 and 2.0.2, the training (the treeAggregate jobs) finished after around 25-30 jobs. But with 2.1+, the training did not converge and finished only because it hit the maximum number of iterations (i.e. 500).
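      Beyond counting jobs in the UI, convergence can be checked programmatically from the fitted model's training summary (a sketch using the standard spark.ml summary API; run it in the same spark-shell session after `trainer.fit(df)` completes):

      ```scala
      // objectiveHistory gives the loss value at each iteration;
      // totalIterations tells you whether the solver stopped early
      // (met the tolerance) or ran all the way to the setMaxIter(500) cap.
      val summary = model.summary
      println(s"iterations run: ${summary.totalIterations}")
      if (summary.totalIterations >= 500)
        println("Solver hit maxIter without meeting the tolerance: not converged.")
      // Print the objective trajectory to see whether it flattens out.
      summary.objectiveHistory.zipWithIndex.foreach { case (loss, i) =>
        println(f"iter $i%3d  objective $loss%.8f")
      }
      ```

      A converged run should stop well short of the cap, so comparing `totalIterations` across Spark versions makes the regression visible without watching the Jobs tab.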

      Attachments

      Activity

      People

        Assignee: Unassigned
        Reporter: Thomas Kwan (thomaskwan)
        Votes: 0
        Watchers: 1

      Dates

        Created:
        Updated:
        Resolved: