[SPARK-10523] SparkR formula syntax to turn strings/factors into numerics - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: ML, SparkR
Labels: None

Description

In normal (non-SparkR) R, the formula syntax lets strings or factors be turned into dummy variables immediately when calling a classifier. As a result, the following R pattern is legal and often used:

library(magrittr) 
df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
glm(class ~ i, family = "binomial", data = df)

The glm method knows that `class` is a string/factor and handles it appropriately by casting it to a 0/1 array before applying any machine learning. SparkR doesn't do this.

> ddf <- sqlContext %>% 
  createDataFrame(df)
> glm(class ~ i, family = "binomial", data = ddf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Unsupported type for label: StringType
	at org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
	at org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
	at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
	at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
	at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
	at org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
	at org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.refl
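
For comparison, the conversion that base R performs on a factor response can be sketched as follows. This is only an illustration in plain R, not SparkR, and not how RFormula is implemented; the toy data and the `label` column are assumptions made up for this example (the classes are interleaved so the fit is not perfectly separated).

```r
# Toy data (hypothetical values, chosen so the two classes overlap in `i`)
df <- data.frame(
  class = factor(c("a", "b", "a", "b", "a", "b")),
  i     = c(1, 2, 3, 4, 5, 6)
)

# With a binomial family, base R treats the first factor level as
# failure (0) and every other level as success (1).
df$label <- as.integer(df$class != levels(df$class)[1])

# Fitting on the factor and on the explicit 0/1 label should give
# the same coefficients.
fit_factor  <- glm(class ~ i, family = "binomial", data = df)
fit_numeric <- glm(label ~ i, family = "binomial", data = df)
same_coefs  <- isTRUE(all.equal(coef(fit_factor), coef(fit_numeric)))
```

The request in this issue amounts to SparkR doing the equivalent of the `df$label` step on its own when the label column is a string.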

This can be worked around with a bit of manual labor, because SparkR does accept booleans here and treats them as if they were integers.

> ddf <- ddf %>% 
  withColumn("to_pred", .$class == "a") 
> glm(to_pred ~ i, family = "binomial", data = ddf)

But this can become quite tedious, especially when you want models with multiple classes that need classification. This is perhaps less relevant for logistic regression (because it is more of a one-off classification approach), but it certainly is relevant if you want to use a formula for a random forest and a column denotes, say, a type of flower from the iris dataset.
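
To make the multi-class case concrete, here is a minimal base R sketch of the two conversions involved, using the built-in iris data. It runs in plain R rather than SparkR, and the names `species_index` and `dummies` are made up for this illustration; it shows the behaviour being asked for, not an existing SparkR API.

```r
# iris ships with base R; Species is a factor with three levels.

# As a label: map each level to a zero-based numeric index.
species_index <- as.integer(iris$Species) - 1L

# As a predictor: expand the factor into dummy (indicator) columns.
# The first level is absorbed into the intercept.
dummies <- model.matrix(~ Species, data = iris)
colnames(dummies)
```

Doing this by hand for every string column is the tedium described above; the formula interface in base R does both steps implicitly.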

Is there a good reason why this should not be a feature of formulas in Spark? I am aware of issue 8774, which looks like it is addressing a similar theme but a different issue.

Issue Links

duplicates: SPARK-11349 Support transform string label for RFormula (Resolved)
is contained by: SPARK-15540 RFormula and R feature processing improvement umbrella (Resolved)
relates to: SPARK-7159 Support multiclass logistic regression in spark.ml (Resolved)

People

Assignee: Unassigned
Reporter: Vincent Warmerdam
Shepherd: Shivaram Venkataraman
Votes: 0
Watchers: 6

Dates

Created: 09/Sep/15 23:13
Updated: 21/Dec/16 21:53
Resolved: 07/Nov/16 10:24