Description
In normal (non SparkR) R the formula syntax enables strings or factors to be turned into dummy variables immediately when calling a classifier. This way, the following R pattern is legal and often used:
library(magrittr) df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6)) glm(class ~ i, family = "binomial", data = df)
The glm method will know that `class` is a string/factor and handles it appropriately by casting it to a 0/1 array before applying any machine learning. SparkR doesn't do this.
> ddf <- sqlContext %>% createDataFrame(df) > glm(class ~ i, family = "binomial", data = ddf) Error in invokeJava(isStatic = TRUE, className, methodName, ...) : java.lang.IllegalArgumentException: Unsupported type for label: StringType at org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185) at org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150) at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146) at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42) at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43) at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134) at org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46) at org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.refl
This can be fixed by doing a bit of manual labor. SparkR does accept booleans as if they are integers here.
> ddf <- ddf %>% withColumn("to_pred", .$class == "a") > glm(to_pred ~ i, family = "binomial", data = ddf)
But this can become quite tedious, especially when you want to have models that are using multiple classes that need classification. This is perhaps less relevant for logistic regression (because it is a bit more like a one-off classification approach) but it certainly is relevant if you would want to use a formula for a randomforest and a column denotes, say, a type of flower from the iris dataset.
Is there a good reason why this should not be a feature of formulas in Spark? I am aware of issue 8774, which looks like it is adressing a similar theme but a different issue.
Attachments
Issue Links
- duplicates
-
SPARK-11349 Support transform string label for RFormula
- Resolved
- Is contained by
-
SPARK-15540 RFormula and R feature processing improvement umbrella
- Resolved
- relates to
-
SPARK-7159 Support multiclass logistic regression in spark.ml
- Resolved