Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-10523

SparkR formula syntax to turn strings/factors into numerics



    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • None
    • None
    • ML, SparkR
    • None


      In normal (non SparkR) R the formula syntax enables strings or factors to be turned into dummy variables immediately when calling a classifier. This way, the following R pattern is legal and often used:

      df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
      glm(class ~ i, family = "binomial", data = df)

      The glm method will know that `class` is a string/factor and handles it appropriately by casting it to a 0/1 array before applying any machine learning. SparkR doesn't do this.

      > ddf <- sqlContext %>% 
      > glm(class ~ i, family = "binomial", data = ddf)
      Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
        java.lang.IllegalArgumentException: Unsupported type for label: StringType
      	at org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
      	at org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
      	at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
      	at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      	at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
      	at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
      	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
      	at org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
      	at org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.refl

      This can be fixed by doing a bit of manual labor. SparkR does accept booleans as if they are integers here.

      > ddf <- ddf %>% 
        withColumn("to_pred", .$class == "a") 
      > glm(to_pred ~ i, family = "binomial", data = ddf)

      But this can become quite tedious, especially when you want to have models that are using multiple classes that need classification. This is perhaps less relevant for logistic regression (because it is a bit more like a one-off classification approach) but it certainly is relevant if you would want to use a formula for a randomforest and a column denotes, say, a type of flower from the iris dataset.

      Is there a good reason why this should not be a feature of formulas in Spark? I am aware of issue 8774, which looks like it is adressing a similar theme but a different issue.


        Issue Links



              Unassigned Unassigned
              cantdutchthis Vincent Warmerdam
              Shivaram Venkataraman Shivaram Venkataraman
              0 Vote for this issue
              6 Start watching this issue