Description
When the label column of the dataset is numeric, SparkR's spark.naiveBayes throws an error. The bug is easy to reproduce:
t <- as.data.frame(Titanic)
t1 <- t[t$Freq > 0, -5]
t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
t2 <- t1[-4]
df <- suppressWarnings(createDataFrame(sqlContext, t2))
m <- spark.naiveBayes(df, NumericSurvived ~ .)

16/05/05 03:26:17 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute
    at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
    at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
    at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
    at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invo
In RFormula, the response variable can be either a string or a numeric type. If it is a string, RFormula transforms it into a label of DoubleType with StringIndexer and sets the corresponding column metadata; otherwise, RFormula uses it directly as the label when training the model (and assumes that it is numbered 0, ..., maxLabelIndex).
When we extract labels in ml.r.NaiveBayesWrapper, we should handle them according to the type of the response variable (string or numeric).
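As a rough illustration of that branching, here is a minimal Spark-free sketch in Scala. It is not the code in NaiveBayesWrapper and not the fix that was merged: the names LabelAttr, Nominal, Unresolved and extractLabels are hypothetical stand-ins for the attribute metadata, and falling back to the sorted distinct label values for a numeric response is an assumption made for the example.

```scala
// Hypothetical model of the label handling described above.
// None of these names exist in Spark; they only illustrate the two cases.
sealed trait LabelAttr

// String response: StringIndexer has attached the original label values.
final case class Nominal(values: Seq[String]) extends LabelAttr

// Numeric response: the label column carries no nominal metadata.
case object Unresolved extends LabelAttr

object LabelExtraction {
  // Returns the class labels to report back to R.
  def extractLabels(attr: LabelAttr, observed: Seq[Double]): Seq[String] =
    attr match {
      // Metadata is present, so use the indexed label values as they are.
      case Nominal(values) => values
      // No metadata (assumed fallback): use the distinct observed label
      // values in ascending order instead of casting to a nominal attribute.
      case Unresolved => observed.distinct.sortWith(_ < _).map(_.toString)
    }
}
```

Matching on the attribute type instead of casting unconditionally is what avoids the ClassCastException shown in the trace; how the labels for a numeric response should actually be derived is a design decision for the wrapper.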
Issue Links
- depends upon: SPARK-15957 RFormula supports forcing to index label (Resolved)
- is contained by: SPARK-15540 RFormula and R feature processing improvement umbrella (Resolved)
- is duplicated by: SPARK-15510 SparkR NaiveBayes should not require label to have NominalAttribute (Closed)
- relates to: SPARK-11107 spark.ml should support more input column types: umbrella (Resolved)