SPARK-15153: SparkR spark.naiveBayes throws error when label is numeric type

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: ML, SparkR
    • Labels: None
    • Target Version/s:

      Description

      When the label column of a dataset is numeric, SparkR's spark.naiveBayes throws an error. The bug is easy to reproduce:

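      # Build a local data frame from R's built-in Titanic dataset
      t <- as.data.frame(Titanic)
      # Keep rows with non-zero counts and drop the Freq column (column 5)
      t1 <- t[t$Freq > 0, -5]
      # Encode the string label as a numeric 0/1 column
      t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
      # Drop the original string Survived column (column 4)
      t2 <- t1[-4]
      df <- suppressWarnings(createDataFrame(sqlContext, t2))
      # Fitting with the numeric label triggers the error below
      m <- spark.naiveBayes(df, NumericSurvived ~ .)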
      
      16/05/05 03:26:17 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
      Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
        java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute
      	at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
      	at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
      	at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
      	at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
      	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
      	at io.netty.channel.AbstractChannelHandlerContext.invo
      

      In RFormula, the response variable can be either string or numeric. If it is a string, RFormula uses StringIndexer to transform it into a label column of DoubleType and sets the corresponding column metadata. Otherwise, RFormula uses the column directly as the label when training the model, assuming its values are numbered 0, ..., maxLabelIndex, and the nominal metadata is not set, which is why the cast to NominalAttribute fails.
      When extracting labels in ml.r.NaiveBayesWrapper, we should therefore handle both cases according to the type of the response variable (string or numeric).
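      The two cases can be illustrated with a minimal sketch in plain Python. It is only a model of the behavior described above, not Spark code and not the change that resolved this issue. The function names (index_string_labels, prepare_label, extract_labels), the dictionary used as metadata, and the alphabetical tie-break are all assumptions made for illustration.

```python
from collections import Counter


def index_string_labels(values):
    """Model of string indexing: the most frequent label gets index 0.0."""
    counts = Counter(values)
    # Descending frequency; ties broken alphabetically (an assumption of this
    # sketch) so the ordering is deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {v: float(i) for i, v in enumerate(ordered)}
    return [mapping[v] for v in values], ordered


def prepare_label(values):
    """Return (label column, metadata); metadata is None for numeric input."""
    if all(isinstance(v, str) for v in values):
        indexed, ordered = index_string_labels(values)
        return indexed, {"type": "nominal", "values": ordered}
    # Numeric response: used directly as the label, no nominal metadata.
    return [float(v) for v in values], None


def extract_labels(label_column, metadata):
    """Recover label names in both cases instead of assuming nominal metadata."""
    if metadata is not None and metadata.get("type") == "nominal":
        return list(metadata["values"])
    # Fallback for numeric responses: distinct values in ascending order.
    return [str(v) for v in sorted(set(label_column))]


col, meta = prepare_label(["No", "Yes", "No"])
print(extract_labels(col, meta))  # ['No', 'Yes']

col, meta = prepare_label([0, 1, 0])
print(extract_labels(col, meta))  # ['0.0', '1.0']
```

      In this model, the reported failure corresponds to reading the nominal metadata unconditionally in extract_labels, which has nothing to read when the response is numeric.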

      cc Xiangrui Meng, Joseph K. Bradley

              People

              • Assignee:
                yanboliang Yanbo Liang
                Reporter:
                yanboliang Yanbo Liang
                Shepherd:
                Xiangrui Meng
