Spark / SPARK-15153

SparkR spark.naiveBayes throws error when label is numeric type

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: ML, SparkR
    • Labels: None

    Description

      When the label column of a dataset is of numeric type, SparkR's spark.naiveBayes throws an error. The bug is easy to reproduce:

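      # Assumes sqlContext is already defined (for example, in a SparkR shell).
      t <- as.data.frame(Titanic)
      t1 <- t[t$Freq > 0, -5]  # keep observed combinations, drop the Freq column
      t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)  # numeric 0/1 label
      t2 <- t1[-4]  # drop the original string Survived column
      df <- suppressWarnings(createDataFrame(sqlContext, t2))
      m <- spark.naiveBayes(df, NumericSurvived ~ .)  # fails with the error below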
      
      16/05/05 03:26:17 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
      Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
        java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute
      	at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
      	at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
      	at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
      	at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
      	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
      	at io.netty.channel.AbstractChannelHandlerContext.invo
      

      In RFormula, the response variable can be either string or numeric. If it is a string, RFormula transforms it into a label of DoubleType with StringIndexer and sets the corresponding column metadata; otherwise, RFormula uses it directly as the label when training the model (and assumes that it is indexed 0, ..., maxLabelIndex). In the numeric case no nominal metadata is attached to the label column, which is why the cast to NominalAttribute at NaiveBayesWrapper.scala:66 fails.
      When we extract labels in ml.r.NaiveBayesWrapper, we should handle them according to the type of the response variable (string or numeric).
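The two cases described above can be modeled in a few lines of plain Python. This is only an illustrative sketch, not the Spark implementation or the actual fix: `extract_labels` is a hypothetical helper, the frequency-based ordering (ties broken alphabetically) is a simplified stand-in for what a string indexer does, and the numeric branch simply assumes the values are already indices.

```python
from collections import Counter


def extract_labels(response):
    """Return label names for a response column (hypothetical helper).

    String responses are indexed first, so the label names come from the
    indexer. Numeric responses carry no such mapping, so the names are
    derived from the distinct values themselves.
    """
    if all(isinstance(v, str) for v in response):
        # String case: order labels by descending frequency, ties broken
        # alphabetically (an assumption made here for determinism).
        counts = Counter(response)
        return sorted(counts, key=lambda v: (-counts[v], v))
    # Numeric case: values are assumed to already be 0, ..., maxLabelIndex,
    # so there is no nominal metadata to read; use the sorted distinct values.
    return [str(v) for v in sorted(set(response))]


print(extract_labels(["No", "Yes", "No"]))
print(extract_labels([0, 1, 1, 0]))
```

The point of the sketch is the branch itself: code that unconditionally expects an indexer-produced mapping has nothing to read when the response is numeric, which corresponds to the failing cast in the stack trace.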

      cc mengxr josephkb

            People

              Assignee: Yanbo Liang (yanboliang)
              Reporter: Yanbo Liang (yanboliang)
              Shepherd: Xiangrui Meng
              Votes: 0
              Watchers: 3