SPARK-15509

R MLlib algorithms should support input columns "features" and "label"

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: ML, SparkR
    • Labels: None

    Description

      Currently in SparkR, when you load a LibSVM dataset using the sqlContext and then pass it to an MLlib algorithm, the ML wrappers fail because they try to create a "features" column, which conflicts with the existing "features" column produced by the LibSVM loader. For example, using the "mnist" dataset from LibSVM:

      training <- loadDF(sqlContext, ".../mnist", "libsvm")
      model <- naiveBayes(label ~ features, training)
      

      This fails with:

      16/05/24 11:52:41 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
      Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
        java.lang.IllegalArgumentException: Output column features already exists.
      	at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
      	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
      	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
      	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
      	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
      	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
      	at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
      	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
      	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
      	at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
      	at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
      	at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
      

      The same issue appears for the "label" column once you rename the "features" column.
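
      One way a wrapper could sidestep this kind of collision is to pick an output column name that is not already present in the input schema. The base-R sketch below illustrates only that idea: `pick_free_name` is a hypothetical helper, not part of SparkR, and this is not necessarily how the issue was resolved.

```r
# Hypothetical helper (not part of SparkR): return `wanted` if no existing
# column uses that name, otherwise append a numeric suffix until it is free.
pick_free_name <- function(existing, wanted) {
  candidate <- wanted
  suffix <- 0
  while (candidate %in% existing) {
    suffix <- suffix + 1
    candidate <- paste0(wanted, "_", suffix)
  }
  candidate
}

# Column names as produced by the LibSVM loader in the report above.
input_cols <- c("label", "features")

# Both default names are taken, so each gets a suffixed replacement.
features_col <- pick_free_name(input_cols, "features")
label_col <- pick_free_name(input_cols, "label")
print(c(features_col, label_col))
```

      With the two LibSVM columns as input, neither default name is free, so the helper returns suffixed names for both; for a dataset without those columns it would return "features" and "label" unchanged.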

            People

              Assignee: Xin Ren (iamshrek)
              Reporter: Joseph K. Bradley (josephkb)