Spark / SPARK-15509

R MLlib algorithms should support input columns "features" and "label"


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: ML, SparkR
    • Labels: None

    Description

      Currently in SparkR, when you load a LibSVM dataset using the sqlContext and then pass it to an MLlib algorithm, the ML wrappers fail because they try to create a "features" column, which conflicts with the existing "features" column produced by the LibSVM loader. For example, using the "mnist" dataset from LibSVM:

      training <- loadDF(sqlContext, ".../mnist", "libsvm")
      model <- naiveBayes(label ~ features, training)
      

      This fails with:

      16/05/24 11:52:41 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
      Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
        java.lang.IllegalArgumentException: Output column features already exists.
      	at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
      	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
      	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
      	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
      	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
      	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
      	at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
      	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
      	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
      	at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
      	at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
      	at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
      

      The same issue appears for the "label" column once you rename the "features" column.
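
      The collision described above can be illustrated with a simplified model in plain Python. This is only a sketch of the kind of schema check the stack trace points to, not Spark's actual code: the helper names (check_output_column, try_stage) and the renamed column "mnist_features" are hypothetical, and the real error text for the "label" case may differ.

      ```python
      def check_output_column(existing_columns, output_column):
          """Hypothetical stand-in for a schema check that rejects an
          output column name already present in the input schema."""
          if output_column in existing_columns:
              raise ValueError(f"Output column {output_column} already exists.")
          return list(existing_columns) + [output_column]


      def try_stage(existing_columns, output_column):
          """Return the new column list, or the error message on a collision."""
          try:
              return check_output_column(existing_columns, output_column)
          except ValueError as err:
              return str(err)


      # Columns as produced by a LibSVM-style loader, per the description.
      libsvm_columns = ["label", "features"]

      # A stage that wants to write "features" collides immediately.
      features_collision = try_stage(libsvm_columns, "features")

      # Renaming the input column (hypothetical name) removes that collision...
      renamed_columns = ["label", "mnist_features"]
      features_ok = try_stage(renamed_columns, "features")

      # ...but a stage that wants to write "label" still collides.
      label_collision = try_stage(renamed_columns, "label")

      print(features_collision)
      print(features_ok)
      print(label_collision)
      ```

      In this model, renaming the input columns only moves the failure from one reserved name to the other, which matches the behaviour reported here.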

          People

            Assignee: Xin Ren (iamshrek)
            Reporter: Joseph K. Bradley (josephkb)