[SPARK-15509] R MLlib algorithms should support input columns "features" and "label" - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.1.0
Component/s: ML, SparkR
Labels: None
Target Version/s: 2.1.0

Description

Currently in SparkR, if you load a LibSVM dataset using the sqlContext and pass it to an MLlib algorithm, the ML wrapper fails: it tries to create a "features" column, which conflicts with the "features" column already produced by the LibSVM loader. For example, using the "mnist" dataset from LibSVM:

training <- loadDF(sqlContext, ".../mnist", "libsvm")
model <- naiveBayes(label ~ features, training)

This fails with:

16/05/24 11:52:41 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
  java.lang.IllegalArgumentException: Output column features already exists.
	at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
	at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
	at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
	at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
	at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca

The same issue appears for the "label" column once you rename the "features" column.
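
The collision described above can be illustrated without Spark. The following is a minimal Python sketch, not Spark's implementation: `check_output_column` and `pick_free_column` are hypothetical helpers introduced only for illustration. The first mimics a schema check that rejects an output column name already present in the input; the second shows one possible way to sidestep the conflict by choosing an unused name. Whether the actual fix takes this approach is not stated in this issue.

```python
import uuid


def check_output_column(existing_columns, output_col):
    """Reject an output column whose name is already present in the input schema."""
    if output_col in existing_columns:
        raise ValueError(f"Output column {output_col} already exists.")
    return list(existing_columns) + [output_col]


def pick_free_column(existing_columns, preferred):
    """Return `preferred` if it is unused, otherwise a suffixed name that does not collide."""
    name = preferred
    while name in existing_columns:
        # Assumption: a random suffix is an acceptable way to make the name unique.
        name = f"{preferred}_{uuid.uuid4().hex[:12]}"
    return name


# Column names as produced by the LibSVM loader, per the description above.
libsvm_columns = ["label", "features"]

try:
    # Asking for an output column named "features" reproduces the conflict.
    check_output_column(libsvm_columns, "features")
except ValueError as err:
    print(err)

# Choosing a non-conflicting name lets the schema check pass.
safe_name = pick_free_column(libsvm_columns, "features")
print(check_output_column(libsvm_columns, safe_name))
```

The same reasoning applies to the "label" column: any fixed output name can collide with an input column, so the name either has to be configurable or be chosen after inspecting the input schema.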

Issue Links

Is contained by: SPARK-15540 RFormula and R feature processing improvement umbrella (Resolved)

links to:
[Github] Pull Request #13584 (keypointt)
[Github] Pull Request #14993 (yanboliang)

People

Assignee: Xin Ren
Reporter: Joseph K. Bradley
Votes: 0
Watchers: 5

Dates

Created: 24/May/16 18:58
Updated: 09/Sep/16 03:13
Resolved: 02/Sep/16 08:55