Description
LibSVMDataSource attaches special metadata to the features column to record numFeatures:
scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
19/12/24 18:40:09 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
data: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> data.schema("features").metadata
res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}
However, the ML implementations all obtain the vector size via AttributeGroup, which does not read this metadata, so the size is reported as unknown (-1):
scala> import org.apache.spark.ml.attribute._
import org.apache.spark.ml.attribute._

scala> AttributeGroup.fromStructField(data.schema("features")).size
res1: Int = -1
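Until the two kinds of metadata are reconciled, a caller could bridge the gap by hand. The following is only a sketch of one possible workaround, not the fix adopted in Spark: `vectorSize` and `withVectorSize` are hypothetical helper names, and the code assumes the "numFeatures" key is stored as a long, as the output above suggests.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructField

// Hypothetical helper (not part of Spark): resolve the size of a vector column,
// preferring ML attribute metadata and falling back to the "numFeatures" key
// written by the libsvm data source. Returns -1 when neither is available.
def vectorSize(field: StructField): Int = {
  val fromGroup = AttributeGroup.fromStructField(field).size
  if (fromGroup >= 0) fromGroup
  else if (field.metadata.contains("numFeatures")) field.metadata.getLong("numFeatures").toInt
  else -1
}

// Hypothetical helper (not part of Spark): re-alias the column so that it also
// carries AttributeGroup metadata, keeping the existing metadata keys.
def withVectorSize(df: DataFrame, colName: String): DataFrame = {
  val field = df.schema(colName)
  val alreadyKnown = AttributeGroup.fromStructField(field).size >= 0
  val size = vectorSize(field)
  if (alreadyKnown || size < 0) df // nothing to add, or nothing to add it from
  else {
    val meta = new AttributeGroup(colName, size).toMetadata(field.metadata)
    df.withColumn(colName, col(colName).as(colName, meta))
  }
}
```

In the session above, `AttributeGroup.fromStructField(withVectorSize(data, "features").schema("features")).size` should then report the size recorded in the libsvm metadata instead of -1.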