[SPARK-30347] LibSVMDataSource attach AttributeGroup - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 3.0.0
Component/s: ML
Labels: None

Description

LibSVMDataSource attaches special metadata to the features column to indicate numFeatures.

scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
19/12/24 18:40:09 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
data: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> data.schema("features").metadata
res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}

However, the ML implementations all try to obtain the vector size via AttributeGroup, which does not read this metadata, so the size is reported as unknown (-1):

scala> import org.apache.spark.ml.attribute._
import org.apache.spark.ml.attribute._

scala> AttributeGroup.fromStructField(data.schema("features")).size
res1: Int = -1
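
For illustration, the gap can be bridged by hand by copying numFeatures into an AttributeGroup on the same column. This is a minimal sketch, not necessarily how the linked pull request resolves the issue: withAttributeGroup is a hypothetical helper name, and the sketch assumes the column carries the numFeatures metadata key shown in the transcript.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper (not part of Spark): copies the "numFeatures" metadata
// written by the libsvm reader into an AttributeGroup on the same column.
def withAttributeGroup(df: DataFrame, colName: String = "features"): DataFrame = {
  val existing = df.schema(colName).metadata
  if (!existing.contains("numFeatures")) {
    df // nothing to copy; return the DataFrame unchanged
  } else {
    val numFeatures = existing.getLong("numFeatures").toInt
    // AttributeGroup(name, numAttributes) records only the vector size.
    val group = new AttributeGroup(colName, numFeatures)
    // toMetadata(existing) keeps the original keys and adds the ML attribute entry.
    df.withColumn(colName, col(colName).as(colName, group.toMetadata(existing)))
  }
}
```

In the shell session above, `AttributeGroup.fromStructField(withAttributeGroup(data).schema("features")).size` would then be expected to match numFeatures instead of returning -1.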

Issue Links

links to GitHub Pull Request #27003

People

Assignee: Ruifeng Zheng

Reporter: Ruifeng Zheng

Votes: 0

Watchers: 1

Dates

Created: 24/Dec/19 10:43

Updated: 26/Dec/19 02:04

Resolved: 26/Dec/19 02:03