Spark / SPARK-30347

LibSVMDataSource attach AttributeGroup

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: ML
    • Labels: None

    Description

      LibSVMDataSource attaches special metadata to the features column to indicate numFeatures:

      scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
      19/12/24 18:40:09 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
      data: org.apache.spark.sql.DataFrame = [label: double, features: vector]

      scala> data.schema("features").metadata
      res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}
      

      However, the ML implementations all try to obtain the vector size via AttributeGroup, which cannot use this metadata, so the size is reported as unknown (-1):

      scala> import org.apache.spark.ml.attribute._
      import org.apache.spark.ml.attribute._

      scala> AttributeGroup.fromStructField(data.schema("features")).size
      res1: Int = -1
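
      One way to bridge the gap on the user side can be sketched as follows. This is only an illustration, not the change that resolved this issue: withAttributeGroup is a hypothetical helper (not part of Spark), and it assumes the source stores "numFeatures" as a long in the column metadata, as the output above suggests.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper (not a Spark API): copy the "numFeatures" value that the
// libsvm source put in the column metadata into an AttributeGroup on that column.
def withAttributeGroup(df: DataFrame, colName: String = "features"): DataFrame = {
  val field = df.schema(colName)
  if (!field.metadata.contains("numFeatures")) {
    df // nothing to convert, return the input unchanged
  } else {
    // Assumption: the value was stored as a long.
    val numFeatures = field.metadata.getLong("numFeatures").toInt
    val group = new AttributeGroup(colName, numFeatures)
    // toMetadata(existing) keeps the existing keys and adds the ML attribute entry.
    val merged = group.toMetadata(field.metadata)
    df.withColumn(colName, col(colName).as(colName, merged))
  }
}
```

      With such a helper, AttributeGroup.fromStructField(withAttributeGroup(data).schema("features")).size should report the feature count instead of -1, while the original numFeatures entry stays in place.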
       

       

       

            People

              Assignee: Ruifeng Zheng (podongfeng)
              Reporter: Ruifeng Zheng (podongfeng)