Description
It would be convenient to implement a data source API for the LIBSVM format, to give it better integration with DataFrames and the ML pipeline API. For example:
import org.apache.spark.ml.source.libsvm._

val training = sqlContext.read
  .format("libsvm")
  .option("numFeatures", "10000")
  .load("path")
This JIRA covers the following:
1. Read LIBSVM data as a DataFrame with two columns: label: Double and features: Vector.
2. Accept `numFeatures` as an option.
3. The implementation should live under `org.apache.spark.ml.source.libsvm`.
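To make the requirements above concrete, the sketch below parses one LIBSVM record into the pieces the data source needs: a `Double` label plus the sparse indices/values that would back a `Vector` in the `features` column. The `LibSVMLineParser` helper is hypothetical and for illustration only; it is not part of the proposed `org.apache.spark.ml.source.libsvm` implementation.

```scala
// Hypothetical helper illustrating the on-disk LIBSVM format:
//   "<label> <index1>:<value1> <index2>:<value2> ..."
// Indices are 1-based in the file; a sparse Vector uses 0-based indices.
object LibSVMLineParser {
  def parse(line: String): (Double, Array[Int], Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val (indices, values) = tokens.tail.map { t =>
      val Array(i, v) = t.split(':')
      (i.toInt - 1, v.toDouble) // shift to 0-based for the sparse vector
    }.unzip
    (label, indices, values)
  }
}

// Example: one record with features at (1-based) positions 1 and 3
val (label, idx, vals) = LibSVMLineParser.parse("1.0 1:0.5 3:2.0")
// label == 1.0, idx == Array(0, 2), vals == Array(0.5, 2.0)
```

The `numFeatures` option matters because a single record does not reveal the full dimensionality; without it, the reader must scan the data to find the largest index.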
Issue Links
- is depended upon by SPARK-10518: Update code examples in spark.ml user guide to use LIBSVM data source instead of MLUtils (Resolved)
- is related to SPARK-10537: Document LIBSVM data source options in public doc and minor improvements (Resolved)
- relates to SPARK-11622: Make LibSVMRelation extends HadoopFsRelation and Add LibSVMOutputWriter (Resolved)
- links to