Details
- Type: Sub-task
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version/s: 1.5.2
- Fix Version/s: None
Description
When we write code examples, we put the generation of the data along with the example itself. We typically have either:
val data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") ...
or some more esoteric stuff such as:
val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", "feature")
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
I suggest we follow the example of sklearn and standardize all the generation of example data inside a few methods, for example in org.apache.spark.ml.examples.ExampleData (a rough sketch follows below). One reason is that just reading the code is sometimes not enough to figure out what the data is supposed to be. For example, when using the libsvm data, it is unclear what the DataFrame columns are. This is something we should document somewhere.
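A sketch of what such a helper could look like (the object name, method names, and package are hypothetical here, just to illustrate the idea; the data-loading calls themselves are the ones from the examples above):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper that centralizes example-data generation, so each
// example does not repeat it and the schema is documented in one place.
object ExampleData {

  // The resulting DataFrame has two columns: "label" (Double) and "features" (Vector).
  def sampleLibsvmData(sqlContext: SQLContext): DataFrame =
    sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

  // A tiny hand-written dataset with a single "features" (Vector) column.
  def sampleVectorData(sqlContext: SQLContext): DataFrame = {
    val data = Array(
      Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
    )
    // Wrap each Vector in a Tuple1 so createDataFrame sees one-column rows.
    sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
  }
}

This way, the columns of each example DataFrame are documented right next to the method that produces it.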
It would also help to explain in one place all the Scala idiosyncrasies, such as the use of Tuple1.apply (see the snippet below).
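For reference, the reason for Tuple1.apply is that createDataFrame expects a sequence of Products (tuples or case classes), so a bare Array[Vector] cannot be converted directly; wrapping each element in a 1-tuple turns it into a single-column row. A minimal sketch, assuming an existing sqlContext:

import org.apache.spark.mllib.linalg.Vectors

val vectors = Array(
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)

// Array[Vector] is not a Seq of Products, so Tuple1.apply wraps each Vector
// into a one-element tuple, i.e. a one-column row named "features".
val df = sqlContext.createDataFrame(vectors.map(Tuple1.apply)).toDF("features")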
Issue Links
- relates to: SPARK-10383 Sync example code between API doc and user guide (Resolved)