Spark / SPARK-10388

Public dataset loader interface


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      It would be very useful to have a public dataset loader that fetches ML datasets from popular repositories, e.g., libsvm and UCI. This JIRA is for discussing the design, requirements, and initial implementation. A possible usage:

      val loader = new DatasetLoader(sqlContext)
      val df = loader.get("libsvm", "rcv1_train.binary")
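One piece that `loader.get` would probably need is a download-and-cache step, so that a dataset is fetched only once. The sketch below is a minimal illustration of that step under stated assumptions: `DatasetCache` and `fetch` are hypothetical names, not part of Spark or of this proposal, and the cache layout (one file per dataset, named after the last URI path segment) is just one possible choice.

```scala
import java.net.URI
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper, not part of Spark: downloads a dataset once and
// reuses the local copy on later calls.
object DatasetCache {
  def fetch(source: URI, cacheDir: Path): Path = {
    Files.createDirectories(cacheDir)
    // Use the last path segment of the URI as the local file name.
    val fileName = source.getPath.split('/').last
    val target = cacheDir.resolve(fileName)
    if (!Files.exists(target)) {
      val in = source.toURL.openStream()
      try {
        // Write to a temp file first so that a failed transfer never
        // leaves a partial file under the final name.
        val tmp = Files.createTempFile(cacheDir, fileName, ".part")
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
      } finally in.close()
    }
    target
  }
}
```

For libsvm-format files, the returned path could then be handed to Spark's existing libsvm data source, e.g. `sqlContext.read.format("libsvm").load(path.toString)`.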
      

      Users should be able to list (or preview) the available datasets, e.g.:

      val datasets = loader.ls("libsvm") // returns a local DataFrame
      datasets.show() // list all datasets under libsvm repo
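To make the listing idea concrete, here is a rough sketch of what the rows behind `ls` might look like. Everything in it is an assumption for illustration: the `DatasetInfo` fields, the `Listing` object, and the catalog entries are placeholders rather than real repo contents or a proposed schema.

```scala
// Hypothetical row type for a listing; field names are illustrative only.
final case class DatasetInfo(repo: String, name: String, sizeBytes: Long)

object Listing {
  // Placeholder catalog; a real loader would fetch this from the repo index.
  private val catalog = Seq(
    DatasetInfo("example", "toy_train.binary", 1024L),
    DatasetInfo("example", "toy_test.binary", 512L),
    DatasetInfo("other", "sample.csv", 2048L)
  )

  // Returns the entries of one repo, sorted by name for stable output.
  def ls(repo: String): Seq[DatasetInfo] =
    catalog.filter(_.repo == repo).sortBy(_.name)
}
```

With a `SQLContext` available, `sqlContext.createDataFrame(Listing.ls("example"))` would turn such a sequence into the local DataFrame mentioned above.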
      

      It would be nice to allow third-party packages to register new repos. Both the API and the implementation are pending discussion. Note that fetching datasets requires HTTP and HTTPS support.
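Since the registration API is still open, the following is only one possible shape for it, sketched to help the discussion. `DatasetRepo`, `DatasetRepos`, `register`, and `resolve` are hypothetical names; the sketch assumes a repo is identified by a unique name and that every dataset resolves to an http or https URL.

```scala
import java.net.URI
import scala.collection.concurrent.TrieMap

// Hypothetical extension point; nothing here is an existing Spark API.
trait DatasetRepo {
  def name: String
  // Maps a dataset name to the URL it can be downloaded from.
  def resolve(dataset: String): URI
}

object DatasetRepos {
  private val repos = TrieMap.empty[String, DatasetRepo]
  private val allowedSchemes = Set("http", "https")

  // Third-party packages would call this once, e.g. at initialization.
  def register(repo: DatasetRepo): Unit = {
    val previous = repos.putIfAbsent(repo.name, repo)
    require(previous.isEmpty, s"repo '${repo.name}' is already registered")
  }

  // Looks up the repo and rejects any URL whose scheme is not http or https.
  def resolve(repo: String, dataset: String): URI = {
    val found = repos.getOrElse(repo, throw new NoSuchElementException(s"unknown repo: $repo"))
    val uri = found.resolve(dataset)
    require(
      uri.getScheme != null && allowedSchemes(uri.getScheme.toLowerCase),
      s"unsupported scheme in $uri")
    uri
  }
}
```

A package could then register its own source with something like `DatasetRepos.register(new DatasetRepo { val name = "example"; def resolve(d: String) = URI.create("https://example.org/datasets/" + d) })`, where the URL is a placeholder. Rejecting duplicate names keeps one package from silently replacing another's repo; whether that is the right policy is part of what needs discussion.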

          People

            Assignee: Unassigned
            Reporter: Xiangrui Meng (mengxr)
            Votes: 1
            Watchers: 16
