Spark / SPARK-15689

Data source API v2


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL

    Description

      This ticket tracks progress on creating v2 of the data source API. The new API should:

      1. Have a small surface so it is easy to freeze and maintain compatibility for a long time. Ideally, this API should survive architectural rewrites and user-facing API revamps of Spark.

      2. Have a well-defined column batch interface for high performance. Convenience methods should exist to convert row-oriented formats into column batches for data source developers.

      3. Still support filter push down, similar to the existing API.

      4. Nice-to-have: support additional common operators, including limit and sampling.
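      To make goals 1 to 3 concrete, here is a minimal sketch of what a small reader surface with filter push down and a column batch result could look like. It is not the API this ticket produced: every name in it (PushDownSketch, Reader, pushFilters, readBatch, ColumnBatch, GreaterThan, InMemoryReader) is hypothetical, and it assumes integer-only columns and a single "greater than" filter type to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: every name here is invented for illustration, not taken from Spark.
public class PushDownSketch {

    /** A simple "column > value" predicate over integer columns. */
    public static final class GreaterThan {
        public final String column;
        public final int value;
        public GreaterThan(String column, int value) {
            this.column = column;
            this.value = value;
        }
    }

    /** A column batch: one primitive array per column instead of one object per row. */
    public static final class ColumnBatch {
        public final String[] names;
        public final int[][] columns; // columns[c][r] is row r of column c
        public ColumnBatch(String[] names, int[][] columns) {
            this.names = names;
            this.columns = columns;
        }
        public int numRows() {
            return columns.length == 0 ? 0 : columns[0].length;
        }
    }

    /** The entire reader surface in this sketch: two methods, no DataFrame or SQLContext. */
    public interface Reader {
        /** Offers filters to the source and returns those it cannot evaluate itself. */
        List<GreaterThan> pushFilters(List<GreaterThan> filters);

        /** Returns the rows that pass every accepted filter, as one column batch. */
        ColumnBatch readBatch();
    }

    /** Toy in-memory source that only accepts filters on columns marked as pushable. */
    public static final class InMemoryReader implements Reader {
        private final ColumnBatch data;
        private final Set<String> pushable;
        private final List<GreaterThan> accepted = new ArrayList<>();

        public InMemoryReader(ColumnBatch data, Set<String> pushable) {
            this.data = data;
            this.pushable = pushable;
        }

        private int indexOf(String column) {
            for (int c = 0; c < data.names.length; c++) {
                if (data.names[c].equals(column)) return c;
            }
            return -1;
        }

        @Override
        public List<GreaterThan> pushFilters(List<GreaterThan> filters) {
            List<GreaterThan> residual = new ArrayList<>();
            for (GreaterThan f : filters) {
                if (pushable.contains(f.column) && indexOf(f.column) >= 0) {
                    accepted.add(f);
                } else {
                    residual.add(f);
                }
            }
            return residual; // the engine must still apply these after the scan
        }

        @Override
        public ColumnBatch readBatch() {
            // First pass: collect the indices of rows that pass all accepted filters.
            List<Integer> keep = new ArrayList<>();
            for (int r = 0; r < data.numRows(); r++) {
                boolean pass = true;
                for (GreaterThan f : accepted) {
                    if (data.columns[indexOf(f.column)][r] <= f.value) {
                        pass = false;
                        break;
                    }
                }
                if (pass) keep.add(r);
            }
            // Second pass: copy the surviving rows column by column.
            int[][] out = new int[data.columns.length][keep.size()];
            for (int c = 0; c < out.length; c++) {
                for (int i = 0; i < keep.size(); i++) {
                    out[c][i] = data.columns[c][keep.get(i)];
                }
            }
            return new ColumnBatch(data.names, out);
        }
    }
}
```

      Returning the residual filters from pushFilters is one possible design choice: it lets a source accept only what it can evaluate while the engine stays responsible for correctness. Limit and sampling (goal 4) could plausibly be offered to the source the same way, but that is not shown here.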

      Note that both 1 and 2 address problems that the current data source API (v1) suffers from. The current API has a wide surface that depends on DataFrame/SQLContext, which ties data source API compatibility to the upper-level API. It is also row-oriented only and has to go through an expensive conversion from external data types to internal data types.
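      The convenience conversion from a row-oriented format into a column batch, mentioned under goal 2, could be as small as a transpose. The sketch below is an illustration under stated assumptions, not a Spark API: the class RowToColumnSketch and the method toColumnBatch are hypothetical, and every column is assumed to hold long values.

```java
import java.util.List;

// Hypothetical helper, not part of Spark: turns row-oriented data into per-column arrays.
public class RowToColumnSketch {

    /**
     * Transposes rows (one long[] per row) into columns (one long[] per column),
     * so that result[c][r] holds column c of row r.
     */
    public static long[][] toColumnBatch(List<long[]> rows, int numColumns) {
        long[][] columns = new long[numColumns][rows.size()];
        for (int r = 0; r < rows.size(); r++) {
            long[] row = rows.get(r);
            if (row.length != numColumns) {
                throw new IllegalArgumentException(
                    "row " + r + " has " + row.length + " values, expected " + numColumns);
            }
            for (int c = 0; c < numColumns; c++) {
                columns[c][r] = row[c];
            }
        }
        return columns;
    }
}
```

      A source developer who only has row-at-a-time access could call such a helper once per batch, and the values stay in primitive arrays instead of being wrapped in per-row objects.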


          People

            Wenchen Fan (cloud_fan)
            Reynold Xin (rxin)
            Reynold Xin
            Votes: 2
            Watchers: 96

            Dates

              Created:
              Updated:
              Resolved:
