Spark / SPARK-44076

SPIP: Python Data Source API

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      This proposal introduces a simple Python API for data sources, so that Python developers can create data sources without learning Scala or dealing with the complexity of the current data source APIs. The goal is to make Spark more accessible to the wider Python developer community. The proposed approach builds on the recently introduced Python user-defined table functions (SPARK-43797), extended to support data sources.

      SPIP: https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing
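
      To make the description concrete, here is a minimal sketch of what a Python data source on the read path might look like. It assumes the `pyspark.sql.datasource` module that resulted from this work; the `CounterDataSource` and `CounterReader` classes and the `total` and `partitions` options are hypothetical names invented for illustration. The fallback stubs exist only so the sketch runs without a compatible PySpark installed and are not part of the API.

      ```python
      try:
          from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition
      except ImportError:
          # Stand-in stubs (assumption: they mirror only the constructor shapes
          # used below) so the sketch can run without a compatible PySpark.
          class DataSource:
              def __init__(self, options):
                  self.options = options

          class DataSourceReader:
              pass

          class InputPartition:
              def __init__(self, value):
                  self.value = value


      class CounterReader(DataSourceReader):
          """Hypothetical reader that yields (id, squared) rows for ids 0..total-1."""

          def __init__(self, options):
              # Option values arrive as strings, so convert them explicitly.
              self.total = int(options.get("total", "10"))
              self.num_partitions = max(1, int(options.get("partitions", "2")))

          def partitions(self):
              # Split [0, total) into contiguous half-open ranges, one per partition.
              step = max(1, -(-self.total // self.num_partitions))  # ceiling division
              return [
                  InputPartition((lo, min(lo + step, self.total)))
                  for lo in range(0, self.total, step)
              ]

          def read(self, partition):
              # Called once per partition; yields one tuple per output row.
              lo, hi = partition.value
              for i in range(lo, hi):
                  yield (i, i * i)


      class CounterDataSource(DataSource):
          """Hypothetical data source exposing CounterReader under the name 'counter'."""

          @classmethod
          def name(cls):
              return "counter"

          def schema(self):
              # Schema expressed as a DDL string.
              return "id INT, squared INT"

          def reader(self, schema):
              return CounterReader(self.options)
      ```

      With a running session, a class like this would presumably be registered with `spark.dataSource.register(CounterDataSource)` and then loaded with `spark.read.format("counter").load()`; the reader itself is plain Python and can be exercised without Spark.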

      Sub-Tasks

        1. Initial support for Python data source read API (Sub-task, Resolved, Allison Wang)
        2. Support registering Python data sources (Sub-task, Resolved, Allison Wang)
        3. Support loading Python data sources in DataFrameReader (Sub-task, Resolved, Allison Wang)
        4. Add InputPartition to DataSourceReader interface (Sub-task, Resolved, Allison Wang)
        5. Add Python data source write API (Sub-task, Resolved, Allison Wang)
        6. Make Python data source registration session level (Sub-task, Resolved, Allison Wang)
        7. Plan Python data source read using mapInArrow (Sub-task, Resolved, Allison Wang)
        8. Change saveMode to overwrite for DataSourceWriter constructor (Sub-task, Resolved, Allison Wang)
        9. Support spark.read.schema(...) for Python data source API (Sub-task, Resolved, Unassigned)
        10. Respect column names when Python data source read function outputs named Row objects (Sub-task, Resolved, Allison Wang)
        11. Initial support for Python data source write API (Sub-task, Resolved, Allison Wang)
        12. Support spark.read.load() with non-empty path for Python data source API (Sub-task, Open, Unassigned)
        13. Support creating table using a Python data source in SQL (Sub-task, Resolved, Hyukjin Kwon)
        14. Support `commit` and `abort` API for Python data source write (Sub-task, Resolved, Allison Wang)
        15. Support overwrite mode for Python data source write (Sub-task, Resolved, Allison Wang)
        16. Investigate runtime registration and feasibility of overwriting the datasource (Sub-task, Resolved, Unassigned)
        17. Statically register Python Data Source (Sub-task, Resolved, Hyukjin Kwon)
        18. Update `path` handling in Python data source (Sub-task, Resolved, Allison Wang)
        19. Allow non-deterministic Python UDFs in MapInPandas/MapInArrow (Sub-task, Resolved, Allison Wang)
        20. Support create table using DSv2 sources (Sub-task, Resolved, Allison Wang)
        21. Support CTAS using DSv2 sources (Sub-task, Resolved, Allison Wang)
        22. Support INSERT INTO/OVERWRITE using DSv2 sources (Sub-task, Resolved, Allison Wang)
        23. Add documentation for Python data source API (Sub-task, Resolved, Allison Wang)
        24. Refactor Python Data Source instance loading (Sub-task, Resolved, Hyukjin Kwon)
        25. Support PythonSQLMetrics.pythonMetrics (Sub-task, Resolved, Hyukjin Kwon)
        26. Add a new API in DSv2 DataWriter to write an iterator of records (Sub-task, Resolved, Allison Wang)
        27. Block Python data source registration with name conflicts (Sub-task, Resolved, Allison Wang)
        28. Improve error messages for invalid save mode (Sub-task, Resolved, Allison Wang)
        29. Check Python executable when looking up available Data Sources (Sub-task, Resolved, Hyukjin Kwon)
        30. Improve Python data source error classes and messages (Sub-task, Resolved, Allison Wang)
        31. Python data source options should be a case insensitive dictionary (Sub-task, Resolved, Allison Wang)
        32. Improve error messages for unsupported data source save mode (Sub-task, Resolved, Allison Wang)
        33. Log full exception when failed to lookup Python Data Sources (Sub-task, Resolved, Hyukjin Kwon)
        34. Disallow re-registration of statically registered data sources (Sub-task, Open, Unassigned)
        35. Improve error messages for DATA_SOURCE_NOT_FOUND error (Sub-task, Resolved, Allison Wang)
        36. Make DataSourceManager isolated and self clone-able (Sub-task, Resolved, Hyukjin Kwon)
        37. Refactor Python Data Source to align with other built-in Data Sources (Sub-task, Resolved, Hyukjin Kwon)
        38. Skip test_datasource if PyArrow is not installed (Sub-task, Resolved, Hyukjin Kwon)
        39. Skip V2 table lookup when a table is in V1 table cache (Sub-task, Resolved, Allison Wang)
        40. Make daemon mode configurable when creating Python workers (Sub-task, Resolved, Allison Wang)
        41. Support Python data source API with Spark Connect (Sub-task, Resolved, Allison Wang)
        42. Fix docstring links and type hints in Python Data Source (Sub-task, Resolved, Hyukjin Kwon)
        43. Document Python Data Source API in API reference page (Sub-task, Resolved, Hyukjin Kwon)
        44. Remove the private[sql] modifier for Python data sources (Sub-task, Resolved, Allison Wang)
        45. Add user guide for batch data source write API (Sub-task, Resolved, Allison Wang)
        46. Refine Python data source API docstring and type hints (Sub-task, Resolved, Allison Wang)
        47. Fix Python data source error class references (Sub-task, Resolved, Allison Wang)
        48. Add a simple data source example in the user guide (Sub-task, Resolved, Allison Wang)
        49. Make static import Python data source configurable (Sub-task, Open, Unassigned)
        50. Avoid static Python data source lookup when using builtin or Java data sources (Sub-task, Resolved, Allison Wang)
        51. Enhance Python Datasource Reader with Arrow Batch Support for Improved Performance (Sub-task, Resolved, Luca Canali)
        52. Support Arrow-Based Python Data Source Writer (Sub-task, Resolved, Allison Wang)
        53. Avoid wrapping Python data source error messages thrown during planning (Sub-task, Resolved, Allison Wang)
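
        Several of the sub-tasks above concern the write path: the write API itself, the `commit` and `abort` hooks, and the `overwrite` flag passed to the writer. The following is a rough sketch of how those pieces might fit together, again assuming the `pyspark.sql.datasource` module. `CountingSink`, `CountingWriter`, and `RowCountMessage` are hypothetical names, the writer only counts rows instead of persisting them, and the fallback stubs are there purely so the sketch runs without a compatible PySpark.

        ```python
        from dataclasses import dataclass

        try:
            from pyspark.sql.datasource import (
                DataSource,
                DataSourceWriter,
                WriterCommitMessage,
            )
        except ImportError:
            # Stand-in stubs (assumption: only the shapes used below are mirrored).
            class DataSource:
                def __init__(self, options):
                    self.options = options

            class DataSourceWriter:
                pass

            class WriterCommitMessage:
                pass


        @dataclass
        class RowCountMessage(WriterCommitMessage):
            """Hypothetical commit message reporting how many rows one task wrote."""
            count: int


        class CountingWriter(DataSourceWriter):
            """Hypothetical writer that counts rows rather than storing them."""

            def __init__(self, options, overwrite):
                self.options = options
                self.overwrite = overwrite  # True when the save mode is overwrite
                self.committed_rows = None
                self.discarded_rows = None
                self.aborted = False

            def write(self, iterator):
                # Runs once per task; consumes that task's rows and reports a message.
                return RowCountMessage(count=sum(1 for _ in iterator))

            def commit(self, messages):
                # Runs once after every task has succeeded.
                self.committed_rows = sum(m.count for m in messages if m is not None)

            def abort(self, messages):
                # Runs once if the job fails; a failed task may contribute None
                # instead of a message, so skip those entries.
                self.aborted = True
                self.discarded_rows = sum(m.count for m in messages if m is not None)


        class CountingSink(DataSource):
            @classmethod
            def name(cls):
                return "counting_sink"

            def writer(self, schema, overwrite):
                return CountingWriter(self.options, overwrite)
        ```

        In a real job the `write` calls run on executors while `commit` and `abort` run on the driver, so state set inside `write` is not visible to `commit`; the commit messages are the channel between them, which is why the sketch passes row counts through `RowCountMessage`.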

          People

            Assignee: Unassigned
            Reporter: Allison Wang (allisonwang-db)
            Shepherd: Hyukjin Kwon
            Votes: 0
            Watchers: 2
