Spark / SPARK-44076 SPIP: Python Data Source API / SPARK-45639

Support loading Python data sources in DataFrameReader


Details

    Description

      Allow users to read from a Python data source in PySpark using `spark.read.format(...).load()`.

      Users can extend the `DataSource` and `DataSourceReader` classes to implement their own Python data source and its reader, and then load it in PySpark. For example:

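      from pyspark.sql.datasource import DataSource, DataSourceReader

      class MyReader(DataSourceReader):
          def read(self, partition):
              # Each yielded tuple is one row, in schema order (id, value).
              yield (0, 1)

      class MyDataSource(DataSource):
          def schema(self):
              # Schema expressed as a DDL string.
              return "id INT, value INT"

          def reader(self, schema):
              return MyReader()

      # Assumes an active SparkSession named `spark`. In released PySpark (4.0+),
      # the data source is registered before it is loaded by name; the name
      # defaults to the class name.
      spark.dataSource.register(MyDataSource)
      df = spark.read.format("MyDataSource").load()
      df.show()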
      +---+-----+
      | id|value|
      +---+-----+
      |  0|    1|
      +---+-----+
      

       

            People

              allisonwang-db Allison Wang