Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-47793

Implement SimpleDataSourceStreamReader for python streaming data source

    XMLWordPrintableJSON

Details

    Description

       SimpleDataSourceStreamReader is a simplified version of the DataStreamReader interface.

      1. It doesn’t require developers to reason about data partitioning.
      2. It doesn’t require getting the latest offset before reading data.

      There are 3 functions that needs to be defined 

      1. Read data and return the end offset.

      def read(self, start: Offset) -> (Iterator[Tuple], Offset)

      2. Read data between start and end offset, this is required for exactly once read.

      def read2(self, start: Offset, end: Offset) -> Iterator[Tuple]

      3. initial start offset of the streaming query.

      def initialOffset() -> dict

      Implementation: Wrap the SimpleDataSourceStreamReader instance in a DataSourceStreamReader internally and make the prefetching and caching transparent to the data source developer. The record prefetched in python process will be sent to JVM as arrow record batches.

      Attachments

        Activity

          People

            Chaoqin Chaoqin Li
            Chaoqin Chaoqin Li
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: