Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-8360

Structured Streaming (aka Streaming DataFrames)

    XMLWordPrintableJSON

Details

    • Umbrella
    • Status: Resolved
    • Major
    • Resolution: Implemented
    • None
    • 2.1.0
    • Structured Streaming
    • None

    Description

      Umbrella ticket to track what's needed to make streaming DataFrame a reality.

      Attachments

        Issue Links

          1.
          API design: convergence of batch and streaming DataFrame Sub-task Resolved Reynold Xin
          2.
          Initial infrastructure Sub-task Resolved Michael Armbrust
          3.
          API design: external state management Sub-task Closed Unassigned
          4.
          API for managing streaming dataframes Sub-task Resolved Tathagata Das
          5.
          Add FileStreamSource Sub-task Resolved Shixiong Zhu
          6.
          Remove DataStreamReader/Writer Sub-task Resolved Reynold Xin
          7.
          Rename DataFrameWriter.stream DataFrameWriter.startStream Sub-task Resolved Reynold Xin
          8.
          State Store: A new framework for state management for computing Streaming Aggregates Sub-task Resolved Tathagata Das
          9.
          Old streaming DataFrame proposal by Cheng Hao (Intel) Sub-task Closed Cheng Hao
          10.
          WAL for determistic batches with IDs Sub-task Resolved Michael Armbrust
          11.
          Simple FileSink for Parquet Sub-task Resolved Michael Armbrust
          12.
          Windowing for structured streaming Sub-task Resolved Burak Yavuz
          13.
          Add processing time trigger Sub-task Resolved Shixiong Zhu
          14.
          Streaming Aggregation Sub-task Resolved Michael Armbrust
          15.
          Method to determine if Dataset is bounded or not Sub-task Resolved Burak Yavuz
          16.
          Memory Sink Sub-task Resolved Michael Armbrust
          17.
          Define analysis rules for operations not supported in streaming Sub-task Resolved Tathagata Das
          18.
          Python API for methods introduced for Structured Streaming Sub-task Resolved Burak Yavuz
          19.
          Add partitioned parquet support file stream sink Sub-task Resolved Tathagata Das
          20.
          Refactor DataSource to ensure schema is inferred only once when creating a file stream Sub-task Resolved Tathagata Das
          21.
          Refactor StreamTests to test for source fault-tolerance correctly. Sub-task Resolved Tathagata Das
          22.
          Add support in file stream source for reading new files added to subdirs Sub-task Resolved Tathagata Das
          23.
          Add support for batch jobs correctly inferring partitions from data written with file stream sink Sub-task Resolved Tathagata Das
          24.
          Disable support for multiple streaming aggregations Sub-task Resolved Tathagata Das
          25.
          Disable schema inference for streaming datasets on file streams Sub-task Resolved Tathagata Das
          26.
          Add support for complete output mode Sub-task Resolved Tathagata Das
          27.
          Make continuous Parquet writes consistent with non-continuous Parquet writes Sub-task Closed Unassigned
          28.
          Allow sorting on aggregated streaming dataframe when the output mode is Complete Sub-task Resolved Tathagata Das
          29.
          Add support for socket stream. Sub-task Closed Prashant Sharma
          30.
          Add DataFrameWriter.foreach to allow the user consuming data in ContinuousQuery Sub-task Resolved Shixiong Zhu
          31.
          Add a unique id to ContinuousQuery Sub-task Resolved Tathagata Das
          32.
          Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer Sub-task Resolved Tathagata Das
          33.
          Renamed ContinuousQuery to StreamingQuery for simplicity Sub-task Resolved Tathagata Das
          34.
          Fix bug in python DataStreamReader Sub-task Resolved Tathagata Das
          35.
          Properly explain the streaming queries Sub-task Resolved Shixiong Zhu
          36.
          Fix complete mode aggregation with console sink Sub-task Resolved Shixiong Zhu
          37.
          Sleep when no new data arrives to avoid 100% CPU usage Sub-task Resolved Shixiong Zhu
          38.
          Enable test for sql/streaming.py and fix these tests Sub-task Resolved Shixiong Zhu
          39.
          HDFSMetadataLog.get leaks the input stream Sub-task Resolved Shixiong Zhu
          40.
          Add ContinuousQueryInfo to make ContinuousQueryListener events serializable Sub-task Resolved Shixiong Zhu
          41.
          Add network word count example Sub-task Resolved James Thomas
          42.
          StreamExecution.awaitOffset may take too long because of thread starvation Sub-task Resolved Shixiong Zhu
          43.
          Fix flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering" Sub-task Resolved Shixiong Zhu
          44.
          Add a file sink log to support versioning and compaction Sub-task Resolved Shixiong Zhu
          45.
          Fix a race condition in StreamExecution.processAllAvailable Sub-task Resolved Shixiong Zhu
          46.
          Fix the race conditions in MemoryStream and MemorySink Sub-task Resolved Shixiong Zhu
          47.
          Move FileSource offset log into checkpointLocation Sub-task Resolved Shixiong Zhu
          48.
          Add a note to warn that onQueryProgress is asynchronous Sub-task Resolved Shixiong Zhu
          49.
          QueryProgress should be post after committedOffsets is updated Sub-task Resolved Shixiong Zhu
          50.
          StateStoreCoordinator should extend ThreadSafeRpcEndpoint Sub-task Resolved Shixiong Zhu
          51.
          Allow multiple continuous queries to be started from the same DataFrame Sub-task Resolved Shixiong Zhu
          52.
          Add a workaround for HADOOP-10622 to fix DataFrameReaderWriterSuite Sub-task Resolved Shixiong Zhu
          53.
          Add MetadataLog and HDFSMetadataLog Sub-task Resolved Shixiong Zhu
          54.
          ContinuousQueryManagerSuite floods the logs with garbage Sub-task Resolved Shixiong Zhu
          55.
          Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering Sub-task Resolved Shixiong Zhu
          56.
          Add ConsoleSink for structure streaming to display the dataframe on the fly Sub-task Resolved Saisai Shao
          57.
          Flaky Test: Complete aggregation with Console sink Sub-task Resolved Shixiong Zhu
          58.
          ConsoleSink should not require checkpointLocation Sub-task Resolved Shixiong Zhu
          59.
          Add Structured Streaming Programming Guide Sub-task Resolved Tathagata Das
          60.
          Move python DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming package Sub-task Resolved Tathagata Das
          61.
          Add an option in file stream source to read 1 file at a time Sub-task Resolved Tathagata Das
          62.
          Fix StreamingQueryListener to return message and stacktrace of actual exception Sub-task Resolved Tathagata Das
          63.
          Running a file stream on a directory with partitioned subdirs throw NotSerializableException/StackOverflowError Sub-task Resolved Tathagata Das
          64.
          Metrics for Structured Streaming Sub-task Resolved Tathagata Das
          65.
          Add methods to convert StreamingQueryStatus to json Sub-task Resolved Tathagata Das
          66.
          History Server is broken because of the refactoring work in Structured Streaming Sub-task Resolved Shixiong Zhu
          67.
          ForeachSink should fail the Spark job if `process` throws exception Sub-task Resolved Shixiong Zhu
          68.
          State Store leaks temporary files Sub-task Resolved Tathagata Das
          69.
          Fix FileStreamSink with aggregation + watermark + append mode Sub-task Resolved Tathagata Das
          70.
          Rename triggerId to batchId in StreamingQueryStatus.triggerDetails Sub-task Resolved Tathagata Das
          71.
          Include triggerDetails in StreamingQueryStatus.json Sub-task Resolved Tathagata Das
          72.
          Improve docs on StreamingQueryListener and StreamingQuery.status Sub-task Resolved Tathagata Das
          73.
          Add StreamingQuery.status in python Sub-task Closed Tathagata Das
          74.
          Enable interrupts for HDFS in HDFSMetadataLog Sub-task Resolved Shixiong Zhu

          Activity

            People

              marmbrus Michael Armbrust
              rxin Reynold Xin
              Votes:
              30 Vote for this issue
              Watchers:
              92 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: