Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-8360

Structured Streaming (aka Streaming DataFrames)

Rank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Umbrella
    • Status: Resolved
    • Major
    • Resolution: Implemented
    • None
    • 2.1.0
    • Structured Streaming
    • None

    Description

      Umbrella ticket to track what's needed to make streaming DataFrame a reality.

      Attachments

        Issue Links

        1.
        API design: convergence of batch and streaming DataFrame Sub-task Resolved Reynold Xin Actions
        2.
        Initial infrastructure Sub-task Resolved Michael Armbrust Actions
        3.
        API design: external state management Sub-task Closed Unassigned Actions
        4.
        API for managing streaming dataframes Sub-task Resolved Tathagata Das Actions
        5.
        Add FileStreamSource Sub-task Resolved Shixiong Zhu Actions
        6.
        Remove DataStreamReader/Writer Sub-task Resolved Reynold Xin Actions
        7.
        Rename DataFrameWriter.stream DataFrameWriter.startStream Sub-task Resolved Reynold Xin Actions
        8.
        State Store: A new framework for state management for computing Streaming Aggregates Sub-task Resolved Tathagata Das Actions
        9.
        Old streaming DataFrame proposal by Cheng Hao (Intel) Sub-task Closed Cheng Hao Actions
        10.
        WAL for determistic batches with IDs Sub-task Resolved Michael Armbrust Actions
        11.
        Simple FileSink for Parquet Sub-task Resolved Michael Armbrust Actions
        12.
        Windowing for structured streaming Sub-task Resolved Burak Yavuz Actions
        13.
        Add processing time trigger Sub-task Resolved Shixiong Zhu Actions
        14.
        Streaming Aggregation Sub-task Resolved Michael Armbrust Actions
        15.
        Method to determine if Dataset is bounded or not Sub-task Resolved Burak Yavuz Actions
        16.
        Memory Sink Sub-task Resolved Michael Armbrust Actions
        17.
        Define analysis rules for operations not supported in streaming Sub-task Resolved Tathagata Das Actions
        18.
        Python API for methods introduced for Structured Streaming Sub-task Resolved Burak Yavuz Actions
        19.
        Add partitioned parquet support file stream sink Sub-task Resolved Tathagata Das Actions
        20.
        Refactor DataSource to ensure schema is inferred only once when creating a file stream Sub-task Resolved Tathagata Das Actions
        21.
        Refactor StreamTests to test for source fault-tolerance correctly. Sub-task Resolved Tathagata Das Actions
        22.
        Add support in file stream source for reading new files added to subdirs Sub-task Resolved Tathagata Das Actions
        23.
        Add support for batch jobs correctly inferring partitions from data written with file stream sink Sub-task Resolved Tathagata Das Actions
        24.
        Disable support for multiple streaming aggregations Sub-task Resolved Tathagata Das Actions
        25.
        Disable schema inference for streaming datasets on file streams Sub-task Resolved Tathagata Das Actions
        26.
        Add support for complete output mode Sub-task Resolved Tathagata Das Actions
        27.
        Make continuous Parquet writes consistent with non-continuous Parquet writes Sub-task Closed Unassigned Actions
        28.
        Allow sorting on aggregated streaming dataframe when the output mode is Complete Sub-task Resolved Tathagata Das Actions
        29.
        Add support for socket stream. Sub-task Closed Prashant Sharma Actions
        30.
        Add DataFrameWriter.foreach to allow the user consuming data in ContinuousQuery Sub-task Resolved Shixiong Zhu Actions
        31.
        Add a unique id to ContinuousQuery Sub-task Resolved Tathagata Das Actions
        32.
        Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer Sub-task Resolved Tathagata Das Actions
        33.
        Renamed ContinuousQuery to StreamingQuery for simplicity Sub-task Resolved Tathagata Das Actions
        34.
        Fix bug in python DataStreamReader Sub-task Resolved Tathagata Das Actions
        35.
        Properly explain the streaming queries Sub-task Resolved Shixiong Zhu Actions
        36.
        Fix complete mode aggregation with console sink Sub-task Resolved Shixiong Zhu Actions
        37.
        Sleep when no new data arrives to avoid 100% CPU usage Sub-task Resolved Shixiong Zhu Actions
        38.
        Enable test for sql/streaming.py and fix these tests Sub-task Resolved Shixiong Zhu Actions
        39.
        HDFSMetadataLog.get leaks the input stream Sub-task Resolved Shixiong Zhu Actions
        40.
        Add ContinuousQueryInfo to make ContinuousQueryListener events serializable Sub-task Resolved Shixiong Zhu Actions
        41.
        Add network word count example Sub-task Resolved James Thomas Actions
        42.
        StreamExecution.awaitOffset may take too long because of thread starvation Sub-task Resolved Shixiong Zhu Actions
        43.
        Fix flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering" Sub-task Resolved Shixiong Zhu Actions
        44.
        Add a file sink log to support versioning and compaction Sub-task Resolved Shixiong Zhu Actions
        45.
        Fix a race condition in StreamExecution.processAllAvailable Sub-task Resolved Shixiong Zhu Actions
        46.
        Fix the race conditions in MemoryStream and MemorySink Sub-task Resolved Shixiong Zhu Actions
        47.
        Move FileSource offset log into checkpointLocation Sub-task Resolved Shixiong Zhu Actions
        48.
        Add a note to warn that onQueryProgress is asynchronous Sub-task Resolved Shixiong Zhu Actions
        49.
        QueryProgress should be post after committedOffsets is updated Sub-task Resolved Shixiong Zhu Actions
        50.
        StateStoreCoordinator should extend ThreadSafeRpcEndpoint Sub-task Resolved Shixiong Zhu Actions
        51.
        Allow multiple continuous queries to be started from the same DataFrame Sub-task Resolved Shixiong Zhu Actions
        52.
        Add a workaround for HADOOP-10622 to fix DataFrameReaderWriterSuite Sub-task Resolved Shixiong Zhu Actions
        53.
        Add MetadataLog and HDFSMetadataLog Sub-task Resolved Shixiong Zhu Actions
        54.
        ContinuousQueryManagerSuite floods the logs with garbage Sub-task Resolved Shixiong Zhu Actions
        55.
        Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering Sub-task Resolved Shixiong Zhu Actions
        56.
        Add ConsoleSink for structure streaming to display the dataframe on the fly Sub-task Resolved Saisai Shao Actions
        57.
        Flaky Test: Complete aggregation with Console sink Sub-task Resolved Shixiong Zhu Actions
        58.
        ConsoleSink should not require checkpointLocation Sub-task Resolved Shixiong Zhu Actions
        59.
        Add Structured Streaming Programming Guide Sub-task Resolved Tathagata Das Actions
        60.
        Move python DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming package Sub-task Resolved Tathagata Das Actions
        61.
        Add an option in file stream source to read 1 file at a time Sub-task Resolved Tathagata Das Actions
        62.
        Fix StreamingQueryListener to return message and stacktrace of actual exception Sub-task Resolved Tathagata Das Actions
        63.
        Running a file stream on a directory with partitioned subdirs throw NotSerializableException/StackOverflowError Sub-task Resolved Tathagata Das Actions
        64.
        Metrics for Structured Streaming Sub-task Resolved Tathagata Das Actions
        65.
        Add methods to convert StreamingQueryStatus to json Sub-task Resolved Tathagata Das Actions
        66.
        History Server is broken because of the refactoring work in Structured Streaming Sub-task Resolved Shixiong Zhu Actions
        67.
        ForeachSink should fail the Spark job if `process` throws exception Sub-task Resolved Shixiong Zhu Actions
        68.
        State Store leaks temporary files Sub-task Resolved Tathagata Das Actions
        69.
        Fix FileStreamSink with aggregation + watermark + append mode Sub-task Resolved Tathagata Das Actions
        70.
        Rename triggerId to batchId in StreamingQueryStatus.triggerDetails Sub-task Resolved Tathagata Das Actions
        71.
        Include triggerDetails in StreamingQueryStatus.json Sub-task Resolved Tathagata Das Actions
        72.
        Improve docs on StreamingQueryListener and StreamingQuery.status Sub-task Resolved Tathagata Das Actions
        73.
        Add StreamingQuery.status in python Sub-task Closed Tathagata Das Actions
        74.
        Enable interrupts for HDFS in HDFSMetadataLog Sub-task Resolved Shixiong Zhu Actions

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            marmbrus Michael Armbrust
            rxin Reynold Xin
            Votes:
            30 Vote for this issue
            Watchers:
            92 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment