[SPARK-8360] Structured Streaming (aka Streaming DataFrames) - ASF JIRA

Details

Type: Umbrella
Status: Resolved
Priority: Major
Resolution: Implemented
Affects Version/s: None
Fix Version/s: 2.1.0
Component/s: Structured Streaming
Labels: None

Description

Umbrella ticket to track what's needed to make streaming DataFrame a reality.

Attachments

StructuredStreamingProgrammingAbstractionSemanticsandAPIs-ApacheJIRA.pdf	14/Mar/16 22:00	404 kB	Reynold Xin

Issue Links

incorporates

SPARK-16350 Complete output mode does not output updated aggregated value in Structured Streaming

Resolved

is duplicated by

SPARK-1363 Add streaming support for Spark SQL module

Resolved

relates to

SPARK-9999 Dataset API on top of Catalyst/DataFrame

Resolved

links to

Structured Streaming Programming Abstraction, Semantics, and APIs - Google Docs version

Sub-Tasks

#	Summary	Status	Assignee
1.	API design: convergence of batch and streaming DataFrame	Resolved	Reynold Xin
2.	Initial infrastructure	Resolved	Michael Armbrust
3.	API design: external state management	Closed	Unassigned
4.	API for managing streaming dataframes	Resolved	Tathagata Das
5.	Add FileStreamSource	Resolved	Shixiong Zhu
6.	Remove DataStreamReader/Writer	Resolved	Reynold Xin
7.	Rename DataFrameWriter.stream to DataFrameWriter.startStream	Resolved	Reynold Xin
8.	State Store: A new framework for state management for computing Streaming Aggregates	Resolved	Tathagata Das
9.	Old streaming DataFrame proposal by Cheng Hao (Intel)	Closed	Cheng Hao
10.	WAL for deterministic batches with IDs	Resolved	Michael Armbrust
11.	Simple FileSink for Parquet	Resolved	Michael Armbrust
12.	Windowing for structured streaming	Resolved	Burak Yavuz
13.	Add processing time trigger	Resolved	Shixiong Zhu
14.	Streaming Aggregation	Resolved	Michael Armbrust
15.	Method to determine if Dataset is bounded or not	Resolved	Burak Yavuz
16.	Memory Sink	Resolved	Michael Armbrust
17.	Define analysis rules for operations not supported in streaming	Resolved	Tathagata Das
18.	Python API for methods introduced for Structured Streaming	Resolved	Burak Yavuz
19.	Add partitioned parquet support file stream sink	Resolved	Tathagata Das
20.	Refactor DataSource to ensure schema is inferred only once when creating a file stream	Resolved	Tathagata Das
21.	Refactor StreamTests to test for source fault-tolerance correctly.	Resolved	Tathagata Das
22.	Add support in file stream source for reading new files added to subdirs	Resolved	Tathagata Das
23.	Add support for batch jobs correctly inferring partitions from data written with file stream sink	Resolved	Tathagata Das
24.	Disable support for multiple streaming aggregations	Resolved	Tathagata Das
25.	Disable schema inference for streaming datasets on file streams	Resolved	Tathagata Das
26.	Add support for complete output mode	Resolved	Tathagata Das
27.	Make continuous Parquet writes consistent with non-continuous Parquet writes	Closed	Unassigned
28.	Allow sorting on aggregated streaming dataframe when the output mode is Complete	Resolved	Tathagata Das
29.	Add support for socket stream.	Closed	Prashant Sharma
30.	Add DataFrameWriter.foreach to allow the user to consume data in ContinuousQuery	Resolved	Shixiong Zhu
31.	Add a unique id to ContinuousQuery	Resolved	Tathagata Das
32.	Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer	Resolved	Tathagata Das
33.	Renamed ContinuousQuery to StreamingQuery for simplicity	Resolved	Tathagata Das
34.	Fix bug in python DataStreamReader	Resolved	Tathagata Das
35.	Properly explain the streaming queries	Resolved	Shixiong Zhu
36.	Fix complete mode aggregation with console sink	Resolved	Shixiong Zhu
37.	Sleep when no new data arrives to avoid 100% CPU usage	Resolved	Shixiong Zhu
38.	Enable test for sql/streaming.py and fix these tests	Resolved	Shixiong Zhu
39.	HDFSMetadataLog.get leaks the input stream	Resolved	Shixiong Zhu
40.	Add ContinuousQueryInfo to make ContinuousQueryListener events serializable	Resolved	Shixiong Zhu
41.	Add network word count example	Resolved	James Thomas
42.	StreamExecution.awaitOffset may take too long because of thread starvation	Resolved	Shixiong Zhu
43.	Fix flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering"	Resolved	Shixiong Zhu
44.	Add a file sink log to support versioning and compaction	Resolved	Shixiong Zhu
45.	Fix a race condition in StreamExecution.processAllAvailable	Resolved	Shixiong Zhu
46.	Fix the race conditions in MemoryStream and MemorySink	Resolved	Shixiong Zhu
47.	Move FileSource offset log into checkpointLocation	Resolved	Shixiong Zhu
48.	Add a note to warn that onQueryProgress is asynchronous	Resolved	Shixiong Zhu
49.	QueryProgress should be posted after committedOffsets is updated	Resolved	Shixiong Zhu
50.	StateStoreCoordinator should extend ThreadSafeRpcEndpoint	Resolved	Shixiong Zhu
51.	Allow multiple continuous queries to be started from the same DataFrame	Resolved	Shixiong Zhu
52.	Add a workaround for HADOOP-10622 to fix DataFrameReaderWriterSuite	Resolved	Shixiong Zhu
53.	Add MetadataLog and HDFSMetadataLog	Resolved	Shixiong Zhu
54.	ContinuousQueryManagerSuite floods the logs with garbage	Resolved	Shixiong Zhu
55.	Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering	Resolved	Shixiong Zhu
56.	Add ConsoleSink for structured streaming to display the dataframe on the fly	Resolved	Saisai Shao
57.	Flaky Test: Complete aggregation with Console sink	Resolved	Shixiong Zhu
58.	ConsoleSink should not require checkpointLocation	Resolved	Shixiong Zhu
59.	Add Structured Streaming Programming Guide	Resolved	Tathagata Das
60.	Move python DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming package	Resolved	Tathagata Das
61.	Add an option in file stream source to read 1 file at a time	Resolved	Tathagata Das
62.	Fix StreamingQueryListener to return message and stacktrace of actual exception	Resolved	Tathagata Das
63.	Running a file stream on a directory with partitioned subdirs throws NotSerializableException/StackOverflowError	Resolved	Tathagata Das
64.	Metrics for Structured Streaming	Resolved	Tathagata Das
65.	Add methods to convert StreamingQueryStatus to json	Resolved	Tathagata Das
66.	History Server is broken because of the refactoring work in Structured Streaming	Resolved	Shixiong Zhu
67.	ForeachSink should fail the Spark job if `process` throws exception	Resolved	Shixiong Zhu
68.	State Store leaks temporary files	Resolved	Tathagata Das
69.	Fix FileStreamSink with aggregation + watermark + append mode	Resolved	Tathagata Das
70.	Rename triggerId to batchId in StreamingQueryStatus.triggerDetails	Resolved	Tathagata Das
71.	Include triggerDetails in StreamingQueryStatus.json	Resolved	Tathagata Das
72.	Improve docs on StreamingQueryListener and StreamingQuery.status	Resolved	Tathagata Das
73.	Add StreamingQuery.status in python	Closed	Tathagata Das
74.	Enable interrupts for HDFS in HDFSMetadataLog	Resolved	Shixiong Zhu

People

Assignee: Michael Armbrust
Reporter: Reynold Xin
Votes: 30
Watchers: 92

Dates

Created: 14/Jun/15 07:26
Updated: 01/Nov/16 23:44
Resolved: 01/Nov/16 23:44
