[SPARK-8360] Structured Streaming (aka Streaming DataFrames) - ASF JIRA

Details

Type: Umbrella
Status: Resolved
Priority: Major
Resolution: Implemented
Affects Version/s: None
Fix Version/s: 2.1.0
Component/s: Structured Streaming
Labels: None

Description

Umbrella ticket to track what's needed to make streaming DataFrame a reality.

Attachments

StructuredStreamingProgrammingAbstractionSemanticsandAPIs-ApacheJIRA.pdf
14/Mar/16 22:00
404 kB
Reynold Xin

Issue Links

incorporates

SPARK-16350 Complete output mode does not output updated aggregated value in Structured Streaming

Resolved

is duplicated by

SPARK-1363 Add streaming support for Spark SQL module

Resolved

relates to

SPARK-9999 Dataset API on top of Catalyst/DataFrame

Resolved

links to

Structured Streaming Programming Abstraction, Semantics, and APIs - Google Docs version

Sub-Tasks

1.	API design: convergence of batch and streaming DataFrame	Resolved	Reynold Xin
2.	Initial infrastructure	Resolved	Michael Armbrust
3.	API design: external state management	Closed	Unassigned
4.	API for managing streaming dataframes	Resolved	Tathagata Das
5.	Add FileStreamSource	Resolved	Shixiong Zhu
6.	Remove DataStreamReader/Writer	Resolved	Reynold Xin
7.	Rename DataFrameWriter.stream to DataFrameWriter.startStream	Resolved	Reynold Xin
8.	State Store: A new framework for state management for computing Streaming Aggregates	Resolved	Tathagata Das
9.	Old streaming DataFrame proposal by Cheng Hao (Intel)	Closed	Cheng Hao
10.	WAL for deterministic batches with IDs	Resolved	Michael Armbrust
11.	Simple FileSink for Parquet	Resolved	Michael Armbrust
12.	Windowing for structured streaming	Resolved	Burak Yavuz
13.	Add processing time trigger	Resolved	Shixiong Zhu
14.	Streaming Aggregation	Resolved	Michael Armbrust
15.	Method to determine if Dataset is bounded or not	Resolved	Burak Yavuz
16.	Memory Sink	Resolved	Michael Armbrust
17.	Define analysis rules for operations not supported in streaming	Resolved	Tathagata Das
18.	Python API for methods introduced for Structured Streaming	Resolved	Burak Yavuz
19.	Add partitioned parquet support file stream sink	Resolved	Tathagata Das
20.	Refactor DataSource to ensure schema is inferred only once when creating a file stream	Resolved	Tathagata Das
21.	Refactor StreamTests to test for source fault-tolerance correctly.	Resolved	Tathagata Das
22.	Add support in file stream source for reading new files added to subdirs	Resolved	Tathagata Das
23.	Add support for batch jobs correctly inferring partitions from data written with file stream sink	Resolved	Tathagata Das
24.	Disable support for multiple streaming aggregations	Resolved	Tathagata Das
25.	Disable schema inference for streaming datasets on file streams	Resolved	Tathagata Das
26.	Add support for complete output mode	Resolved	Tathagata Das
27.	Make continuous Parquet writes consistent with non-continuous Parquet writes	Closed	Unassigned
28.	Allow sorting on aggregated streaming dataframe when the output mode is Complete	Resolved	Tathagata Das
29.	Add support for socket stream.	Closed	Prashant Sharma
30.	Add DataFrameWriter.foreach to allow the user to consume data in ContinuousQuery	Resolved	Shixiong Zhu
31.	Add a unique id to ContinuousQuery	Resolved	Tathagata Das
32.	Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer	Resolved	Tathagata Das
33.	Renamed ContinuousQuery to StreamingQuery for simplicity	Resolved	Tathagata Das
34.	Fix bug in python DataStreamReader	Resolved	Tathagata Das
35.	Properly explain the streaming queries	Resolved	Shixiong Zhu
36.	Fix complete mode aggregation with console sink	Resolved	Shixiong Zhu
37.	Sleep when no new data arrives to avoid 100% CPU usage	Resolved	Shixiong Zhu
38.	Enable test for sql/streaming.py and fix these tests	Resolved	Shixiong Zhu
39.	HDFSMetadataLog.get leaks the input stream	Resolved	Shixiong Zhu
40.	Add ContinuousQueryInfo to make ContinuousQueryListener events serializable	Resolved	Shixiong Zhu
41.	Add network word count example	Resolved	James Thomas
42.	StreamExecution.awaitOffset may take too long because of thread starvation	Resolved	Shixiong Zhu
43.	Fix flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering"	Resolved	Shixiong Zhu
44.	Add a file sink log to support versioning and compaction	Resolved	Shixiong Zhu
45.	Fix a race condition in StreamExecution.processAllAvailable	Resolved	Shixiong Zhu
46.	Fix the race conditions in MemoryStream and MemorySink	Resolved	Shixiong Zhu
47.	Move FileSource offset log into checkpointLocation	Resolved	Shixiong Zhu
48.	Add a note to warn that onQueryProgress is asynchronous	Resolved	Shixiong Zhu
49.	QueryProgress should be posted after committedOffsets is updated	Resolved	Shixiong Zhu
50.	StateStoreCoordinator should extend ThreadSafeRpcEndpoint	Resolved	Shixiong Zhu
51.	Allow multiple continuous queries to be started from the same DataFrame	Resolved	Shixiong Zhu
52.	Add a workaround for HADOOP-10622 to fix DataFrameReaderWriterSuite	Resolved	Shixiong Zhu
53.	Add MetadataLog and HDFSMetadataLog	Resolved	Shixiong Zhu
54.	ContinuousQueryManagerSuite floods the logs with garbage	Resolved	Shixiong Zhu
55.	Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering	Resolved	Shixiong Zhu
56.	Add ConsoleSink for structured streaming to display the dataframe on the fly	Resolved	Saisai Shao
57.	Flaky Test: Complete aggregation with Console sink	Resolved	Shixiong Zhu
58.	ConsoleSink should not require checkpointLocation	Resolved	Shixiong Zhu
59.	Add Structured Streaming Programming Guide	Resolved	Tathagata Das
60.	Move python DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming package	Resolved	Tathagata Das
61.	Add an option in file stream source to read 1 file at a time	Resolved	Tathagata Das
62.	Fix StreamingQueryListener to return message and stacktrace of actual exception	Resolved	Tathagata Das
63.	Running a file stream on a directory with partitioned subdirs throws NotSerializableException/StackOverflowError	Resolved	Tathagata Das
64.	Metrics for Structured Streaming	Resolved	Tathagata Das
65.	Add methods to convert StreamingQueryStatus to json	Resolved	Tathagata Das
66.	History Server is broken because of the refactoring work in Structured Streaming	Resolved	Shixiong Zhu
67.	ForeachSink should fail the Spark job if `process` throws exception	Resolved	Shixiong Zhu
68.	State Store leaks temporary files	Resolved	Tathagata Das
69.	Fix FileStreamSink with aggregation + watermark + append mode	Resolved	Tathagata Das
70.	Rename triggerId to batchId in StreamingQueryStatus.triggerDetails	Resolved	Tathagata Das
71.	Include triggerDetails in StreamingQueryStatus.json	Resolved	Tathagata Das
72.	Improve docs on StreamingQueryListener and StreamingQuery.status	Resolved	Tathagata Das
73.	Add StreamingQuery.status in python	Closed	Tathagata Das
74.	Enable interrupts for HDFS in HDFSMetadataLog	Resolved	Shixiong Zhu

People

Assignee: Michael Armbrust

Reporter: Reynold Xin

Votes: 30

Watchers: 92

Dates

Created: 14/Jun/15 07:26

Updated: 01/Nov/16 23:44

Resolved: 01/Nov/16 23:44