[SPARK-15458] Disable schema inference for streaming datasets on file streams - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: Structured Streaming
Labels: None
Target Version/s: 2.0.0

Description

A file stream query that relies on schema inference can break easily, for several reasons:

The query is accidentally run on a directory that has no data.
The schema of the files changes underneath the query.
On restart, the query infers the schema again and may unexpectedly infer an incorrect one, because the files in the directory may differ from those that were present when the query first started.
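
To make the first and third failure modes concrete, here is a toy sketch in plain Python. It is not Spark's inference algorithm: `infer_schema`, the file names, and the records are all hypothetical, and the "schema" is just a mapping from field name to the Python type name of the first value seen.

```python
import json
import tempfile
from pathlib import Path


def infer_schema(directory):
    """Toy inference (hypothetical, not Spark's): field name -> type name of first value seen."""
    schema = {}
    for path in sorted(Path(directory).glob("*.json")):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            for field, value in json.loads(line).items():
                schema.setdefault(field, type(value).__name__)
    if not schema:
        # Nothing to infer from: mirrors running on a directory that has no data.
        raise ValueError(f"cannot infer a schema: no data in {directory}")
    return schema


with tempfile.TemporaryDirectory() as d:
    # Case 1: the query is started on a directory with no data.
    try:
        infer_schema(d)
        empty_dir_failed = False
    except ValueError:
        empty_dir_failed = True

    # First start: the schema is inferred from the files present right now.
    Path(d, "part-0.json").write_text('{"id": 1, "name": "a"}\n')
    schema_at_start = infer_schema(d)

    # Before the restart, the old file is cleaned up and a differently shaped one arrives.
    Path(d, "part-0.json").unlink()
    Path(d, "part-1.json").write_text('{"id": "1", "score": 0.5}\n')
    schema_at_restart = infer_schema(d)

print(empty_dir_failed)
print(schema_at_start)
print(schema_at_restart)
```

The same query definition ends up with two different schemas depending on when it was started, which is the restart hazard described above.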

To avoid these complicated scenarios, Spark 2.0 will disable schema inference for file streams by default, controlled by a config, so that users are forced to state explicitly which schema they want, rather than having the system infer it and run into weird corner cases.
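
With inference disabled, a file stream has to be given its schema up front. A minimal PySpark sketch of what that looks like is below. The column names, the helper `read_json_stream`, and the input path are hypothetical, and the sketch assumes a PySpark version whose `DataStreamReader.schema` accepts a DDL-formatted string (2.3 and later, as far as we know; on 2.0 a `StructType` would be passed instead).

```python
# Hypothetical columns, written as a DDL-formatted schema string.
EVENT_SCHEMA = "id BIGINT, name STRING, score DOUBLE"


def read_json_stream(spark, input_dir, schema=EVENT_SCHEMA):
    """Return a streaming DataFrame over JSON files using a user-supplied schema.

    `spark` is an existing pyspark.sql.SparkSession. Because the schema is
    passed explicitly, nothing is inferred from whatever files happen to be
    in `input_dir` at start or restart time.
    """
    return spark.readStream.schema(schema).json(input_dir)
```

A caller would obtain a session with `SparkSession.builder.getOrCreate()` and then call `read_json_stream(spark, "/path/to/input")`. For ad-hoc use, the Structured Streaming guide describes re-enabling inference by setting `spark.sql.streaming.schemaInference` to `true`; that this is the config introduced by this issue is our assumption, since the description does not name it.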

Issue Links

links to

[Github] Pull Request #13238 (tdas)

People

Assignee: Tathagata Das

Reporter: Tathagata Das

Votes: 0

Watchers: 2

Dates

Created: 21/May/16 01:31

Updated: 01/Nov/16 22:15

Resolved: 24/May/16 21:28