Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 3.5.1
- Fix Version/s: None
- Component/s: None
Description
In the following Jira issue:
https://issues.apache.org/jira/browse/SPARK-31962
two new options, modifiedBefore and modifiedAfter, were introduced for batch reads (for example, CSV) and were eventually merged into version 3.1.1 via the PR linked from that issue.
These options were implemented so that batch reads accept them, but a stream is explicitly not allowed to use them.
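For reference, the batch reader already accepts these options; a minimal sketch, assuming an existing SparkSession named spark and the same source_path variable used in the examples below:
# Batch read filtering input files by modification time (supported since the change above)
df = spark.read.option("modifiedAfter", "2020-06-15T05:00:00").option("quote", '"').option("escape", '"').csv(source_path)
# The same options on spark.readStream are currently rejected for file streaming sources.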
When loading files from a data source as a stream, there can likewise be times when thousands of files sit under a given file path, so the ability to filter on modification time applies to both batch and stream use cases. Note: the Databricks "cloudFiles" Auto Loader supports these options in a stream.
https://docs.databricks.com/en/ingestion/auto-loader/options.html#id20
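For comparison, an Auto Loader stream using these options might look like the following (a sketch, assuming a Databricks runtime with Auto Loader available and the same source_path as above; schema-related options such as cloudFiles.schemaLocation are omitted for brevity):
# Auto Loader stream over CSV files, filtering inputs by modification time
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("modifiedAfter", "2019-06-15T05:00:00")
      .option("modifiedBefore", "2020-06-15T05:00:00")
      .load(source_path))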
Suggested Example Usages
Start stream with all CSV files modified after date:
spark.readStream.option("modifiedAfter","2020-06-15T05:00:00").option("quote", '"').option("escape", '"').csv(source_path)
Start stream with all CSV files modified before date:
spark.readStream.option("modifiedBefore","2020-06-15T05:00:00").option("quote", '"').option("escape", '"').csv(source_path)
Start stream with all CSV files modified between two dates:
spark.readStream.option("modifiedAfter","2019-06-15T05:00:00").{{{}option("modifiedBefore","2020-06-15T05:00:00")option("quote", '"').option("escape", '"').csv(source_path)}}