Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 3.5.1
- Fix Version/s: None
- Component/s: None
Description
In the following Jira issue:
https://issues.apache.org/jira/browse/SPARK-31962
two new options, modifiedBefore and modifiedAfter, were introduced for batch reads (for example, CSV) and were eventually merged into version 3.1.1 via the PR linked from that issue.
These options were implemented so that batch reads accept them, but a stream is explicitly not allowed to use them.
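For reference, the batch reader already accepts these options; a minimal sketch, assuming an existing SparkSession named spark and the same source_path variable used in the examples below:
# Batch read filtering input files by modification time (supported since the change above)
df = spark.read.option("modifiedAfter", "2020-06-15T05:00:00").option("quote", '"').option("escape", '"').csv(source_path)
# The same options on spark.readStream are currently rejected for file streaming sources.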
When loading files from a data source as a stream, there can likewise be times when thousands of files sit under a given file path, so the ability to filter on modification time applies to both batch and stream use cases. Note: the Databricks "cloudFiles" Auto Loader supports these options in a stream.
https://docs.databricks.com/en/ingestion/auto-loader/options.html#id20
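For comparison, an Auto Loader stream using these options might look like the following (a sketch, assuming a Databricks runtime with Auto Loader available and the same source_path as above; schema-related options such as cloudFiles.schemaLocation are omitted for brevity):
# Auto Loader stream over CSV files, filtering inputs by modification time
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("modifiedAfter", "2019-06-15T05:00:00")
      .option("modifiedBefore", "2020-06-15T05:00:00")
      .load(source_path))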
Suggested Example Usages
Start stream with all CSV files modified after date:
spark.readStream.option("modifiedAfter","2020-06-15T05:00:00").option("quote", '"').option("escape", '"').csv(source_path)
Start stream with all CSV files modified before date:
spark.readStream.option("modifiedBefore","2020-06-15T05:00:00").option("quote", '"').option("escape", '"').csv(source_path)
Start stream with all CSV files modified between two dates:
spark.readStream.option("modifiedAfter","2019-06-15T05:00:00").{{{}option("modifiedBefore","2020-06-15T05:00:00")option("quote", '"').option("escape", '"').csv(source_path)}}