Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
3.2.0
None
Description
FileStreamSink produces a metadata directory that logs the output files of each micro-batch. When we read from the output path, Spark consults this metadata and ignores any files not listed in the log.
Normally this works well. But in some use cases we may need to ignore the metadata when reading the output path. For example, when we change the streaming query and must run it with a new checkpoint directory, we cannot reuse the previous metadata. If we create new metadata as well, then when we later read the output path, Spark reads only the files listed in the new metadata. The files written before the switch to the new checkpoint and metadata are ignored.
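To make the problem concrete, here is a toy model in plain Python. It is not Spark code: `visible_files` and the file names are hypothetical, and the metadata log is reduced to a list of per-batch file lists. It only illustrates how files written under the old metadata drop out of sight once a fresh log is in place.

```python
# Toy model (not Spark's implementation): each micro-batch appends the
# files it wrote to a metadata log, and a reader trusts only that log.

def visible_files(all_files, metadata_log):
    """Return the files a metadata-aware reader would load."""
    logged = {path for batch in metadata_log for path in batch}
    return sorted(f for f in all_files if f in logged)

# The first run of the query wrote two batches under the old metadata.
old_log = [["part-0000.parquet"], ["part-0001.parquet"]]

# The query was changed and restarted with a new checkpoint and new metadata.
new_log = [["part-0002.parquet"]]

# All files still sit in the same output directory.
all_files = ["part-0000.parquet", "part-0001.parquet", "part-0002.parquet"]

seen = visible_files(all_files, new_log)
missed = sorted(set(all_files) - set(seen))
print("read:", seen)
print("ignored:", missed)
```

In this model the two files from the first run are still on disk, but the reader never loads them because only the new log is consulted.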
It seems we could write to a different output directory each time, but that is a bad idea because it would produce many directories unnecessarily.
It seems we need a config for ignoring the FileStreamSink metadata when reading the output path.
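A rough sketch of the behaviour such a config might select, again as a plain Python model rather than Spark code. The function `files_to_read` and its `ignore_metadata` flag are hypothetical names, and the log format (one JSON array of file names per batch) is a simplification assumed for illustration. Only the `_spark_metadata` directory name is taken from Spark's file sink.

```python
import json
import os
import tempfile

METADATA_DIR = "_spark_metadata"  # directory name used by Spark's file sink


def files_to_read(output_dir, ignore_metadata=False):
    """List the data files a reader would load (simplified model)."""
    meta = os.path.join(output_dir, METADATA_DIR)
    data_files = sorted(
        name for name in os.listdir(output_dir)
        if os.path.isfile(os.path.join(output_dir, name))
    )
    # With the flag set, or with no metadata at all, fall back to listing
    # the directory.
    if ignore_metadata or not os.path.isdir(meta):
        return data_files
    # Otherwise keep only files recorded in the (simplified) batch logs.
    logged = set()
    for batch in os.listdir(meta):
        with open(os.path.join(meta, batch)) as fh:
            logged.update(json.load(fh))
    return [name for name in data_files if name in logged]


def demo():
    with tempfile.TemporaryDirectory() as out:
        # Two files from the old query run, one from the new run.
        for name in ("old-0.parquet", "old-1.parquet", "new-0.parquet"):
            open(os.path.join(out, name), "w").close()
        # The new metadata lists only the new run's file.
        os.mkdir(os.path.join(out, METADATA_DIR))
        with open(os.path.join(out, METADATA_DIR, "0"), "w") as fh:
            json.dump(["new-0.parquet"], fh)
        return files_to_read(out), files_to_read(out, ignore_metadata=True)


with_log, without_log = demo()
print(with_log)
print(without_log)
```

With the flag off, only the file listed in the new log is returned. With it on, the reader treats the output path as an ordinary directory and picks up the older files too.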