[SPARK-8605] Exclude files in StreamingContext.textFileStream(directory) - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: DStreams
Labels:
- streaming
- streaming_api

Description

Currently, Spark Streaming can monitor a directory and process newly added files. This causes a bug when the files copied into the directory are large. For example, in HDFS, while a file is being copied its name is file_name._COPYING_. Spark picks up that in-progress file and starts processing it. However, once the copy finishes, the file is renamed to file_name, so the path Spark picked up no longer exists. This causes a FileDoesNotExist error. It would be great if we could exclude files in the directory using a regex.
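
The requested behaviour can be sketched as a name-based predicate, independent of Spark itself. This is a minimal illustration, not Spark's API: `EXCLUDE`, `should_process`, and `select_files` are hypothetical names, and the pattern (in-flight copy suffix plus hidden or underscore-prefixed names) is an assumption about what a user would want to skip. In the Scala and Java APIs, `StreamingContext.fileStream` accepts a filter function on paths, so a predicate of this shape could plausibly be expressed there.

```python
import re

# Hypothetical exclusion pattern (an assumption, not a Spark default):
#   - names ending in "._COPYING_" (HDFS's temporary name during a copy);
#     the underscores are optional so ".COPYING" is also matched
#   - names starting with "." or "_" (hidden or marker files)
EXCLUDE = re.compile(r"[._].*|.*\._?COPYING_?")


def should_process(path: str) -> bool:
    """Return True if the file at `path` is not excluded by EXCLUDE."""
    # Match against the final path component only, not the whole URI.
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return EXCLUDE.fullmatch(name) is None


def select_files(paths):
    """Keep only the paths a directory monitor should hand downstream."""
    return [p for p in paths if should_process(p)]
```

For instance, `select_files(["hdfs://nn/in/file_name", "hdfs://nn/in/file_name._COPYING_"])` keeps only the first path, so the in-progress copy is never picked up and the later rename cannot invalidate it.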