Description
FileInputDStream.findNewFiles() does a globStatus() with a filter that calls getFileStatus() on every file, then takes the output and calls listStatus() on it.
This is going to suffer on object stores, where directory listing and getFileStatus() calls are so expensive. It's clear this is a problem, as the method has code to detect timeouts in the window and warn of problems.
It should be possible to make this faster.
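To illustrate the cost, here is a minimal sketch in Python against a toy in-memory store that counts each simulated remote call. It is not Spark or Hadoop code: `CountingStore`, `Status`, and both `find_new_files_*` functions are hypothetical names, and the glob is approximated by a single listing of the root. It assumes only that a directory listing already returns each entry's type and modification time, so the per-entry probe adds round trips without adding information.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Status:
    path: str
    is_dir: bool
    mtime: int  # modification time, arbitrary units


class CountingStore:
    """Toy in-memory store (hypothetical) that counts each simulated remote call."""

    def __init__(self, entries):
        self._entries = {s.path: s for s in entries}
        self.calls = Counter()

    def get_file_status(self, path):
        self.calls["get_file_status"] += 1
        return self._entries[path]

    def list_status(self, dir_path):
        # Returns full Status objects for the direct children of dir_path.
        self.calls["list_status"] += 1
        prefix = dir_path.rstrip("/") + "/"
        return [s for p, s in sorted(self._entries.items())
                if p.startswith(prefix) and "/" not in p[len(prefix):]]


def find_new_files_per_entry_probe(store, root, since):
    """Pattern described above: probe every match, then list each directory."""
    matches = [s.path for s in store.list_status(root)]  # stand-in for the glob
    dirs = [p for p in matches if store.get_file_status(p).is_dir]
    new_files = []
    for d in dirs:
        new_files += [c.path for c in store.list_status(d)
                      if not c.is_dir and c.mtime >= since]
    return new_files


def find_new_files_from_listing(store, root, since):
    """Alternative: reuse the statuses the listing already returned."""
    new_files = []
    for entry in store.list_status(root):
        if entry.is_dir:
            new_files += [c.path for c in store.list_status(entry.path)
                          if not c.is_dir and c.mtime >= since]
    return new_files
```

Both functions return the same files, but the second issues no `get_file_status` calls at all. On a real object store each avoided call is a network round trip, so the saving grows with the number of entries the glob matches.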
Issue Links
- is depended upon by
  - HADOOP-13525 Optimize uses of FS operations in the ASF analysis frameworks and libraries (Resolved)
- is related to
  - HADOOP-13946 Document how HDFS updates timestamps in the FS spec; compare with object stores (Resolved)
  - SPARK-20448 Document how FileInputDStream works with object storage (Resolved)
- relates to
  - SPARK-7481 Add spark-hadoop-cloud module to pull in object store support (Resolved)