[SPARK-17159] Improve FileInputDStream.findNewFiles list performance - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 3.0.0
Component/s: DStreams
Labels: None
Environment: Spark against object stores

Description

FileInputDStream.findNewFiles() calls globStatus() with a filter that calls getFileStatus() on every file, then takes the output and calls listStatus() on it.

This is going to suffer on object stores, where directory listing and getFileStatus() calls are so expensive. It's clear this is a problem: the method already has code to detect timeouts in the window and warn of problems.

It should be possible to make this faster.
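To make the cost concrete, here is a toy model of the call pattern described above. It is a sketch under stated assumptions, not Spark's code and not the fix that was eventually merged. `CountingStore`, `make_store`, and both `find_new_files_*` functions are hypothetical names invented for illustration. The model assumes each `list_status` or `get_file_status` call stands for one remote request, and that a listing already returns status entries carrying the directory flag and modification time.

```python
from dataclasses import dataclass, field


@dataclass
class FileStatus:
    path: str
    is_dir: bool
    mtime: int


@dataclass
class CountingStore:
    """Hypothetical in-memory store that counts simulated remote calls."""
    entries: dict  # path -> FileStatus
    calls: dict = field(default_factory=lambda: {"list": 0, "stat": 0})

    def list_status(self, directory):
        # One simulated remote call returning the direct children.
        self.calls["list"] += 1
        prefix = directory.rstrip("/") + "/"
        return [s for p, s in self.entries.items()
                if p.startswith(prefix) and "/" not in p[len(prefix):]]

    def get_file_status(self, path):
        # One simulated remote call per probed path.
        self.calls["stat"] += 1
        return self.entries[path]


def make_store(n_dirs, files_per_dir):
    """Build /in/d<i>/f<j>, where file f<j> has modification time j."""
    entries = {}
    for d in range(n_dirs):
        dpath = f"/in/d{d}"
        entries[dpath] = FileStatus(dpath, True, 0)
        for f in range(files_per_dir):
            fpath = f"{dpath}/f{f}"
            entries[fpath] = FileStatus(fpath, False, f)
    return CountingStore(entries)


def find_new_files_per_path(store, root, since):
    """Pattern from the description: filters re-probe every listed path."""
    dirs = [s.path for s in store.list_status(root)
            if store.get_file_status(s.path).is_dir]
    new_files = []
    for d in dirs:
        for s in store.list_status(d):
            st = store.get_file_status(s.path)  # extra call per file
            if not st.is_dir and st.mtime >= since:
                new_files.append(s.path)
    return sorted(new_files)


def find_new_files_from_listing(store, root, since):
    """Alternative sketch: reuse the status objects the listing returned."""
    dirs = [s.path for s in store.list_status(root) if s.is_dir]
    new_files = []
    for d in dirs:
        for s in store.list_status(d):
            if not s.is_dir and s.mtime >= since:
                new_files.append(s.path)
    return sorted(new_files)


if __name__ == "__main__":
    for fn in (find_new_files_per_path, find_new_files_from_listing):
        store = make_store(n_dirs=3, files_per_dir=4)
        found = fn(store, "/in", since=2)
        print(fn.__name__, len(found), store.calls)
```

In this model both functions return the same files, but the per-path variant adds one status probe per directory and one per file on top of the listings. On a store where each probe is a network round trip, that overhead grows linearly with the number of files scanned in every batch window.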

Issue Links

is depended upon by

HADOOP-13525 Optimize uses of FS operations in the ASF analysis frameworks and libraries

Resolved

is related to

HADOOP-13946 Document how HDFS updates timestamps in the FS spec; compare with object stores

Resolved

SPARK-20448 Document how FileInputDStream works with object storage

Resolved

relates to

SPARK-7481 Add spark-hadoop-cloud module to pull in object store support

Resolved

links to

[Github] Pull Request #14731 (steveloughran)

[Github] Pull Request #17745 (steveloughran)

[Github] Pull Request #22339 (ScrapCodes)

People

Assignee: Steve Loughran

Reporter: Steve Loughran

Votes: 0

Watchers: 7

Dates

Created: 19/Aug/16 19:04

Updated: 05/Oct/18 01:22

Resolved: 05/Oct/18 01:22