[FLINK-35739] FLIP-444: Native file copy support - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0-preview
Component/s: Connectors / FileSystem, Runtime / State Backends
Labels:
- pull-request-available

Release Note:

When using RocksDB, users can now configure Flink to use s5cmd to download files from S3 during the recovery process, which speeds up the download by a factor of 2. (link the docs)

Description

https://cwiki.apache.org/confluence/display/FLINK/FLIP-444%3A+Native+file+copy+support

State downloading in Flink can be a time- and CPU-consuming operation, which is especially visible when CPU resources per task slot are strictly limited, for example to a single CPU. Downloading 1 GB of state can take a significant amount of time, and the code that does so is quite inefficient.

Currently, when downloading state files, Flink creates an FSDataInputStream from the remote file and copies its bytes to an OutputStream pointing to a local file (in the RocksDBStateDownloader#downloadDataForStateHandle method). The FSDataInputStream is internally wrapped in many layers of abstraction and indirection and, what's worse, every file is copied individually, which leads to quite high overhead for small files. Download times and the CPU efficiency of the download process can be significantly improved if we introduce an API that allows org.apache.flink.core.fs.FileSystem to copy many files natively and all at once.
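To make the idea of a batch copy API concrete, here is a minimal sketch of what its shape could look like. All names in it (BatchCopySketch, CopyRequest, BatchCopier, LocalBatchCopier) are hypothetical and chosen for illustration only; this is not the interface defined by the FLIP. The local implementation is just a stand-in so the sketch is runnable; the point is that the caller hands over the full list of files in one call, which lets the file system pick a native, bulk strategy.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

class BatchCopySketch {

    /** One source-to-destination pair (hypothetical type). */
    static final class CopyRequest {
        final Path source;
        final Path destination;

        CopyRequest(Path source, Path destination) {
            this.source = source;
            this.destination = destination;
        }
    }

    /** Hypothetical batch API: all requests are passed in a single call. */
    interface BatchCopier {
        void copyFiles(List<CopyRequest> requests) throws IOException;
    }

    /**
     * Stand-in implementation for local paths. A real implementation
     * would hand the whole list to a native tool or SDK instead of
     * looping over the files itself.
     */
    static final class LocalBatchCopier implements BatchCopier {
        @Override
        public void copyFiles(List<CopyRequest> requests) throws IOException {
            for (CopyRequest request : requests) {
                Path parent = request.destination.getParent();
                if (parent != null) {
                    // Make sure the target directory exists before copying.
                    Files.createDirectories(parent);
                }
                Files.copy(
                        request.source,
                        request.destination,
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}
```

A caller such as a state downloader would collect one CopyRequest per state file and invoke copyFiles once, instead of opening one stream per file.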

For S3, there are at least two potential implementations. The first is to use AWS SDKv2 directly (Flink currently uses AWS SDKv1 wrapped by hadoop/presto) together with the Amazon S3 Transfer Manager. The second option is to use a third-party tool called s5cmd. It is claimed to be a faster alternative to the official AWS clients, which our benchmarks confirmed.
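As a rough illustration of the s5cmd option, the sketch below builds the input for a single batched s5cmd invocation: one `cp` line per file, executed through the tool's `run` subcommand. The class and method names are hypothetical, the bucket and paths are made up, and the sketch assumes paths without whitespace. How Flink actually invokes s5cmd is not described in this issue, so treat this only as one plausible shape.

```java
import java.util.List;
import java.util.Map;

class S5cmdBatchSketch {

    /**
     * Builds the text of a command file for `s5cmd run`:
     * one `cp <remote> <local>` line per file.
     * Assumes keys and paths contain no whitespace (no quoting is done).
     */
    static String buildCommandFile(Map<String, String> remoteToLocal) {
        StringBuilder commands = new StringBuilder();
        for (Map.Entry<String, String> entry : remoteToLocal.entrySet()) {
            commands.append("cp ")
                    .append(entry.getKey())
                    .append(' ')
                    .append(entry.getValue())
                    .append('\n');
        }
        return commands.toString();
    }

    /**
     * Arguments for launching s5cmd on a previously written command file,
     * for example via ProcessBuilder. The binary path is caller-supplied.
     */
    static List<String> buildProcessArgs(String s5cmdBinary, String commandFilePath) {
        return List.of(s5cmdBinary, "run", commandFilePath);
    }
}
```

Batching all copies into one process start is what avoids the per-file overhead described above; the process itself would still need to be interrupted on cancellation and limited in its resource usage, as the sub-tasks of this issue note.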

Issue Links

links to

GitHub Pull Request #25028

Sub-Tasks

1.	Provide native file copy support for S3 using s5cmd	Closed	Piotr Nowojski
2.	Use native file copy in RocksDBStateDownloader	Closed	Piotr Nowojski
3.	Interrupt s5cmd call on cancellation	Closed	Piotr Nowojski
4.	Limit s5cmd resource usage	Closed	Piotr Nowojski
5.	Deprecate/remove DuplicatingFileSystem	Closed	Piotr Nowojski
6.	Document s5cmd	Closed	Piotr Nowojski
7.	Wait for state download on cancellation to enforce cleanup	Closed	Piotr Nowojski

People

Assignee: Piotr Nowojski

Reporter: Piotr Nowojski

Votes: 0

Watchers: 4

Dates

Created: 02/Jul/24 08:37

Updated: 02/Sep/24 09:35

Resolved: 28/Aug/24 08:23