Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.15.3
Fix Version/s: None
Description
Scenario:
One of our users runs a Flink batch job (Flink version: 1.15) to compact one day's worth of small files.
They split the pipeline into 24 parts, one per hour, so the job has 24 sources.
I found that starting the SourceCoordinator of hdfsFileSource takes too much time when the JobMaster starts.
Root Cause:
After investigating, I found the root cause:
- AbstractFileSource calls enumerateSplits inside createEnumerator
- NonSplittingRecursiveEnumerator has to fetch the block locations of every file, which is a heavy I/O operation
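To illustrate the cost pattern described above, here is a minimal, self-contained model. It is not Flink code: FakeFileSystem, getBlockHosts, and Split are hypothetical stand-ins, and the counter only shows that an eager enumeration performs one location lookup per file before the enumerator is ready. On a real HDFS cluster each such lookup would presumably be a remote call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EagerEnumerationSketch {

    /** Hypothetical stand-in for a remote file system; counts block-location lookups. */
    public static final class FakeFileSystem {
        public int locationCalls = 0;

        public String[] getBlockHosts(String file) {
            locationCalls++; // assumption: on a real cluster this is a remote round trip
            return new String[] {"datanode-1"};
        }
    }

    /** Simplified split: a file plus the hosts that store its blocks. */
    public static final class Split {
        public final String file;
        public final String[] hosts;

        public Split(String file, String[] hosts) {
            this.file = file;
            this.hosts = hosts;
        }
    }

    /** Eager enumeration: every file triggers one location lookup on the calling thread. */
    public static List<Split> enumerateSplits(FakeFileSystem fs, List<String> files) {
        List<Split> splits = new ArrayList<>();
        for (String file : files) {
            splits.add(new Split(file, fs.getBlockHosts(file)));
        }
        return splits;
    }

    public static void main(String[] args) {
        FakeFileSystem fs = new FakeFileSystem();
        List<Split> splits = enumerateSplits(fs, Arrays.asList("part-0", "part-1", "part-2"));
        System.out.println(splits.size() + " splits, " + fs.locationCalls + " location lookups");
    }
}
```

With many small files per source and 24 sources, the number of lookups done during startup grows with the total file count, which matches the slow SourceCoordinator start reported here.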
Suggestion:
- Add an option to FileSource to disable the location fetcher
- Move the location fetching into the IOExecutor
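The two suggestions could look roughly like the sketch below. This is a plain-Java illustration under stated assumptions, not a patch against Flink: the fetchLocations flag, the enumerate and enumerateAsync methods, and the Split class are all hypothetical names, and the ExecutorService merely stands in for the JobMaster's I/O executor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

public class LocationFetchSketch {

    /** Simplified split; an empty hosts array means "no locality preference". */
    public static final class Split {
        public final String file;
        public final String[] hosts;

        public Split(String file, String[] hosts) {
            this.file = file;
            this.hosts = hosts;
        }
    }

    /** Suggestion 1 (hypothetical flag): skip the location lookup when fetchLocations is false. */
    public static List<Split> enumerate(
            List<String> files, Function<String, String[]> locationFetcher, boolean fetchLocations) {
        List<Split> splits = new ArrayList<>();
        for (String file : files) {
            String[] hosts = fetchLocations ? locationFetcher.apply(file) : new String[0];
            splits.add(new Split(file, hosts));
        }
        return splits;
    }

    /** Suggestion 2: run the lookups on an I/O executor so the caller is not blocked. */
    public static CompletableFuture<List<Split>> enumerateAsync(
            List<String> files, Function<String, String[]> locationFetcher, ExecutorService ioExecutor) {
        return CompletableFuture.supplyAsync(() -> enumerate(files, locationFetcher, true), ioExecutor);
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("part-0", "part-1", "part-2");
        // Stand-in for the expensive block-location call.
        Function<String, String[]> fetcher = file -> new String[] {"datanode-1"};

        List<Split> withoutHosts = enumerate(files, fetcher, false);
        System.out.println("hosts fetched with flag off: " + withoutHosts.get(0).hosts.length);

        ExecutorService ioExecutor = Executors.newFixedThreadPool(2);
        try {
            List<Split> splits = enumerateAsync(files, fetcher, ioExecutor).join();
            System.out.println("splits enumerated asynchronously: " + splits.size());
        } finally {
            ioExecutor.shutdown();
        }
    }
}
```

Disabling the lookup trades locality-aware split assignment for faster startup, while the asynchronous variant keeps locality information but moves the I/O off the thread that starts the coordinator.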