[FLINK-25335] HiveSourceFileEnumerator should fetch splits asynchronously - ASF JIRA

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.12.1, 1.14.2
Fix Version/s: None
Component/s: Connectors / Hive, Runtime / Coordination
Labels:
- pull-request-available

Description

When an OLAP query is submitted by the Flink client to a Flink session cluster, the JobMaster starts scheduling and enumerates the Hive source splits with `HiveSourceFileEnumerator`, and only then deploys the query tasks and executes them. If the source table has many partitions and the partition files are large, split enumeration takes a long time, which blocks task deployment and execution for just as long, and the dashboard cannot be displayed in the meantime.

It would be better to enumerate the Hive splits asynchronously while the query tasks are being deployed and executed. Once deployment has finished, the source operators fetch splits and read data while split enumeration continues in the background.
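
A minimal sketch of the proposed behaviour, using only the JDK and no Flink or Hive APIs. All names here (`AsyncSplitEnumerationSketch`, `listSplits`, `enumerateAsync`, `readAll`, the `END_OF_SPLITS` sentinel, the partition names) are hypothetical and are not taken from `HiveSourceFileEnumerator`. It only illustrates the idea that enumeration runs on a background thread and hands splits over through a queue, so the caller can go on to deployment and start reading before enumeration has finished.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncSplitEnumerationSketch {

    // Hypothetical sentinel that tells readers no more splits will arrive.
    static final String END_OF_SPLITS = "<end-of-splits>";

    // Stand-in for the slow listing of one partition's files (assumption:
    // the real cost is in file system and metastore calls, not shown here).
    static List<String> listSplits(String partition) {
        List<String> splits = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            splits.add(partition + "/file-" + i);
        }
        return splits;
    }

    // Starts enumeration on the given executor and returns immediately.
    // Splits are published partition by partition as they are discovered.
    static CompletableFuture<Void> enumerateAsync(
            List<String> partitions, BlockingQueue<String> queue, ExecutorService executor) {
        return CompletableFuture.runAsync(() -> {
            try {
                for (String partition : partitions) {
                    queue.addAll(listSplits(partition));
                }
            } finally {
                // Always signal completion so readers do not block forever.
                queue.add(END_OF_SPLITS);
            }
        }, executor);
    }

    // Reader side: consumes splits as they become available, until the sentinel.
    static List<String> readAll(BlockingQueue<String> queue) {
        List<String> consumed = new ArrayList<>();
        try {
            String split;
            while (!(split = queue.take()).equals(END_OF_SPLITS)) {
                consumed.add(split);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for splits", e);
        }
        return consumed;
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        CompletableFuture<Void> enumeration = enumerateAsync(
                Arrays.asList("dt=2021-12-01", "dt=2021-12-02"), queue, executor);

        // Task deployment would proceed here without waiting for enumeration.
        List<String> consumed = readAll(queue);

        enumeration.join(); // surfaces any enumeration failure
        executor.shutdown();
        System.out.println(consumed.size() + " splits read"); // prints "4 splits read"
    }
}
```

The sentinel is published in a `finally` block so that a failure during enumeration does not leave readers blocked; the failure itself is reported when the future is joined. A real change would need to route splits through the source coordinator rather than a shared in-memory queue.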