Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
0.13.1, 0.14.1
Description
1. In insert mode, when a subtask restarts, the OperatorCoordinator may still be inside notifyCheckpointComplete for checkpoint 100 and remain there for a long time. Possible causes are a table service that is slow to scan HDFS, or slow HDFS operations during rollback and initInstant.
2. At this point ckp-meta/instantId.INFLIGHT has not been completed, although the corresponding commit file has already been written. The restarting subtask sends its bootstrap event during this window.
3. Once the OperatorCoordinator finishes processing notifyCheckpointComplete, it creates a new instant, and the subtask creates the corresponding parquet files, etc. based on that instant.
4. The OperatorCoordinator then processes the bootstrap event, creates another new instant, and rolls back the instant created in step 3. From this point on, the OperatorCoordinator and the operator are inconsistent.
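The ordering in steps 2 to 4 can be reproduced with a small toy model. This is only a sketch of the event order, not Hudi code: the class ToyCoordinator, its methods, and the instant names are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyCoordinator:
    """Hypothetical stand-in for the coordinator; not Hudi's real class."""
    next_id: int = 1
    current_instant: Optional[str] = None
    rolled_back: List[str] = field(default_factory=list)

    def _start_instant(self) -> str:
        instant = f"instant-{self.next_id}"
        self.next_id += 1
        self.current_instant = instant
        return instant

    def notify_checkpoint_complete(self) -> str:
        # Step 3: once the slow checkpoint handling finishes, a new instant is opened.
        return self._start_instant()

    def handle_bootstrap(self) -> str:
        # Step 4: the late bootstrap event rolls back the pending instant
        # and opens another one.
        if self.current_instant is not None:
            self.rolled_back.append(self.current_instant)
        return self._start_instant()


coordinator = ToyCoordinator()
# The bootstrap event was sent in step 2 but is only handled after step 3.
writer_instant = coordinator.notify_checkpoint_complete()  # subtask writes against this
coordinator.handle_bootstrap()

print(writer_instant, coordinator.current_instant, coordinator.rolled_back)
# prints: instant-1 instant-2 ['instant-1']
```

In this model the subtask is still writing files for instant-1 while the coordinator has already rolled it back and moved on to instant-2, which is the inconsistency described in step 4.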
This is related to Hudi's three-stage commit: taking the data snapshot, writing the commit file, and writing the ckp_meta file.
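To make the window explicit, the three stages can be modelled as an ordered enum. This is a minimal sketch under the assumption that the stages complete strictly in the order listed. Stage and restart_sees_mismatch are hypothetical names, not part of Hudi.

```python
from enum import Enum


class Stage(Enum):
    # Hypothetical labels for the three stages named in this issue.
    DATA_SNAPSHOT = "data snapshot"
    COMMIT_FILE = "commit file"
    CKP_META = "ckp_meta file"


def restart_sees_mismatch(done):
    """True when the commit file exists but the ckp_meta file is not yet written."""
    return Stage.COMMIT_FILE in done and Stage.CKP_META not in done


done = set()
window = {}
for stage in Stage:  # Enum iteration follows definition order
    done.add(stage)
    window[stage.name] = restart_sees_mismatch(done)

print(window)
# prints: {'DATA_SNAPSHOT': False, 'COMMIT_FILE': True, 'CKP_META': False}
```

Only the state after the commit file and before the ckp_meta file is flagged, which matches the situation in step 2 where the subtask restarts.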