[FLINK-36701] Pipeline failover again after handling a schema change event as the first event after a failover - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: cdc-3.2.0, cdc-3.3.0
Fix Version/s: cdc-3.3.0
Component/s: Flink CDC
Labels:
- pull-request-available

Description

Currently, if the first event a pipeline handles directly after a failover is a schema change event (e.g. AddColumnEvent) and it is followed by a DataChangeEvent, the job may fail again because the sink applies that schema change a second time.

The cause of the problem can be explained as follows:
1. SinkWriterOperator now requests the latest schema when it receives the first non-CreateTableEvent schema change event (assuming there is no schema in its local cache).
2. The schema manager applies the schema change after confirming that the flush succeeded.
3. Assume that, after a failover, a schema change event is processed first, followed by a data change event.
On the schema manager side, the schema change event is applied to the cached schema (i.e. evolvedSchema) once a successful flush has been confirmed.
On the SinkWriterOperator side, the processing flow is:
1) Handle the FlushEvent;
2) Handle the schema change event: the latest schema is fetched from the schema manager and sent downstream, and then the schema change event itself is emitted. Because the fetched schema already reflects the change, the sink ends up applying it twice. Note that this step does not report an error;
3) Handle the data change event. Here the failover occurs because the number of columns in the data record does not match the schema.
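
The sequence above can be replayed with a small toy model. This is only a sketch of one reading of the description, not Flink CDC code: the classes `SchemaManager`, `SinkWriter` and `AddColumnEvent` below, their methods, and the `id`/`name`/`age` columns are all hypothetical stand-ins, and schemas are modelled as plain lists of column names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AddColumnEvent:
    """Hypothetical stand-in for a non-CreateTableEvent schema change event."""
    column: str


@dataclass
class SchemaManager:
    """Hypothetical stand-in for the schema manager and its cached evolved schema."""
    evolved_schema: list

    def on_flush_success(self, event: AddColumnEvent) -> None:
        # Step 2 of the description: apply the change once the flush is confirmed.
        self.evolved_schema = self.evolved_schema + [event.column]

    def latest_schema(self) -> list:
        return list(self.evolved_schema)


@dataclass
class SinkWriter:
    """Hypothetical stand-in for SinkWriterOperator plus the sink's view of the schema."""
    manager: SchemaManager
    schema: Optional[list] = None  # the local cache is empty right after a failover

    def handle_schema_change(self, event: AddColumnEvent) -> None:
        if self.schema is None:
            # Step 1: fetch the latest schema on the first schema change event.
            self.schema = self.manager.latest_schema()
        # The event itself is then emitted and applied on top of the fetched schema.
        self.schema = self.schema + [event.column]

    def handle_data_change(self, record: tuple) -> None:
        if len(record) != len(self.schema):
            raise ValueError(
                f"record has {len(record)} columns, schema has {len(self.schema)}"
            )


def replay_after_failover():
    """Replay: flush confirmed, then schema change event, then data change event."""
    manager = SchemaManager(evolved_schema=["id", "name"])
    writer = SinkWriter(manager)
    event = AddColumnEvent("age")

    manager.on_flush_success(event)     # the manager's schema now has 3 columns
    writer.handle_schema_change(event)  # fetches 3 columns, then adds "age" again
    try:
        writer.handle_data_change((1, "alice", 30))
    except ValueError as exc:
        return writer.schema, str(exc)
    return writer.schema, None
```

In this model the schema change step completes silently, and only the following data record exposes the mismatch, which matches the order of symptoms reported above.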