  Flink / FLINK-36884

GCS 503 Error Codes for `flink-checkpoints/<id>/shared/` file after upload complete


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.18.0
    • Fix Version/s: None
    • Component/s: None
    • Environment: We are using Flink 1.18.0 with the gs-plugin.

      It is a rare bug, but one we have noticed multiple times.
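
      For context, the checkpoint setup we assume here is roughly the following (a simplified sketch, not our exact job; the bucket name and interval are placeholders, and the gs:// scheme requires the flink-gs-fs-hadoop plugin to be on the plugins path):

      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

      public class CheckpointSetupSketch {
          public static void main(String[] args) throws Exception {
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
              // Checkpoint periodically; 60s here is only an example value.
              env.enableCheckpointing(60_000L);
              // "<bucket>" is a placeholder; shared incremental state ends up under
              // gs://<bucket>/flink-checkpoints/<job-id>/shared/.
              env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/flink-checkpoints");
              // ... job graph definition and env.execute() omitted ...
          }
      }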

    Description

      We had a Flink pipeline whose checkpoints suddenly started to fail on a single subtask [Image 1]. This does not block checkpointing for the rest of the DAG, so the checkpoint barriers continue to flow.

      We investigated the issue and found that the checkpoint kept retrying the write; it retried writing the file thousands of times. The issue persisted across checkpoints and savepoints, but only this one specific file ever failed.

      An example log:

       

      Dec 10, 2024 6:06:05 PM com.google.cloud.hadoop.util.RetryHttpInitializer$LoggingResponseHandler handleResponse
      INFO: Encountered status code 503 when sending PUT request to URL 'https://storage.googleapis.com/upload/storage/v1/b/<bucket>/o?ifGenerationMatch=0&name=flink-checkpoints/2394318276860454f7b6d1689f770796/shared/7d6bb60b-e0cf-4873-afc1-f2d785a4418e&uploadType=resumable&upload_id=<upload_id>'. Delegating to response handler for possible retry.
      ...
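
      Note that the logged request is a resumable upload carrying ifGenerationMatch=0, i.e. a create-only precondition. Purely as an illustration of what that request shape looks like (this is not the connector's actual code path, and the bucket/object names are placeholders), the equivalent with the plain google-cloud-storage client would be roughly:

      import com.google.cloud.storage.BlobId;
      import com.google.cloud.storage.BlobInfo;
      import com.google.cloud.storage.Storage;
      import com.google.cloud.storage.StorageOptions;
      import java.nio.ByteBuffer;
      import java.nio.channels.WritableByteChannel;

      public class PreconditionedUploadSketch {
          public static void main(String[] args) throws Exception {
              Storage storage = StorageOptions.getDefaultInstance().getService();
              BlobId id = BlobId.of("<bucket>", "flink-checkpoints/<id>/shared/<state-file>"); // placeholders
              BlobInfo info = BlobInfo.newBuilder(id).build();
              // doesNotExist() maps to ifGenerationMatch=0, and writer() uses a resumable
              // upload session - the same shape of request as in the log above.
              try (WritableByteChannel channel =
                      storage.writer(info, Storage.BlobWriteOption.doesNotExist())) {
                  channel.write(ByteBuffer.wrap(new byte[]{1, 2, 3})); // dummy payload
              }
          }
      }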

       

      It is important to note that the file was in fact there. I am not sure whether it was complete, but it was not an .inprogress file, so I believe it was complete.
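
      A quick way to double-check that (a sketch only; bucket and object names are placeholders) is to look the object up directly and print its generation, since a non-zero generation means the object already exists and any upload preconditioned on ifGenerationMatch=0 for that name cannot be satisfied:

      import com.google.cloud.storage.Blob;
      import com.google.cloud.storage.BlobId;
      import com.google.cloud.storage.Storage;
      import com.google.cloud.storage.StorageOptions;

      public class SharedFileLookupSketch {
          public static void main(String[] args) {
              Storage storage = StorageOptions.getDefaultInstance().getService();
              Blob blob = storage.get(BlobId.of("<bucket>", "flink-checkpoints/<id>/shared/<state-file>"));
              if (blob == null) {
                  System.out.println("object does not exist");
              } else {
                  // The object is there; print its generation and size.
                  System.out.println("generation=" + blob.getGeneration() + " size=" + blob.getSize());
              }
          }
      }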

       

      I even tried deleting the file in GCS and waiting for a new checkpoint, but the same issue persisted.

       

      There was no issue when we restarted the job from a savepoint. The problem seems to be limited to this one very specific file.

       

      I also tried it locally. I got a 503 from this endpoint with the same upload_id:

      https://storage.googleapis.com/upload/storage/v1/<bucket>

      However, it worked fine with this API (with a new upload_id):

      https://storage.googleapis.com/<path>
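
      For reference, checking the status of the stuck resumable session locally can be done with something along these lines (a sketch only; BUCKET, OBJECT_NAME and UPLOAD_ID stand in for the values from the failing request, and I am assuming the standard status-check request, an empty PUT with Content-Range: bytes */*):

      import java.net.URI;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;

      public class ResumableSessionStatusSketch {
          public static void main(String[] args) throws Exception {
              // Placeholder for the full session URL from the log, including the upload_id.
              String sessionUrl = "https://storage.googleapis.com/upload/storage/v1/b/BUCKET/o"
                      + "?ifGenerationMatch=0&name=OBJECT_NAME&uploadType=resumable&upload_id=UPLOAD_ID";
              HttpRequest statusCheck = HttpRequest.newBuilder(URI.create(sessionUrl))
                      // An empty PUT with "bytes */*" asks GCS for the state of the session
                      // without sending any object data.
                      .header("Content-Range", "bytes */*")
                      .PUT(HttpRequest.BodyPublishers.noBody())
                      .build();
              HttpResponse<String> response = HttpClient.newHttpClient()
                      .send(statusCheck, HttpResponse.BodyHandlers.ofString());
              // 308 means the upload is incomplete, 200/201 means it was finalized;
              // in our case this endpoint kept answering 503.
              System.out.println(response.statusCode() + " " + response.body());
          }
      }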

      I could not find the merged file on the Task Manager to try from the pod when it was failing.

      Attachments

        Image 1 (referenced in the description)

        Activity


          People

            Assignee: Unassigned
            Reporter: Ryan van Huuksloot

            Dates

              Created:
              Updated:
