[KAFKA-15414] remote logs get deleted after partition reassignment - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.6.0
Component/s: None
Labels: None

Description

It seems I'm reaching that codepath when running reassignments on my cluster, and segments are deleted from the remote store despite a huge retention (the topic was created a few hours ago with 1000h retention).
It happens consistently on some partitions when reassigning, but not on all partitions.

My test:

I have a test topic with 30 partitions, configured with 1000h global retention and 2 minutes local retention.
I have a load tester producing to all partitions evenly.
I have a consumer load tester consuming that topic.
I regularly reset offsets to earliest on my consumer to test backfilling from tiered storage.
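
For reference, the retention settings described above would correspond roughly to the topic-level overrides below. This is a minimal sketch, not the exact setup used: the class and method names are hypothetical, it assumes "global retention" means `retention.ms` and "local retention" means `local.retention.ms`, and it assumes `remote.storage.enable=true` since the topic uses tiered storage.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds topic-level overrides matching the described test topic.
class TieredTopicConfig {
    // Partition count from the description.
    static final int PARTITIONS = 30;

    static Map<String, String> build() {
        Map<String, String> configs = new LinkedHashMap<>();
        // Assumption: tiered storage is enabled on the topic.
        configs.put("remote.storage.enable", "true");
        // 1000h total retention, expressed in milliseconds.
        configs.put("retention.ms", Long.toString(Duration.ofHours(1000).toMillis()));
        // 2 minutes of retention on local disk before only the remote copy remains.
        configs.put("local.retention.ms", Long.toString(Duration.ofMinutes(2).toMillis()));
        return configs;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With such a small `local.retention.ms`, almost the whole backlog lives only in the remote store, which is why a deletion there shows up directly as lost consumer backlog.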

My consumer was catching up consuming the backlog and I wanted to upscale my cluster to speed up recovery: I upscaled my cluster from 3 to 12 brokers and reassigned my test topic to all available brokers to have an even leader/follower count per broker.

When I triggered the reassignment, the consumer lag dropped on some of my topic partitions:
[Screenshot 2023-08-28 at 20 57 09]

Later I tried to reassign my topic back to 3 brokers and the issue happened again.

Both times, I've seen a bunch of log entries like:

[RemoteLogManager=10005 partition=uR3O_hk3QRqsn4mPXGFoOw:loadtest11-17] Deleted remote log segment RemoteLogSegmentId{topicIdPartition=uR3O_hk3QRqsn4mPXGFoOw:loadtest11-17, id=Mk0chBQrTyKETTawIulQog} due to leader epoch cache truncation. Current earliest epoch: EpochEntry(epoch=14, startOffset=46776780), segmentEndOffset: 46437796 and segmentEpochs: [10]
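
Judging only from the log message, the deletion appears to trigger when every leader epoch recorded for a segment is older than the earliest entry in the local leader epoch cache. The sketch below illustrates that reading with the values from the log line; the class and method names are hypothetical, and this is an assumption about the check, not the actual RemoteLogManager code.

```java
import java.util.List;

// Hypothetical illustration of the check suggested by the log message.
class EpochTruncationCheck {
    // True when all of the segment's leader epochs precede the earliest cached epoch.
    // Note: an empty epoch list also yields true under allMatch.
    static boolean wouldDelete(List<Integer> segmentEpochs, int earliestEpoch) {
        return segmentEpochs.stream().allMatch(epoch -> epoch < earliestEpoch);
    }

    public static void main(String[] args) {
        // Values from the log line: segmentEpochs=[10], earliest epoch=14.
        System.out.println(wouldDelete(List.of(10), 14));
    }
}
```

Under this reading, a broker whose leader epoch cache starts at epoch 14 (for example, a newly assigned replica that only has recent local data) would treat the epoch-10 segment as unreferenced, regardless of the topic's retention.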

Looking at my S3 bucket, the segments prior to my reassignment have indeed been deleted.