[ZOOKEEPER-4306] CloseSessionTxn contains too many ephemeral nodes cause cluster crash - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 3.6.2
Fix Version/s: None
Component/s: None
Labels:
- pull-request-available

Description

We ran a test to find out how many ephemeral nodes a client can create under one parent node with the default configuration. The test eventually crashed the cluster, with exception stack traces like the following.

follower:

leader:

It seems that the leader sent an oversized txn packet to the followers. When a follower tried to deserialize the txn, it found that the txn length exceeded its buffer size (by default 1MB + 1MB, jute.maxbuffer + jute.maxbuffer.extrasize). That crashed the followers. The leader then found that too few followers were synced, so it shut down. During shutdown the leader called zkDb.fastForwardDataBase(), found that the txn read from the txnlog exceeded its buffer size, and crashed too.
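The length check described above can be sketched roughly as follows. This is only an illustration, not the ZooKeeper source: the function name `check_length` is hypothetical, and both default values are assumptions taken from the "1MB + 1MB" figure in this report.

```python
# Assumed defaults, following the "1MB + 1MB" description in this report.
JUTE_MAXBUFFER = 0xFFFFF            # assumption: default jute.maxbuffer, in bytes
JUTE_MAXBUFFER_EXTRASIZE = 0xFFFFF  # assumption: extra size defaults to the same value


def check_length(length, max_buffer=JUTE_MAXBUFFER, extra=JUTE_MAXBUFFER_EXTRASIZE):
    """Reject a length-prefixed record that exceeds the reader's buffer limit.

    Hypothetical helper: it models the check a follower would apply before
    deserializing a txn, raising instead of allocating an oversized buffer.
    """
    if length < 0 or length > max_buffer + extra:
        raise IOError(f"Unreasonable length = {length}")
    return length
```

Under these assumptions, any txn whose serialized length exceeds roughly 2 MB would be rejected by every server that reads it, whether from the network or from the txnlog.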

After the servers crashed, they tried to restart the quorum, but they could not succeed because the last txn was too large. We lost the log from that moment, but the stack trace was the same as this one.

Root Cause

We used org.apache.zookeeper.server.LogFormatter (with -Djute.maxbuffer=74827780) to visualize this log and found this. So the closeSessionTxn contains all ephemeral nodes with their absolute paths. We know that creating too many child nodes under one parent node produces a large getChildren response, which is limited by the client's jute.maxbuffer. If one session creates plenty of ephemeral nodes under different parent nodes, the client may never exceed its buffer, but when the session closes without deleting these nodes first, it will probably crash the cluster.
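To get a feel for how quickly such a txn grows, here is a rough size estimate in Python. It assumes the path list is encoded as a 4-byte count followed by, for each path, a 4-byte length and the UTF-8 bytes; that layout is an assumption about the binary encoding, and the function name and sample paths are hypothetical.

```python
import struct


def encode_string_vector(paths):
    """Encode a list of strings as: 4-byte count, then per string a
    4-byte big-endian length followed by its UTF-8 bytes.

    Assumed layout, used here only to estimate the payload size of a
    close-session txn that lists every ephemeral path of the session.
    """
    out = bytearray(struct.pack(">i", len(paths)))
    for path in paths:
        data = path.encode("utf-8")
        out += struct.pack(">i", len(data))
        out += data
    return bytes(out)


# Hypothetical workload: one session, ephemeral nodes spread over 100 parents.
paths = [f"/parent-{i % 100}/ephemeral-{i:010d}" for i in range(50_000)]
payload = encode_string_vector(paths)
print(f"{len(paths)} paths -> {len(payload)} bytes")
```

The point of the sketch is that the payload grows linearly with the number of ephemeral nodes owned by the session, regardless of how they are spread across parent nodes.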

Is it a bug or just an unspecified feature? If the latter, how should we judge the upper limit on the number of nodes to create?
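As a starting point for that question, a back-of-the-envelope bound could look like the sketch below. Everything in it is an assumption: the roughly 2 MB limit comes from the defaults quoted above, the per-path cost assumes a 4-byte length prefix plus the UTF-8 bytes, and `overhead_bytes` is a hypothetical allowance for headers rather than a measured value.

```python
def max_ephemeral_paths(avg_path_bytes, limit_bytes=2 * 0xFFFFF, overhead_bytes=64):
    """Estimate how many ephemeral paths one session can own before its
    close-session txn would exceed the assumed buffer limit.

    avg_path_bytes: average UTF-8 length of an absolute node path.
    limit_bytes:    assumed jute.maxbuffer + extra size (defaults as quoted in this report).
    overhead_bytes: hypothetical allowance for txn headers and framing.
    """
    per_path = 4 + avg_path_bytes               # assumed 4-byte length prefix per path
    usable = limit_bytes - overhead_bytes - 4   # minus the assumed 4-byte element count
    return max(0, usable // per_path)


# Example: absolute paths of about 36 bytes each.
print(max_ephemeral_paths(36))
```

Such an estimate is only a rule of thumb; the actual limit depends on the configured buffer sizes and on the real path lengths.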

Attachments

- cs.jpg (28/May/21 02:53, 553 kB, Lin Changrui)
- f.jpg (28/May/21 02:31, 55 kB, Lin Changrui)
- l1.png (28/May/21 02:31, 9 kB, Lin Changrui)
- l2.jpg (28/May/21 02:32, 108 kB, Lin Changrui)
- r.jpg (28/May/21 02:40, 317 kB, Lin Changrui)

Issue Links

is caused by

- ZOOKEEPER-3145 Potential watch missing issue due to stale pzxid when replaying CloseSession txn with fuzzy snapshot (Resolved)

is related to

- ZOOKEEPER-4874 Avoid using fuzzy snapshots. (Open)

links to

- GitHub Pull Request #1716
- GitHub Pull Request #2201

People

Assignee: Unassigned

Reporter: Lin Changrui

Votes: 5

Watchers: 10

Dates

Created: 28/May/21 03:09

Updated: 07/Nov/24 06:50

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2h 10m