[ZOOKEEPER-4394] Learner.syncWithLeader got NullPointerException - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.7.0
Fix Version/s: 3.10.0, 3.9.3
Component/s: server
Labels:
- pull-request-available
Environment:

ZooKeeper 3.7.0

Language:
- Java

Description

A ZooKeeper follower node encountered a NullPointerException during syncWithLeader.

Logs indicate that the follower received a NEWLEADER packet between a PROPOSAL packet and its corresponding COMMIT packet. The NEWLEADER packet triggers packetsNotCommitted.clear(), yet the COMMIT handler still calls packetsNotCommitted.peekFirst() to fetch the earlier PROPOSAL packet. That call returns null on the now-empty queue, so the if-statement that follows throws the NPE.

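case Leader.COMMIT:
case Leader.COMMITANDACTIVATE:
    // peekFirst() returns null if NEWLEADER has already cleared the queue
    pif = packetsNotCommitted.peekFirst();
    // NPE is thrown here when pif is null
    if (pif.hdr.getZxid() == qp.getZxid() && qp.getType() == Leader.COMMITANDACTIVATE) {
        // ...
    }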
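The failure mode can be reproduced in isolation. The sketch below is not ZooKeeper code: `Hdr`, `PacketInFlight`, and `replay` are hypothetical stand-ins that only mimic the queue handling described above, assuming the queue behaves like a `java.util.Deque`, whose `peekFirst()` returns null when empty.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class CommitAfterNewLeaderSketch {
    // Hypothetical stand-in for the transaction header carried by a proposal.
    static final class Hdr {
        final long zxid;
        Hdr(long zxid) { this.zxid = zxid; }
        long getZxid() { return zxid; }
    }

    // Hypothetical stand-in for the follower's in-flight packet record.
    static final class PacketInFlight {
        Hdr hdr;
    }

    /** Replays PROPOSAL, NEWLEADER, COMMIT for one zxid and reports the outcome. */
    static String replay(long zxid) {
        Deque<PacketInFlight> packetsNotCommitted = new ArrayDeque<>();

        // PROPOSAL: remember the packet until its COMMIT arrives.
        PacketInFlight proposal = new PacketInFlight();
        proposal.hdr = new Hdr(zxid);
        packetsNotCommitted.add(proposal);

        // NEWLEADER: the pending queue is cleared.
        packetsNotCommitted.clear();

        // COMMIT: the handler looks up the proposal it expects to find.
        PacketInFlight pif = packetsNotCommitted.peekFirst(); // null on an empty deque
        try {
            return pif.hdr.getZxid() == zxid ? "matched" : "mismatched";
        } catch (NullPointerException e) {
            return "NPE";
        }
    }

    public static void main(String[] args) {
        System.out.println(replay(0x100000001L)); // prints "NPE"
    }
}
```

Swapping the NEWLEADER and COMMIT steps in `replay` makes the lookup succeed, which is why the ordering of the three packets matters.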

After looking into the Leader side, I found:

LearnerHandler.syncFollower queues packets with zxid <= maxCommittedLog (PROPOSAL/COMMIT pairs);
Leader.startForwarding queues toBeApplied packets (PROPOSAL/COMMIT pairs);
Leader.startForwarding queues outstandingProposals packets (PROPOSAL only);
LearnerHandler.run sends the NEWLEADER message.

It seems that if outstandingProposals is not empty at that moment, the follower can receive the PROPOSAL, NEWLEADER, and COMMIT packets in that order.
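The ordering described above can be modelled with plain lists. This is only an illustrative sketch: `queuedPackets` and its parameters are hypothetical names, not ZooKeeper APIs, and it assumes that the COMMIT for each outstanding proposal is sent only after NEWLEADER has been queued.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LeaderQueueOrderSketch {
    /** Returns the packet sequence a follower would see under the assumed ordering. */
    static List<String> queuedPackets(List<Long> committedLog,
                                      List<Long> toBeApplied,
                                      List<Long> outstandingProposals) {
        List<String> sent = new ArrayList<>();
        // Step 1: committed log entries go out as PROPOSAL/COMMIT pairs.
        for (long zxid : committedLog) {
            sent.add("PROPOSAL " + zxid);
            sent.add("COMMIT " + zxid);
        }
        // Step 2: toBeApplied entries also go out as PROPOSAL/COMMIT pairs.
        for (long zxid : toBeApplied) {
            sent.add("PROPOSAL " + zxid);
            sent.add("COMMIT " + zxid);
        }
        // Step 3: outstanding proposals are queued as PROPOSAL only.
        for (long zxid : outstandingProposals) {
            sent.add("PROPOSAL " + zxid);
        }
        // Step 4: NEWLEADER is sent.
        sent.add("NEWLEADER");
        // Assumption: the outstanding proposals are committed afterwards,
        // so their COMMIT packets arrive after NEWLEADER.
        for (long zxid : outstandingProposals) {
            sent.add("COMMIT " + zxid);
        }
        return sent;
    }

    public static void main(String[] args) {
        List<String> seen = queuedPackets(
                Arrays.asList(1L), Arrays.asList(2L), Arrays.asList(3L));
        seen.forEach(System.out::println);
    }
}
```

With an empty `outstandingProposals` list, NEWLEADER is the last packet and no PROPOSAL is separated from its COMMIT, which matches the observation that the problem only appears when proposals are still outstanding.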

The follower retries from LOOKING and is expected to succeed eventually; under heavy load, however, this may take many retries. Furthermore, in my case the follower has to sync data from the leader's disk, and it starts over again after the NPE (was the prior sync not flushed?), which may harm the leader.

I don't know whether this is by design, but considering performance, can we at least avoid wasting network and disk I/O?

Issue Links

relates to

ZOOKEEPER-4643 Committed txns may be improperly truncated if follower crashes right after updating currentEpoch but before persisting txns to disk

Resolved

supercedes

ZOOKEEPER-3023 Flaky test: org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalFollowerRunWithDiff

Resolved

ZOOKEEPER-4646 Committed txns may still be lost if followers crash after replying ACK of NEWLEADER but before writing txns to disk

Resolved

ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

Resolved

links to

GitHub Pull Request #1930

GitHub Pull Request #2152

GitHub Pull Request #2188

People

Assignee: Unassigned

Reporter: Liu Haifeng

Votes: 0

Watchers: 3

Dates

Created: 09/Oct/21 08:47

Updated: 24/Oct/24 20:09

Resolved: 19/Sep/24 15:34

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

2h 50m