[CASSANDRA-6747] MessagingService should handle failures on remote nodes. - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Low
Resolution: Fixed
Fix Version/s: 2.1 beta2
Component/s: None
Labels:
- Core

Description

While going through the MessagingService code, I discovered that we don't handle callbacks on failure very well. If a verb handler on the remote machine throws an exception, it goes straight to the uncaught exception handler. The machine that sent the message keeps waiting and eventually times out. On timeout, it does some things that are hard-coded in the MS, such as hints and adding to latency. There is no way in IAsyncCallback to specify what to do on timeouts or on failures.

Here are some examples I found where it would help if we enhanced this system to also propagate failures back to the sender. IAsyncCallback would then have methods like onFailure.

1) From ActiveRepairService.prepareForRepair

IAsyncCallback callback = new IAsyncCallback()
{
    @Override
    public void response(MessageIn msg)
    {
        prepareLatch.countDown();
    }

    @Override
    public boolean isLatencyForSnitch()
    {
        return false;
    }
};

List<UUID> cfIds = new ArrayList<>(columnFamilyStores.size());
for (ColumnFamilyStore cfs : columnFamilyStores)
    cfIds.add(cfs.metadata.cfId);

for (InetAddress neighbour : endpoints)
{
    PrepareMessage message = new PrepareMessage(parentRepairSession, cfIds, ranges);
    MessageOut<RepairMessage> msg = message.createMessage();
    MessagingService.instance().sendRR(msg, neighbour, callback);
}

try
{
    prepareLatch.await(1, TimeUnit.HOURS);
}
catch (InterruptedException e)
{
    parentRepairSessions.remove(parentRepairSession);
    throw new RuntimeException("Did not get replies from all endpoints.", e);
}

2) During snapshot phase in repair, if SnapshotVerbHandler throws an exception, we will wait forever.
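
To make the request concrete, here is a minimal, self-contained sketch of how a failure-aware callback could release a waiting latch early instead of leaving the caller blocked until timeout. The names FailureAwareCallback and PrepareTracker are hypothetical and are not Cassandra classes; messages and endpoints are modelled as plain strings, and nothing here reflects the attached patches.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for IAsyncCallback extended with a failure hook.
interface FailureAwareCallback<T>
{
    void response(T msg);

    // Assumed to be invoked when the remote handler reports an exception.
    void onFailure(String from);
}

// Hypothetical helper: waits for one reply per endpoint, but stops waiting
// as soon as any endpoint reports a failure.
final class PrepareTracker implements FailureAwareCallback<String>
{
    private final CountDownLatch latch;
    private final AtomicBoolean failed = new AtomicBoolean(false);

    PrepareTracker(int expectedReplies)
    {
        latch = new CountDownLatch(expectedReplies);
    }

    @Override
    public void response(String msg)
    {
        latch.countDown();
    }

    @Override
    public void onFailure(String from)
    {
        failed.set(true);
        // Drain the latch so the waiting thread wakes up immediately.
        while (latch.getCount() > 0)
            latch.countDown();
    }

    // Returns true only if every reply arrived in time and none failed.
    boolean awaitSuccess(long timeout, TimeUnit unit)
    {
        try
        {
            return latch.await(timeout, unit) && !failed.get();
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With a callback of this shape, the caller in example 1 could call awaitSuccess and clean up the repair session on a false result, rather than waiting out the full hour when a neighbour has already failed.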

Attachments

- 6747-v3.txt (17 kB), 07/Apr/14 21:42, Yuki Morishita
- CASSANDRA-6747.diff (15 kB), 03/Apr/14 22:17, Sankalp Kohli
- CASSANDRA-6747-v2.diff (15 kB), 04/Apr/14 16:10, Sankalp Kohli

Issue Links

is duplicated by

CASSANDRA-7783 Snapshot repairs can hang forever

Resolved

is related to

CASSANDRA-7560 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession

Resolved

CASSANDRA-7886 Coordinator should not wait for read timeouts when replicas hit Exceptions

Resolved

People

Assignee: Sankalp Kohli

Reporter: Sankalp Kohli

Authors: Sankalp Kohli

Reviewers: Yuki Morishita

Votes: 0

Watchers: 2

Dates

Created: 21/Feb/14 00:10

Updated: 16/Apr/19 09:31

Resolved: 09/Apr/14 02:38