CASSANDRA-9491: Inefficient sequential repairs against vnode clusters



    Description

      I've got a cluster with vnodes enabled. People regularly run sequential repairs against that cluster.
      During such a sequential repair (just nodetool repair -pr), statistics show:

      • a huge increase in live-sstable-count (approximately doubling),
      • a huge number of memtable-switches (approx. 1200 per node per minute),
      • a huge number of flushes (approx. 25 per node per minute),
      • memtable-data-size dropping to (nearly) 0,
      • a huge number of compaction-completed-tasks (approx. 60k per minute) and compacted-bytes (approx. 25 GB per minute).

      These numbers do not match the tiny workload the cluster actually has.
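
      These metrics can be watched with a small JMX poller such as the sketch below. It assumes the 2.x-style metric MBean names (type=ColumnFamily with LiveSSTableCount and MemtableSwitchCount), the default JMX port 7199, and a hypothetical keyspace/table pair myks/mytable; adjust to your schema.

      import javax.management.MBeanServerConnection;
      import javax.management.ObjectName;
      import javax.management.remote.JMXConnector;
      import javax.management.remote.JMXConnectorFactory;
      import javax.management.remote.JMXServiceURL;

      public class RepairMetricsWatcher
      {
          public static void main(String[] args) throws Exception
          {
              String host = args.length > 0 ? args[0] : "localhost";
              JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
              JMXConnector connector = JMXConnectorFactory.connect(url);
              MBeanServerConnection mbs = connector.getMBeanServerConnection();

              // Metric MBeans for one (hypothetical) table; LiveSSTableCount is a
              // gauge (attribute "Value"), MemtableSwitchCount is a counter ("Count").
              ObjectName liveSSTables = new ObjectName(
                  "org.apache.cassandra.metrics:type=ColumnFamily,keyspace=myks,scope=mytable,name=LiveSSTableCount");
              ObjectName memtableSwitches = new ObjectName(
                  "org.apache.cassandra.metrics:type=ColumnFamily,keyspace=myks,scope=mytable,name=MemtableSwitchCount");

              try
              {
                  for (int i = 0; i < 60; i++)   // sample every 10 seconds for 10 minutes
                  {
                      System.out.printf("live sstables=%s, memtable switches=%s%n",
                                        mbs.getAttribute(liveSSTables, "Value"),
                                        mbs.getAttribute(memtableSwitches, "Count"));
                      Thread.sleep(10_000);
                  }
              }
              finally
              {
                  connector.close();
              }
          }
      }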

      The reason for these (IMO crazy) numbers is the way sequential repairs work on vnode clusters:
      Starting at StorageService.forceRepairAsync (invoked via nodetool repair -pr), a repair is initiated on the ranges returned by getLocalPrimaryRanges(keyspace). I'll express the scheme in pseudo-code:

      ranges = getLocalPrimaryRanges(keyspace)
      foreach range in ranges:
      {
      	foreach columnFamily
      	{
      		start async RepairJob
      		{
      			if sequentialRepair:
      				start SnapshotTask against each endpoint (including self)
      				send tree requests if snapshot successful
      			else // if parallel repair
      				send tree requests
      		}
      	}
      }
      

      This means that for each sequential repair, a snapshot (including all its implications such as flushes, tiny sstables, and follow-up compactions) is taken for every range. With 256 vnodes, that means 256 snapshots per column family per repair on each (involved) endpoint; for about 20 tables, this adds up to 5120 snapshots within a very short period of time. You do not notice that amount on the file system, since the tag for the snapshot is always the same, so all snapshots end up in the same directory.
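
      To make the magnitude concrete, here is a quick back-of-the-envelope calculation, assuming the default num_tokens of 256 and the roughly 20 tables mentioned above (numbers are per endpoint and per repair run):

      // Back-of-the-envelope snapshot count for one sequential repair,
      // assuming num_tokens = 256 vnode ranges and ~20 column families.
      public class SnapshotCount
      {
          public static void main(String[] args)
          {
              int vnodeRanges = 256;  // one primary range per vnode with -pr
              int tables = 20;        // assumed number of column families

              int currentTotal = vnodeRanges * tables;  // current code: one snapshot per range = 5120
              int proposedTotal = tables;               // proposed: one snapshot per column family = 20

              System.out.printf("current: %d, proposed: %d%n", currentTotal, proposedTotal);
          }
      }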

      IMO it would be sufficient to snapshot only once per column family. Or am I missing something?

      So basically changing the pseudo-code to:

      ranges = getLocalPrimaryRanges(keyspace)
      
      foreach columnFamily
      {
      	if sequentialRepair:
      		start SnapshotTask against each endpoint (including self)
      }
      foreach range in ranges:
      {
      	start async RepairJob
      	{
      		send tree requests (if snapshot successful)
      	}
      }
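
      For illustration, a rough Java-style sketch of that restructured flow follows. The types used here (Snapshotter, TreeRequester, and so on) are hypothetical stand-ins, not Cassandra's actual repair classes; the only point is the ordering: snapshot each column family once up front, then iterate the ranges and only send tree requests.

      import java.util.List;
      import java.util.concurrent.CompletableFuture;

      // Illustrative sketch only: these interfaces are stand-ins, not Cassandra's real API.
      interface Snapshotter
      {
          CompletableFuture<Void> snapshotColumnFamily(String endpoint, String columnFamily);
      }

      interface TreeRequester
      {
          void sendTreeRequests(String columnFamily, String range, List<String> endpoints);
      }

      class ProposedSequentialRepair
      {
          private final Snapshotter snapshotter;
          private final TreeRequester treeRequester;

          ProposedSequentialRepair(Snapshotter snapshotter, TreeRequester treeRequester)
          {
              this.snapshotter = snapshotter;
              this.treeRequester = treeRequester;
          }

          void repair(List<String> columnFamilies, List<String> localPrimaryRanges, List<String> endpoints)
          {
              // Step 1: snapshot every column family exactly once on each endpoint
              // (including self), instead of once per vnode range.
              for (String cf : columnFamilies)
                  for (String endpoint : endpoints)
                      snapshotter.snapshotColumnFamily(endpoint, cf).join();

              // Step 2: per range and column family, only send the validation
              // (merkle tree) requests; no further snapshots are taken.
              for (String range : localPrimaryRanges)
                  for (String cf : columnFamilies)
                      treeRequester.sendTreeRequests(cf, range, endpoints);
          }
      }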
      

      NB: The code's similar in all versions (checked 2.0.11, 2.0.15, 2.1, 2.2, trunk)

      EDIT: corrected target pseudo-code
