[KUDU-742] Investigate IO stalls caused by long reads - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: M5
Fix Version/s: None
Component/s: perf, tablet
Labels: None
Target Version/s: Backlog

Description

After adding ksck checksum scanners to YCSB, we noticed that 10% of the jobs die because a single RPC timed out. The jobs run after the "ycsb run" phase, when there is usually a huge compaction backlog, so those long checksum scans happen concurrently with writes. Not only are RPCs slowed down, but compactions also go from completing in 4 seconds to taking up to 3 minutes.

Todd suspects a "fairly well known behavior where a constant read workload can delay ext4 checkpoints from happening" and pointed to http://www.spinics.net/lists/linux-ext4/msg25761.html and https://lkml.org/lkml/2011/6/9/647.

It would be nice to get a better understanding of where we're blocking. If we're starving writes, why are reads also blocked? Adding more tracing would help.
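
As a rough illustration of the kind of tracing meant here (this is not Kudu code, which is C++; `traced`, `slow_ops`, and the threshold value are hypothetical names chosen for this sketch), one could time each read and write and record only the operations that exceed a threshold, which would show whether reads, writes, or both are the ones stalling:

```python
import time
from contextlib import contextmanager

# Hypothetical shared log: (label, elapsed_seconds) for every slow operation.
slow_ops = []


@contextmanager
def traced(label, threshold_s=0.5, clock=time.monotonic, sink=slow_ops):
    """Time the wrapped block and record it if it took at least threshold_s."""
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        if elapsed >= threshold_s:
            sink.append((label, elapsed))


if __name__ == "__main__":
    # Example: wrap an I/O call; the label says which path was slow.
    with traced("scan-read", threshold_s=0.0):
        time.sleep(0.01)  # stand-in for a real read
    for label, elapsed in slow_ops:
        print(f"{label} took {elapsed:.3f}s")
```

The clock is injectable only so the sketch can be exercised deterministically; a real implementation would more likely hook into the server's existing tracing and metrics.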

Issue Links

relates to

KUDU-817 Add stack watchdog and latency metric around reads (Open)

People

Assignee: Todd Lipcon
Reporter: Jean-Daniel Cryans
Votes: 0
Watchers: 2

Dates

Created: 05/May/15 17:20
Updated: 29/Feb/16 09:15