Kudu / KUDU-742

Investigate IO stalls caused by long reads

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: M5
    • Fix Version/s: None
    • Component/s: perf, tablet
    • Labels: None

    Description

      One thing we noticed after adding ksck checksum scanners to YCSB is that 10% of the jobs fail because a single RPC timed out. The jobs run after the "ycsb run" phase, when there is usually a large compaction backlog, so those long checksum scans happen concurrently with writes. Not only are RPCs slowed down, but we also see compactions go from completing in 4 seconds to taking up to 3 minutes.

      Todd suspects a "fairly well known behavior where a constant read workload can delay ext4 checkpoints from happening" and pointed to http://www.spinics.net/lists/linux-ext4/msg25761.html and https://lkml.org/lkml/2011/6/9/647.

      It would be nice to get a better understanding of where we're blocking. If we're starving writes, why are reads also blocked? Adding more tracing would help.
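
      As one possible shape for the extra tracing, here is a minimal C++ sketch of a scoped timer that records any operation running longer than a threshold. SlowOpLog, ScopedTimer, and the operation names are hypothetical and are not existing Kudu classes; the idea is only to wrap suspected read, write, and sync call sites so a stall can be attributed to a specific code path.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Kudu: one record per slow operation.
struct SlowOp {
  std::string name;
  std::chrono::milliseconds elapsed;
};

// Collects operations whose duration met or exceeded a threshold.
// Guarded by a mutex because scans and compactions run concurrently.
class SlowOpLog {
 public:
  explicit SlowOpLog(std::chrono::milliseconds threshold)
      : threshold_(threshold) {
    assert(threshold.count() >= 0);
  }

  void Record(const std::string& name, std::chrono::milliseconds elapsed) {
    if (elapsed < threshold_) return;  // Fast operations are ignored.
    std::lock_guard<std::mutex> lock(mu_);
    ops_.push_back({name, elapsed});
    std::cerr << "slow op: " << name << " took " << elapsed.count()
              << " ms\n";
  }

  // Returns a copy so callers never read the vector without the lock.
  std::vector<SlowOp> ops() const {
    std::lock_guard<std::mutex> lock(mu_);
    return ops_;
  }

 private:
  const std::chrono::milliseconds threshold_;
  mutable std::mutex mu_;
  std::vector<SlowOp> ops_;
};

// RAII timer: measures the enclosing scope with a monotonic clock and
// reports the elapsed time to the log when the scope exits.
class ScopedTimer {
 public:
  ScopedTimer(SlowOpLog* log, std::string name)
      : log_(log),
        name_(std::move(name)),
        start_(std::chrono::steady_clock::now()) {}

  ~ScopedTimer() {
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start_);
    log_->Record(name_, elapsed);
  }

  ScopedTimer(const ScopedTimer&) = delete;
  ScopedTimer& operator=(const ScopedTimer&) = delete;

 private:
  SlowOpLog* log_;
  std::string name_;
  std::chrono::steady_clock::time_point start_;
};
```

      Wrapping the block read path and the write/sync path in separate ScopedTimer scopes (for example "scan_read" and "compaction_write", both made-up names) would show whether reads stall at the same moments as writes or only afterwards.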

            People

              tlipcon Todd Lipcon
              jdcryans Jean-Daniel Cryans