  1. Kudu
  2. KUDU-742

Investigate IO stalls caused by long reads


    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: M5
    • Fix Version/s: None
    • Component/s: perf, tablet
    • Labels: None

      Description

      After adding ksck checksum scanners to YCSB, we noticed that 10% of the jobs die because a single RPC times out. The jobs run after the "ycsb run" phase, when there is usually a large compaction backlog, so the long checksum scans happen concurrently with writes. Not only are RPCs slowed down, but compactions also go from completing in 4 seconds to taking up to 3 minutes.

      Todd suspects a "fairly well known behavior where a constant read workload can delay ext4 checkpoints from happening" and pointed to http://www.spinics.net/lists/linux-ext4/msg25761.html and https://lkml.org/lkml/2011/6/9/647.
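
      One way to check this suspicion outside of Kudu is to time fsync calls with and without a concurrent read loop on the same filesystem. The sketch below is only an illustration under assumptions: the function name, file names, and sizes are made up, and with a small read file the reads are served from the page cache, so a meaningful run would need a read file larger than RAM (or dropped caches) on an ext4 mount.

```python
import os
import tempfile
import threading
import time


def measure_fsync_latency(num_writes=20, write_size=4096,
                          read_file_size=1 << 20, with_reader=True,
                          directory=None):
    """Return a list of fsync latencies in seconds, one per write."""
    with tempfile.TemporaryDirectory(dir=directory) as workdir:
        read_path = os.path.join(workdir, "read_target.bin")
        write_path = os.path.join(workdir, "write_target.bin")
        with open(read_path, "wb") as f:
            f.write(os.urandom(read_file_size))

        stop = threading.Event()

        def read_loop():
            # Re-read the whole file in a loop to simulate a constant read workload.
            while not stop.is_set():
                with open(read_path, "rb") as f:
                    while f.read(64 * 1024) and not stop.is_set():
                        pass

        reader = threading.Thread(target=read_loop) if with_reader else None
        if reader is not None:
            reader.start()

        latencies = []
        try:
            with open(write_path, "wb") as f:
                for _ in range(num_writes):
                    f.write(os.urandom(write_size))
                    f.flush()  # push Python's buffer to the OS before timing fsync
                    start = time.monotonic()
                    os.fsync(f.fileno())
                    latencies.append(time.monotonic() - start)
        finally:
            stop.set()
            if reader is not None:
                reader.join()
        return latencies
```

      Comparing max(measure_fsync_latency(with_reader=True)) against the with_reader=False run, with directory pointing at the data disk, gives a rough signal of whether reads are delaying synchronous writes on that machine.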

      It would be nice to get a better understanding of where we're blocking. If we're starving writes, why are reads also blocked? Adding more tracing would help.
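
      As a rough idea of what the extra tracing could record, the sketch below wraps an operation in a timed span and keeps only the ones that exceed a threshold. It is not Kudu's tracing facility: SlowOpTracer, span, and the operation names are hypothetical, and the real change would presumably be made in the tablet server's C++ code.

```python
import contextlib
import threading
import time


class SlowOpTracer:
    """Records (name, elapsed_seconds) for operations slower than a threshold."""

    def __init__(self, threshold_seconds=0.1, clock=time.monotonic):
        self._threshold = threshold_seconds
        self._clock = clock          # injectable so tests can use a fake clock
        self._lock = threading.Lock()
        self.slow_ops = []

    @contextlib.contextmanager
    def span(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            if elapsed >= self._threshold:
                # Guard the shared list, since spans may run on many threads.
                with self._lock:
                    self.slow_ops.append((name, elapsed))
```

      Wrapping both the read path and the write path (for example "with tracer.span('read_block'):" around a block read) would show whether reads and writes stall at the same points in time, which is the open question here.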


              People

              • Assignee: Todd Lipcon (tlipcon)
              • Reporter: Jean-Daniel Cryans (jdcryans)