Samza / SAMZA-963

Add timers to help identify performance issues with KV stores and producers.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11
    • Fix Version/s: 0.11.0
    • Component/s: None
    • Labels: None

      Description

      We have good timing metrics for many of the primary actions in the event loop:

      • Choose
        • Deserialization
        • Poll
      • Process
      • Window
      • Commit

      I've noticed a few things while analyzing job performance at LinkedIn:
      1. We can usually identify problems in Choose using the sub metrics for Deserialization and Poll. I don't think any work needs to be done here.

      2. Slowness in Process or Window is usually caused by business logic (e.g. side calls to remote DBs), but it can also be caused by slowness in the KV Store (e.g. "stalls" in the case of RocksDB).

      3. Slowness in Commit can be caused by slowness flushing the stores or producers. It can also come from checkpointing.

      #2 would be easier to diagnose if we had timers around all the main KV Store operations, including get, put, delete, and the batch operations. Then we could isolate KV Store performance from business logic performance.

      #3 would be easier to diagnose if we had timers around all the flushes. Specifically, I think we should add a "flush-ns" metric to KeyValueStoreMetrics and update it from each of the stores. I noticed that KafkaSystemProducerMetrics already has a "flush-ns" metric, so the Kafka producer is covered.

      To summarize, this ticket is to add timing metrics around all KV Store operations: not only user operations like get/put, but flush as well.
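
      The idea in #2 and #3 can be sketched as a decorator that times every call into a store. This is only an illustration under stated assumptions: SimpleStore, InMemoryStore, StoreTimers, and TimedStore are hypothetical names rather than Samza classes, and the "get-ns", "put-ns", and "delete-ns" metric names are assumed by analogy with the "flush-ns" name mentioned above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical minimal store interface; NOT Samza's actual KeyValueStore API. */
interface SimpleStore {
    String get(String key);
    void put(String key, String value);
    void delete(String key);
    void flush();
}

/** In-memory stand-in for a real store such as RocksDB. */
class InMemoryStore implements SimpleStore {
    private final Map<String, String> data = new HashMap<>();
    public String get(String key) { return data.get(key); }
    public void put(String key, String value) { data.put(key, value); }
    public void delete(String key) { data.remove(key); }
    public void flush() { /* nothing to flush for an in-memory map */ }
}

/** Accumulates total elapsed nanoseconds and call counts per metric name. */
class StoreTimers {
    private final Map<String, Long> totalNs = new HashMap<>();
    private final Map<String, Long> calls = new HashMap<>();

    void record(String name, long elapsedNs) {
        totalNs.merge(name, elapsedNs, Long::sum);
        calls.merge(name, 1L, Long::sum);
    }
    long totalNs(String name) { return totalNs.getOrDefault(name, 0L); }
    long calls(String name) { return calls.getOrDefault(name, 0L); }
}

/** Decorator that times each operation, so store latency is visible apart from business logic. */
class TimedStore implements SimpleStore {
    private final SimpleStore inner;
    private final StoreTimers timers;

    TimedStore(SimpleStore inner, StoreTimers timers) {
        this.inner = inner;
        this.timers = timers;
    }

    /** Runs op and records its elapsed time under the given name, even if op throws. */
    private <T> T timed(String name, Supplier<T> op) {
        long start = System.nanoTime();
        try {
            return op.get();
        } finally {
            timers.record(name, System.nanoTime() - start);
        }
    }

    // Metric names other than "flush-ns" are assumptions made for this sketch.
    public String get(String key) { return timed("get-ns", () -> inner.get(key)); }
    public void put(String key, String value) {
        timed("put-ns", () -> { inner.put(key, value); return null; });
    }
    public void delete(String key) {
        timed("delete-ns", () -> { inner.delete(key); return null; });
    }
    public void flush() {
        timed("flush-ns", () -> { inner.flush(); return null; });
    }
}

class TimedStoreDemo {
    public static void main(String[] args) {
        StoreTimers timers = new StoreTimers();
        SimpleStore store = new TimedStore(new InMemoryStore(), timers);
        store.put("k", "v");
        store.get("k");
        store.flush();
        System.out.println("flush calls: " + timers.calls("flush-ns"));
        System.out.println("flush total ns: " + timers.totalNs("flush-ns"));
    }
}
```

      With a wrapper like this, time spent in Process or Window can be compared against the accumulated store timers to tell KV Store stalls apart from slow business logic.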

      Related work: SAMZA-449

      1. SAMZA-963.1.patch
        10 kB
        Fred Ji
      2. SAMZA-963.2.patch
        9 kB
        Fred Ji

        Activity

        fredji Fred Ji added a comment -

        After discussing with Jake Maes and Yi Pan (Data Infrastructure), we will add timers in KeyValueStorageEngine to capture latency at the upper level, rather than at the lower level in each raw store.
        The metrics we are going to add cover:
        get,
        put,
        delete,
        flush,
        all,
        range
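
        A rough sketch of what timing at the upper level could look like, assuming a hypothetical engine that wraps a raw byte store and handles serialization itself. RawByteStore and TimedEngine are illustrative names, not the actual KeyValueStorageEngine code. For range and all, this sketch times only the call that creates the iterator; whether iteration cost should also be included is a design choice not settled here.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

/** Hypothetical raw store holding serialized values; stands in for RocksDB or similar. */
class RawByteStore {
    private final TreeMap<String, byte[]> data = new TreeMap<>();
    byte[] get(String key) { return data.get(key); }
    void put(String key, byte[] value) { data.put(key, value); }
    void delete(String key) { data.remove(key); }
    void flush() { /* nothing to flush in memory */ }
    /** Entries with keys in [from, to), in key order. */
    Iterator<Map.Entry<String, byte[]>> range(String from, String to) {
        return data.subMap(from, to).entrySet().iterator();
    }
    Iterator<Map.Entry<String, byte[]>> all() {
        return data.entrySet().iterator();
    }
}

/** Hypothetical upper-level engine: one set of timers covers any raw store underneath. */
class TimedEngine {
    private final RawByteStore raw;
    private final Map<String, Long> totalNs = new HashMap<>();
    private final Map<String, Long> calls = new HashMap<>();

    TimedEngine(RawByteStore raw) { this.raw = raw; }

    /** Times body, including serialization work done inside it. */
    private <T> T timed(String op, Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            totalNs.merge(op, System.nanoTime() - start, Long::sum);
            calls.merge(op, 1L, Long::sum);
        }
    }

    String get(String key) {
        return timed("get", () -> {
            byte[] bytes = raw.get(key);
            return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
        });
    }
    void put(String key, String value) {
        timed("put", () -> { raw.put(key, value.getBytes(StandardCharsets.UTF_8)); return null; });
    }
    void delete(String key) { timed("delete", () -> { raw.delete(key); return null; }); }
    void flush() { timed("flush", () -> { raw.flush(); return null; }); }

    /** Only iterator creation is timed here; values are decoded lazily afterwards. */
    Iterator<String> range(String from, String to) {
        return decode(timed("range", () -> raw.range(from, to)));
    }
    Iterator<String> all() {
        return decode(timed("all", () -> raw.all()));
    }

    private Iterator<String> decode(Iterator<Map.Entry<String, byte[]>> it) {
        return new Iterator<String>() {
            public boolean hasNext() { return it.hasNext(); }
            public String next() {
                return new String(it.next().getValue(), StandardCharsets.UTF_8);
            }
        };
    }

    long calls(String op) { return calls.getOrDefault(op, 0L); }
    long totalNs(String op) { return totalNs.getOrDefault(op, 0L); }
}

class TimedEngineDemo {
    public static void main(String[] args) {
        TimedEngine engine = new TimedEngine(new RawByteStore());
        engine.put("a", "1");
        engine.put("b", "2");
        Iterator<String> it = engine.all();
        while (it.hasNext()) System.out.println(it.next());
        System.out.println("put calls: " + engine.calls("put"));
    }
}
```

        Placing the timers in the engine means every raw store implementation is measured the same way, at the cost of including serialization time in each measurement.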

        fredji Fred Ji added a comment -

        RB submitted: https://reviews.apache.org/r/50619/

        nickpan47 Yi Pan (Data Infrastructure) added a comment -

        Merged and submitted. Thanks!


          People

          • Assignee: Fred Ji (fredji)
          • Reporter: Jake Maes (jmakes)
          • Votes: 0
          • Watchers: 4
