Uploaded image for project: 'Kudu'
  1. Kudu
  2. KUDU-2762

Improve write transaction tracing



    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10.0
    • Component/s: None
    • Labels:


      Here's a write transaction trace:

      W0331 02:16:27.859648 27337 rpcz_store.cc:244] Call kudu.tserver.TabletServerService.Write from (ReqId={client: 6a904be8604b482989e3d1592f8824f2, seq_no=66108, attempt_no=1}) took 7855ms (client timeout 10000).
      W0331 02:16:27.859728 27337 rpcz_store.cc:248] Trace:
      0331 02:16:20.003835 (+     0us) service_pool.cc:159] Inserting onto call queue
      0331 02:16:20.003844 (+     9us) service_pool.cc:218] Handling call
      0331 02:16:27.859641 (+7855797us) inbound_call.cc:157] Queueing success response
      Related trace 'txn':
      0331 02:16:20.003988 (+     0us) write_transaction.cc:101] PREPARE: Starting
      0331 02:16:20.004087 (+    99us) write_transaction.cc:268] Acquiring schema lock in shared mode
      0331 02:16:20.004088 (+     1us) write_transaction.cc:271] Acquired schema lock
      0331 02:16:20.004088 (+     0us) tablet.cc:400] PREPARE: Decoding operations
      0331 02:16:20.004130 (+    42us) tablet.cc:422] PREPARE: Acquiring locks for 1 operations
      0331 02:16:20.004137 (+     7us) tablet.cc:426] PREPARE: locks acquired
      0331 02:16:20.004138 (+     1us) write_transaction.cc:126] PREPARE: finished.
      0331 02:16:20.004154 (+    16us) write_transaction.cc:136] Start()
      0331 02:16:20.004157 (+     3us) write_transaction.cc:141] Timestamp: P: 1554016580004153 usec, L: 0
      0331 02:16:20.004192 (+    35us) log.cc:582] Serialized 3741 byte log entry
      0331 02:16:27.859496 (+7855304us) write_transaction.cc:149] APPLY: Starting
      0331 02:16:27.859608 (+   112us) tablet_metrics.cc:365] ProbeStats: bloom_lookups=2,key_file_lookups=2,delta_file_lookups=4,mrs_lookups=0
      0331 02:16:27.859614 (+     6us) log.cc:582] Serialized 28 byte log entry
      0331 02:16:27.859622 (+     8us) write_transaction.cc:309] Releasing row and schema locks
      0331 02:16:27.859623 (+     1us) write_transaction.cc:277] Released schema lock
      0331 02:16:27.859625 (+     2us) write_transaction.cc:196] FINISH: updating metrics
      Metrics: {"tcmalloc_contention_cycles":7552,"child_traces":[["txn",{"apply.queue_time_us":7854429,"cfile_cache_hit":12,"cfile_cache_hit_bytes":72423,"delta_iterators_relevant":2,"num_ops":1,"prepare.queue_time_us":11,"prepare.run_cpu_time_us":226,"prepare.run_wall_time_us":225,"raft.queue_time_us":12,"raft.run_cpu_time_us":92,"raft.run_wall_time_us":91,"replication_time_us":898,"spinlock_wait_cycles":18688}]]}

      It could use some polish. Here's a few things to fix:
      1. Nit: the casing and punctuation of PREPARE: finished. and APPLY: Starting is inconsistent.
      2. PREPARE has a start and end; APPLY has a start but no end; replication has neither. We should add the missing trace events.
      3. nit: There's no need for the PREPARE: preamble to trace events within PREPARE: Starting and PREPARE: finished..

      It's potentially misleading to read this trace. Almost all of the time is spent between the trace event reporting the local WAL was written and the trace event reporting apply started. Naively, this should mean the operation took a long time to replicate, as that's the most obvious thing that could take a long time. However, the metrics show the time was taken waiting in the apply queue, held up by some other slow activity. This operation was fast to perform-- it took a long time because it was held up, idling, by something else.




            • Assignee:
              mbarnett Mitch Barnett
              wdberkeley William Berkeley
            • Votes:
              0 Vote for this issue
              2 Start watching this issue


              • Created: