Aurora / AURORA-178

Log/observe snapshot operations

Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: Scheduler

    Description

      Snapshot operations that run excessively long are currently not obvious from the scheduler logs or dashboards. Since this is a potentially critical/dangerous operation (in some cases leading to ZooKeeper timeouts and scheduler suicide), it would be prudent to expose the relevant information more readily (e.g. when the operations commence and complete, and how long they take).

      From Zameer:

      The doSnapshot method of LogStorage is timed with the key "scheduler_log_snapshot". These are the stats it produces:

      scheduler_log_snapshot_events 19
      scheduler_log_snapshot_events_per_sec 0.0
      scheduler_log_snapshot_nanos_per_event 0.0
      scheduler_log_snapshot_nanos_total 373115257383
      scheduler_log_snapshot_nanos_total_per_sec 0.0
      scheduler_log_snapshot_persist_events 19
      scheduler_log_snapshot_persist_events_per_sec 0.0
      scheduler_log_snapshot_persist_nanos_per_event 0.0
      scheduler_log_snapshot_persist_nanos_total 339151517713
      scheduler_log_snapshot_persist_nanos_total_per_sec 0.0
      scheduler_log_snapshots 19

      Which metric should be tracked in our dashboard?
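
As a rough aid to reading these counters: assuming `scheduler_log_snapshot_nanos_total` and `scheduler_log_snapshot_events` are cumulative since scheduler start, dividing one by the other gives the mean snapshot duration, roughly 19.6 s here (about 17.9 s of it in the persist step). The Java sketch below shows the arithmetic; `SnapshotStats` and `meanSeconds` are hypothetical names, not part of Aurora.

```java
// Hypothetical helper, not part of Aurora: derives a mean duration from a
// cumulative nanosecond total and an event count, as quoted in this issue.
final class SnapshotStats {
  private SnapshotStats() {}

  /** Mean seconds per event; returns 0.0 when no events have completed. */
  static double meanSeconds(long nanosTotal, long events) {
    if (events <= 0) {
      return 0.0; // no completed events yet, so no meaningful average
    }
    return (nanosTotal / (double) events) / 1_000_000_000.0;
  }

  public static void main(String[] args) {
    // Values copied from the stats listed in this issue.
    System.out.println("snapshot mean (s): " + meanSeconds(373115257383L, 19));
    System.out.println("persist mean (s):  " + meanSeconds(339151517713L, 19));
  }
}
```

A mean hides outliers, and it only counts snapshots that finished, so a dashboard built on it would probably still want a per-snapshot signal as well.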

      From Bill F:

      A very long snapshot might never be reflected there if a suicide happens midway through. The minimal fix would be to log a message when a snapshot is about to commence.
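
A minimal sketch of that suggestion, assuming the snapshot can be passed in as a `Runnable`: log one line before the work starts and one after it ends, so the start line is present even if the process dies partway through. `LoggedSnapshot` and `run` are hypothetical names, and Aurora's actual `doSnapshot` may be structured differently.

```java
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

// Hypothetical wrapper, not Aurora's implementation.
final class LoggedSnapshot {
  private static final Logger LOG = Logger.getLogger(LoggedSnapshot.class.getName());

  private LoggedSnapshot() {}

  /** Runs the snapshot with start/finish log lines; returns elapsed milliseconds. */
  static long run(Runnable snapshot) {
    // Logged before any work, so it survives a crash in the middle of the snapshot.
    LOG.info("Snapshot starting.");
    long startNanos = System.nanoTime();
    try {
      snapshot.run();
    } catch (RuntimeException e) {
      LOG.warning("Snapshot failed after " + millisSince(startNanos) + " ms.");
      throw e;
    }
    long elapsedMs = millisSince(startNanos);
    LOG.info("Snapshot complete after " + elapsedMs + " ms.");
    return elapsedMs;
  }

  private static long millisSince(long startNanos) {
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
  }
}
```

A caller would wrap the existing snapshot call, for example `LoggedSnapshot.run(() -> storage.snapshot())`, where `storage.snapshot()` stands in for whatever method performs the snapshot.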

          People

            Assignee: Unassigned
            Reporter: Jonathan Boulle (jonboulle)