This patch extends the metrics reported by the supervisor and worker. The following metrics are currently being implemented (the list is not exhaustive):
- Kill Count by Category - Assignment Change/Heartbeat too old/Heap Space
- Time spent in each state
- Time to actually kill a worker (from the supervisor identifying the need to the actual change in the worker's state) - per worker?
- Time to start a worker for a topology, measured from the first time the assignment is read
- Worker cleanup Time/Worker cleanup Retries
- Worker Suicide Count - category: internal error or Assignment Change
- Supervisor restart Count
- Blobstore (Request to download time)
- # Download time of an individual blob (inside localizer): from the localizer getting the request to actually download until the HDFS request finishes
- # Download rate of an individual blob (inside localizer)
- # Supervisor localizer thread blob download time (outside localizer)
- Blobstore update count due to version change
- Blobstore Storage by users
- Avg/Max Time to respond to Http Request
More metrics may be added later.
This patch will also refactor code in relevant files. Bugs found during the process will be reported in other issues and handled separately.
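
As a rough illustration of two of the metrics listed above (kill count by category and time spent in each state), here is a minimal sketch using only the JDK. It is not the code in this patch: the class `SupervisorMetricsSketch`, the enum values, and the method names are all hypothetical, and the real implementation may well use a metrics library instead. Timestamps are passed in explicitly so the logic is easy to test; a caller would normally pass `System.nanoTime()`.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical helper (not part of the patch): per-category kill counts and time per state. */
class SupervisorMetricsSketch {
    /** Kill categories taken from the list above; the enum itself is an assumption. */
    enum KillReason { ASSIGNMENT_CHANGE, HEARTBEAT_TOO_OLD, HEAP_SPACE }

    /** Illustrative worker states only; the real state machine may differ. */
    enum WorkerState { EMPTY, RUNNING, KILL }

    // Both maps are fully populated in the constructor and never modified afterwards,
    // so concurrent lookups are safe; LongAdder handles concurrent increments.
    private final Map<KillReason, LongAdder> killCounts = new EnumMap<>(KillReason.class);
    private final Map<WorkerState, LongAdder> nanosInState = new EnumMap<>(WorkerState.class);

    private WorkerState current;
    private long enteredAtNanos;

    SupervisorMetricsSketch(WorkerState initial, long nowNanos) {
        for (KillReason r : KillReason.values()) {
            killCounts.put(r, new LongAdder());
        }
        for (WorkerState s : WorkerState.values()) {
            nanosInState.put(s, new LongAdder());
        }
        this.current = initial;
        this.enteredAtNanos = nowNanos;
    }

    /** Counts one worker kill under the given category. */
    void recordKill(KillReason reason) {
        killCounts.get(reason).increment();
    }

    long killCount(KillReason reason) {
        return killCounts.get(reason).sum();
    }

    /** Charges the elapsed time to the state being left, then switches to the next state. */
    synchronized void transition(WorkerState next, long nowNanos) {
        nanosInState.get(current).add(nowNanos - enteredAtNanos);
        current = next;
        enteredAtNanos = nowNanos;
    }

    /** Total time in completed visits to the state; the visit still in progress is not included. */
    synchronized long nanosIn(WorkerState state) {
        return nanosInState.get(state).sum();
    }
}
```

A caller would invoke `recordKill(...)` at the point where the supervisor decides to kill a worker and `transition(next, System.nanoTime())` on every state change. Keeping one counter per category, rather than a single total, is what allows kills caused by assignment changes to be told apart from those caused by stale heartbeats or heap exhaustion.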