Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
1.5.0
-
None
Description
The problem
Primarily affects pluggable (Python-based) services.
During cluster installation, there may be a few significant pauses between task executions. During a pause, the previous task shows up as completed in the UI, and the next task shows up as not yet started. This effect may be noticed 1-3 times when installing an entire cluster, and in some cases a single pause takes around 3 minutes.
Initial analysis shows that this time is consumed by executing service checks that have been queued during cluster installation.
Some background:
The server issues a large set of EXECUTION_COMMANDs at once, a few times during cluster installation. Typically, all commands in one set are sent to the agent at once. On the agent, status and execution commands are stored in the same queue. While the cluster is being installed, status commands are appended to the end of the queue. So when the last INSTALL command completes, there is a large number of status commands in the queue (hundreds?). Executing them may take around 3 minutes. START commands that have been issued by the server are not scheduled for execution until all the STATUS_COMMANDs in the queue have been performed. In the UI, it looks as if the installation has hung.
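To make the blocking effect concrete, here is a minimal sketch of a single FIFO queue shared by status and execution commands. It is not the agent's actual code: the function name, the tuple layout, and the count of 200 queued status commands are assumptions for illustration only.

```python
from collections import deque


def commands_ahead_of_start(num_status_commands):
    """Count the commands a START has to wait for in one shared FIFO queue.

    Hypothetical model: the tuple layout and names are illustrative only.
    """
    shared_queue = deque()

    # Status commands pile up at the end of the queue while INSTALL runs.
    for i in range(num_status_commands):
        shared_queue.append(("STATUS_COMMAND", i))

    # The server then issues START, which lands behind all of them.
    shared_queue.append(("EXECUTION_COMMAND", "START"))

    executed_before_start = 0
    while shared_queue:
        kind, _payload = shared_queue.popleft()
        if kind == "EXECUTION_COMMAND":
            break
        executed_before_start += 1
    return executed_before_start


# With an assumed backlog of 200 status commands, START runs only after all 200.
print(commands_ahead_of_start(200))
```

The delay seen in the UI is then roughly the backlog size multiplied by the cost of one status command, which is why slower status commands make the pause more visible.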
Why it became noticeable with pluggable services:
It's due to a few factors:
- Python services install faster
- status commands run a bit slower because we invoke a separate subprocess to determine every status, and also perform more IO
I've attached a relevant log. The interesting part comes after this line:
INFO 2013-12-18 13:43:44,163 Heartbeat.py:76 - Sending heartbeat with response id: 419 and timestamp: 1387374224161. Command(s) in progress: True. Components mapped: True
At that point the Zookeeper start has finished, and after that only status commands are executed for a few minutes (the START task for the next component has just shown up in the UI as scheduled, but not yet started).
Selected solution
I prefer the approach of checking whether the command queue is empty and, if it is, picking status commands from last_status. This is better because it can be done every 2 seconds, whereas status commands are sent by the server only once a minute. I assume we still do not store duplicate commands in last_status.
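A rough sketch of this approach, under stated assumptions: the class name, method names, and the `componentName` key used for de-duplication are hypothetical and are not taken from the agent code. Only the idea (execution commands first, status commands from last_status when the queue is empty) comes from the description above.

```python
import queue


class ActionQueueSketch:
    """Hypothetical model: execution commands are queued, status commands are not."""

    def __init__(self):
        self.command_queue = queue.Queue()  # execution commands only
        self.last_status = {}               # component name -> latest status command

    def put_execution(self, command):
        self.command_queue.put(command)

    def put_status(self, command):
        # Keyed by component (assumed key), so a newer status command
        # replaces the older one and no duplicates are stored.
        self.last_status[command["componentName"]] = command

    def next_batch(self):
        """Return the commands to run on this tick (called, e.g., every 2 seconds)."""
        try:
            # Execution commands always take priority.
            return [self.command_queue.get_nowait()]
        except queue.Empty:
            # Queue is idle: run the most recent status command per component.
            return list(self.last_status.values())


q = ActionQueueSketch()
q.put_status({"componentName": "ZOOKEEPER_SERVER", "id": 1})
q.put_status({"componentName": "ZOOKEEPER_SERVER", "id": 2})  # replaces id 1
q.put_execution({"role": "DATANODE", "roleCommand": "START"})

first = q.next_batch()   # the START command, not delayed by status commands
second = q.next_batch()  # queue is empty, so status commands are picked up
```

In this model a START command never waits behind more than the status batch already in progress, and the status backlog cannot grow because last_status holds at most one entry per component.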