Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Noticed when starting LLAP on a 2-node cluster. Slider AM logs:
2017-05-22 22:04:33,047 [956937652@qtp-624693846-4] INFO agent.AgentProviderService - Registration response: RegistrationResponse{response=OK, responseId=0, statusCommands=null}
...
2017-05-22 22:04:34,946 [956937652@qtp-624693846-4] INFO agent.AgentProviderService - Registration response: RegistrationResponse{response=OK, responseId=0, statusCommands=null}
Then nothing useful happens for a while, until:
2017-05-22 22:04:43,099 [956937652@qtp-624693846-4] INFO agent.AgentProviderService - Installing LLAP on container_1495490227300_0002_01_000002.
The corresponding logs from both agents show a gap of almost exactly 10 seconds.
After the gap, each agent talks back to the AM; ~30 ms after the end of each container's gap (presumably once the AM hears from the agent), the AM starts installing LLAP on that container.
First agent:
INFO 2017-05-22 22:04:33,055 Controller.py:180 - Registered with the server with {u'exitstatus': 0,
INFO 2017-05-22 22:04:33,055 Controller.py:630 - Response from server = OK
INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Queue result: {'componentStatus': [], 'reports': []}
INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Sending heartbeat with response id: 0 and timestamp: 1495490683064. Command(s) in progress: False. Components mapped: False

Second agent:
INFO 2017-05-22 22:04:34,948 Controller.py:180 - Registered with the server with {u'exitstatus': 0,
INFO 2017-05-22 22:04:34,948 Controller.py:630 - Response from server = OK
INFO 2017-05-22 22:04:44,959 AgentToggleLogger.py:40 - Queue result: {'componentStatus': [], 'reports': []}
INFO 2017-05-22 22:04:44,960 AgentToggleLogger.py:40 - Sending heartbeat with response id: 0 and timestamp: 1495490684959. Command(s) in progress: False. Components mapped: False
I've observed the same on multiple different clusters.
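The gap matches the heartbeat interval exactly, which is consistent with the agent sleeping for one full interval between registering and sending its first heartbeat. A minimal sketch of that loop shape (names are hypothetical, not the actual Slider agent code; a single constant drives both delays):

import time

HEARTBEAT_INTERVAL_SEC = 10  # hypothetical: one value used for both delays

def run_agent(register, heartbeat):
    # Sketch of the agent control-loop shape, not actual Slider code.
    register()                            # "Registered with the server ..."
    time.sleep(HEARTBEAT_INTERVAL_SEC)    # the ~10 s gap seen in the agent logs
    while True:
        heartbeat()                       # "Sending heartbeat with response id ..."
        time.sleep(HEARTBEAT_INTERVAL_SEC)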
I plan to introduce a separate constant for the delay between registration and the start of heartbeats, and set it to 10 ms (there is really no need for agents to sit idle here for 10 seconds).
I also observed that the delay between subsequent heartbeats is 10 seconds, which is fairly high as well; I am going to bring that down to 1 second. I tested this configuration and all the apps I tested with come up much faster (all app containers were up and running in ~7-8 seconds). All UTs also pass.
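Sketched as a change to the loop above (again with hypothetical constant names; the actual patch may differ):

import time

FIRST_HEARTBEAT_DELAY_SEC = 0.01  # 10 ms between registration and the first heartbeat
HEARTBEAT_INTERVAL_SEC = 1        # 1 s between subsequent heartbeats

def run_agent(register, heartbeat):
    register()
    time.sleep(FIRST_HEARTBEAT_DELAY_SEC)   # no longer waits a full heartbeat interval
    while True:
        heartbeat()
        time.sleep(HEARTBEAT_INTERVAL_SEC)

With the first heartbeat arriving ~10 ms after registration, the AM can start installing components almost immediately instead of after the 10-second gap.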