Slider / SLIDER-1236

Unnecessary 10 second sleep before installation

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed

    Description

      Noticed when starting LLAP on a 2-node cluster. Slider AM logs:

      2017-05-22 22:04:33,047 [956937652@qtp-624693846-4] INFO  agent.AgentProviderService - Registration response: RegistrationResponse{response=OK, responseId=0, statusCommands=null}
      ...
      2017-05-22 22:04:34,946 [956937652@qtp-624693846-4] INFO  agent.AgentProviderService - Registration response: RegistrationResponse{response=OK, responseId=0, statusCommands=null}
      

      Then nothing useful goes on for a while, until:

      2017-05-22 22:04:43,099 [956937652@qtp-624693846-4] INFO  agent.AgentProviderService - Installing LLAP on container_1495490227300_0002_01_000002.
      

      If you look at the corresponding logs from both agents, you can see that they both have a gap that's pretty much exactly 10 sec.
      After the gap, they talk back to the AM; ~30 ms after the end of each container's gap, presumably after hearing from it, the AM starts installing LLAP on that container.

      INFO 2017-05-22 22:04:33,055 Controller.py:180 - Registered with the server with {u'exitstatus': 0,
      INFO 2017-05-22 22:04:33,055 Controller.py:630 - Response from server = OK
      INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Queue result: {'componentStatus': [], 'reports': []}
      INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Sending heartbeat with response id: 0 and timestamp: 1495490683064. Command(s) in progress: False. Components mapped: False
      
      INFO 2017-05-22 22:04:34,948 Controller.py:180 - Registered with the server with {u'exitstatus': 0,
      INFO 2017-05-22 22:04:34,948 Controller.py:630 - Response from server = OK
      INFO 2017-05-22 22:04:44,959 AgentToggleLogger.py:40 - Queue result: {'componentStatus': [], 'reports': []}
      INFO 2017-05-22 22:04:44,960 AgentToggleLogger.py:40 - Sending heartbeat with response id: 0 and timestamp: 1495490684959. Command(s) in progress: False. Components mapped: False
      

      I've observed the same on multiple different clusters.

        Activity

          gsaha Gour Saha added a comment -

          I am planning to use a separate constant for the delay after registration and before heartbeats start, and set that to 10ms (there is really no need for agents to sit idle here for 10secs).

          I also observed that the delay between subsequent heartbeats is 10 secs, which is fairly high as well. I am going to bring that down to 1 sec. I tested this configuration and all apps I tested with are coming up much faster (in ~7-8 secs all app containers were up and running). All UTs are also passing.
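          As a rough sketch of what this amounts to in the agent loop (POST_REGISTRATION_DELAY_SEC, agent_loop and the controller methods are illustrative assumptions, not the actual Controller.py code; HEARTBEAT_IDDLE_INTERVAL_SEC is the existing constant):

          import time

          HEARTBEAT_IDDLE_INTERVAL_SEC = 10    # current delay between idle heartbeats (proposed: 1)
          POST_REGISTRATION_DELAY_SEC = 0.01   # proposed separate 10 ms delay (assumed name)

          def agent_loop(controller):
              controller.register_with_am()
              # Previously the agent reused HEARTBEAT_IDDLE_INTERVAL_SEC here and sat
              # idle for 10 s before the first heartbeat, which is what delays INSTALL.
              time.sleep(POST_REGISTRATION_DELAY_SEC)
              while controller.is_running():
                  controller.send_heartbeat()
                  time.sleep(HEARTBEAT_IDDLE_INTERVAL_SEC)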


          jira-bot ASF subversion and git services added a comment -

          Commit 9b6deb38efe44cd4e5cff26613e318383613d4be in incubator-slider's branch refs/heads/develop from gsaha
          [ https://git-wip-us.apache.org/repos/asf?p=incubator-slider.git;h=9b6deb3 ]

          SLIDER-1236 Unnecessary 10 second sleep before installation

          billie Billie Rinaldi added a comment -

          Hey gsaha, there could be a problem with decreasing the HEARTBEAT_IDDLE_INTERVAL_SEC so much. The Controller uses a count * HEARTBEAT_IDDLE_INTERVAL_SEC to determine when it should re-read the AM location from ZK and re-register with the AM. So now this will be triggered after 3 seconds instead of after 30 seconds.
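          Roughly, the coupling looks like this (RETRIES_BEFORE_REREGISTER and seconds_until_reregister are illustrative names, not the actual Controller.py code):

          RETRIES_BEFORE_REREGISTER = 3   # assumed fixed count multiplied by the interval

          def seconds_until_reregister(heartbeat_interval_sec):
              # Time the agent waits before re-reading the AM location from ZK
              # and re-registering, per the count * interval check described above.
              return RETRIES_BEFORE_REREGISTER * heartbeat_interval_sec

          print(seconds_until_reregister(10))  # old 10 s interval -> re-register after ~30 s
          print(seconds_until_reregister(1))   # new 1 s interval  -> re-register after only ~3 s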

          gsaha Gour Saha added a comment -

          billie.rinaldi very good point. I tested this scenario as well. Most Slider applications, specifically power-user ones like LLAP that serve end-user requests and run queries on the order of milliseconds, are complaining that Slider is actually slowing them down. It slows them down during start, and it slows them down during recovery on AM failures (this is where, although the app containers are running fine, Ambari continues to report that HiveServer2 Interactive is down). They are also complaining that other competing frameworks start applications on the order of 1-2 secs for a reasonable number of containers.

          I think Slider is being overly cautious here. I personally also think 30 secs is really high. With a 30 sec delay, 10 apps all trying to fail over to their AMs can quickly create the 3 sec scenario. Having said that, I agree it might be an issue in problematic clusters where thousands of AMs fail frequently and simultaneously. I think it might be a good idea to expose a cluster-level configuration in Slider for this property, where admins can set a value that reflects the typical number of applications running in the cluster. We can also document suggested values for various ranges of running apps. What do you think billie.rinaldi?

          billie Billie Rinaldi added a comment -

          It might be easiest to make it configurable through the agent.ini file (though configuring that is relatively inconvenient). The number of apps that are trying to reconnect to their AM is less important than the number of containers. Is it still okay when one AM fails in an app that has a large number of containers? Will the AM easily support 10x as many heartbeats when there are a large number of containers? What happens if the AM hasn't failed, and it just wasn't responding for 3 seconds? (I would imagine the AM is robust to unnecessary re-registration, but am not familiar with that part of the code.)
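          A minimal sketch of an agent.ini override, assuming a hypothetical [heartbeat] section and idle_interval_sec key (neither exists today), and using Python 2's ConfigParser since the agent code base is Python 2:

          import ConfigParser

          DEFAULT_HEARTBEAT_IDDLE_INTERVAL_SEC = 10

          config = ConfigParser.ConfigParser()
          config.read('agent.ini')
          try:
              # e.g. a [heartbeat] section with idle_interval_sec = 1 in agent.ini
              heartbeat_interval = config.getint('heartbeat', 'idle_interval_sec')
          except (ConfigParser.NoSectionError, ConfigParser.NoOptionError):
              heartbeat_interval = DEFAULT_HEARTBEAT_IDDLE_INTERVAL_SEC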

          It also appears that there are random sleeps between 0 and 30 seconds if an Exception is thrown from a registration or heartbeat (presumably the sleep is randomized so that subsequent attempts are staggered when there are a lot of agents). I don't know what could cause the Exception, but perhaps the 30 should be lowered a bit as well?
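          Presumably the randomized retry sleep is something along these lines (an assumed shape, not the actual exception handler):

          import random
          import time

          MAX_EXCEPTION_RETRY_DELAY_SEC = 30   # the upper bound suggested for lowering

          def sleep_before_retry():
              # Randomizing the delay keeps a large number of agents from all
              # retrying registration/heartbeat at the same instant.
              time.sleep(random.randint(0, MAX_EXCEPTION_RETRY_DELAY_SEC))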

          gsaha Gour Saha added a comment -

          Right, the total number of containers is the real key here, and technically not the number of apps. I used apps as a count, keeping a standard set of containers per app in mind. We can explicitly refer to the number of containers, although if there are N containers across 1 app or 10 apps, the chances of containers continuing to hit ZK are higher in the 10 app scenario, since it would take 10 AMs (instead of 1) to recover before all N containers stop hitting ZK.

          While I was experimenting to find an ideal value for HEARTBEAT_IDDLE_INTERVAL_SEC, at one point I had brought it down to 10 ms (=0.01) and the app came up fine. The agents logged multiple NO_OP heartbeats between every 2 heartbeats in which they did something meaningful like INSTALL and START. So the AM-Agent protocol works fine with the reduced wait. However, I think we need to test with an app with 100s of containers (or more) to validate that the AM's robustness does not degrade.

          I think we need more testing on the reduction of the heartbeat interval from 10 secs -> 1 sec. Also, as you mentioned, there are several other exception code paths with a potential wait of up to 30 secs which are not addressed. So I think I will create a separate jira and address all the code blocks which are affected by HEARTBEAT_IDDLE_INTERVAL_SEC. We need to ensure an app recovers fast for all failure scenarios. Let's keep this jira for only the wait between registration and the first heartbeat, which is what this bug was actually filed for.

          Let me first back out the change where I reduced the value of HEARTBEAT_IDDLE_INTERVAL_SEC from 10 to 1.

          What do you think billie.rinaldi?


          sershe Sergey Shelukhin added a comment -

          Hmm... given that only the first heartbeat is blocking the startup, is it possible to heartbeat as soon as the container is ready, and then apply the wait? This would unblock the AM to install.

          gsaha Gour Saha added a comment -

          sershe the first-heartbeat reduction from 10 secs to 10 ms is not being removed, so the install will start fast. Subsequent heartbeats, which were reduced to a 1 sec frequency, will go back to 10 secs.

          billie Billie Rinaldi added a comment -

          I think it would be okay to leave in the change of HEARTBEAT_IDDLE_INTERVAL_SEC from 10 to 1 here, but I agree we should open an additional ticket to evaluate failure scenarios further.


          People

            Assignee: gsaha Gour Saha
            Reporter: sershe Sergey Shelukhin
            Votes: 0
            Watchers: 4
