Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- 2.4.0
- None
Description
When creating a massive request (a rolling upgrade on a cluster with 1,000 nodes), the sheer size of the request appears to slow down the ActionScheduler. Each command took between 1 and 2 minutes to run (even server-side tasks).
The cause of this can be seen in the following two stack traces:
at org.apache.ambari.server.orm.dao.DaoUtils.selectList(DaoUtils.java:60)
at org.apache.ambari.server.orm.dao.HostRoleCommandDAO.findByPKs(HostRoleCommandDAO.java:293)
at org.apache.ambari.server.orm.dao.HostRoleCommandDAO$$EnhancerByGuice$$21789cd1.CGLIB$findByPKs$7(<generated>)
at org.apache.ambari.server.orm.dao.HostRoleCommandDAO$$EnhancerByGuice$$21789cd1$$FastClassByGuice$$aa975e7f.invoke(<generated>)
at com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
at org.apache.ambari.server.orm.AmbariLocalSessionInterceptor.invoke(AmbariLocalSessionInterceptor.java:53)
at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
at org.apache.ambari.server.orm.dao.HostRoleCommandDAO$$EnhancerByGuice$$21789cd1.findByPKs(<generated>)
at org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getTasks(ActionDBAccessorImpl.java:700)
at org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getTasks(ActionDBAccessorImpl.java:84)
at org.apache.ambari.server.actionmanager.Stage.<init>(Stage.java:157)
at org.apache.ambari.server.actionmanager.StageFactoryImpl.createExisting(StageFactoryImpl.java:72)
at org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getStagesInProgress(ActionDBAccessorImpl.java:303)
at org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:341)
at org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:302)
at java.lang.Thread.run(Thread.java:745)

at org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getTasks(ActionDBAccessorImpl.java:700)
at org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getTasks(ActionDBAccessorImpl.java:84)
at org.apache.ambari.server.actionmanager.Stage.<init>(Stage.java:157)
at org.apache.ambari.server.actionmanager.StageFactoryImpl.createExisting(StageFactoryImpl.java:72)
at org.apache.ambari.server.actionmanager.Request.<init>(Request.java:199)
at org.apache.ambari.server.actionmanager.Request$$FastClassByGuice$$9071e03.newInstance(<generated>)
at com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40)
at com.google.inject.internal.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:60)
at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:85)
at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:254)
at com.google.inject.internal.InjectorImpl$4$1.call(InjectorImpl.java:978)
at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1024)
at com.google.inject.internal.InjectorImpl$4.get(InjectorImpl.java:974)
at com.google.inject.assistedinject.FactoryProvider2.invoke(FactoryProvider2.java:632)
at com.sun.proxy.$Proxy26.createExisting(Unknown Source)
at org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getRequests(ActionDBAccessorImpl.java:784)
at org.apache.ambari.server.serveraction.ServerActionExecutor.cleanRequestShareDataContexts(ServerActionExecutor.java:259)
- locked <0x00007ff0a14083c8> (a java.util.HashMap)
at org.apache.ambari.server.serveraction.ServerActionExecutor.doWork(ServerActionExecutor.java:454)
at org.apache.ambari.server.serveraction.ServerActionExecutor$1.run(ServerActionExecutor.java:160)
at java.lang.Thread.run(Thread.java:745)
It's clear from these stacks that every PENDING stage (roughly 15,000) was being loaded into memory every second, along with its accompanying tasks. This is unnecessary: these methods don't need all stages, only the next one, because all stages within a single request run synchronously.
The proposed solution is to fix the StageEntity.findByCommandStatuses call so that it no longer returns every stage:
SELECT stage.requestid, MIN(stage.stageid)
FROM stageentity stage, hostrolecommandentity hrc
WHERE hrc.status IN :statuses
  AND hrc.stageid = stage.stageid
  AND hrc.requestid = stage.requestid
GROUP BY stage.requestid
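The effect of this query can be sketched in plain Java. The Hrc record and findFirstStageInProgress helper below are hypothetical stand-ins, not actual Ambari classes; the point is that only the minimum matching stage id per request survives, instead of every PENDING stage:

```java
import java.util.*;
import java.util.stream.*;

public class NextStageQuery {
    // Hypothetical flattened host-role-command row: (requestId, stageId, status).
    record Hrc(long requestId, long stageId, String status) {}

    // Mirrors the proposed SQL: for each request, keep only the lowest
    // stage id that still has a command in one of the given statuses.
    static Map<Long, Long> findFirstStageInProgress(List<Hrc> commands, Set<String> statuses) {
        return commands.stream()
            .filter(c -> statuses.contains(c.status()))
            .collect(Collectors.groupingBy(Hrc::requestId,
                Collectors.mapping(Hrc::stageId,
                    Collectors.minBy(Long::compare))))
            .entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().get()));
    }

    public static void main(String[] args) {
        List<Hrc> commands = List.of(
            new Hrc(1, 0, "COMPLETED"),
            new Hrc(1, 1, "PENDING"),
            new Hrc(1, 2, "PENDING"),
            new Hrc(2, 5, "IN_PROGRESS"));
        Map<Long, Long> next =
            findFirstStageInProgress(commands, Set.of("PENDING", "IN_PROGRESS"));
        // Request 1 collapses to stage 1 (its 15,000-stage analogue would too);
        // request 2 collapses to stage 5.
        System.out.println(new TreeMap<>(next)); // → {1=1, 2=5}
    }
}
```

With a GROUP BY pushed into the database, the scheduler loads one row per in-progress request rather than materializing every stage and its tasks on each polling cycle.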
Note that this might not appear on trunk due to AMBARI-18868.