Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
Description
During a cluster upgrade, some commands are expected to take a very long time, depending on the size of the cluster and how much data is stored. For example, a NameNode restart with SafeMode exit may take more than 30 minutes on one cluster and less than 1 minute on another.
Today, the only way to adjust the timeout is across the board, for all commands, by editing ambari.properties and setting agent.task.timeout. This doesn't work well, since the majority of restarts during an upgrade are not on a master component.
There needs to be a way to instruct Ambari that a restart should be allowed to run for a relatively long period of time.
- Both the Java side (Ambari server) and the Python side (agent) need to be considered here. The Python code should not give up and return a FAILED state, and the Ambari server should not set the task to TIMEDOUT.
- This can be useful in both normal restarts and upgrade scenarios.
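The requirement in the two bullets above can be sketched as a single lookup that both sides share: a command may carry its own timeout, and only when it does not is the global value used. This is a minimal illustration, not Ambari code. The `timeout` key on the command, the function names, and the default of 900 seconds are all assumptions made for the example.

```python
# Hypothetical sketch; none of these names come from the Ambari code base.

# Assumed stand-in for the global agent.task.timeout value, in seconds.
DEFAULT_TASK_TIMEOUT = 900


def effective_timeout(command, default=DEFAULT_TASK_TIMEOUT):
    """Return the timeout in seconds for a command, preferring its own override."""
    override = command.get("timeout")
    if override is None:
        return default
    override = int(override)
    if override <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return override


def has_timed_out(command, elapsed_seconds, default=DEFAULT_TASK_TIMEOUT):
    """Single check that a server-side and an agent-side watchdog could both call."""
    return elapsed_seconds > effective_timeout(command, default)
```

If both the server and the agent derive the limit from the same command field, neither gives up before the other: a NameNode restart carrying `timeout=1800` is still considered running at 1000 seconds, while a command without an override is not.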
Upgrade Only
If considering this functionality in the context of an upgrade only, then it is conceivable that this logic can be placed inside of the upgrade XML packs:
  <service name="HDFS">
    <component name="NAMENODE">
      <upgrade>
        <task xsi:type="restart-task" timeout="1800"/>
      </upgrade>
    </component>
  </service>
- This would allow future mpacks to be able to control the restart of components. Perhaps this can even be slightly abstracted out:
  <service name="HDFS">
    <component name="NAMENODE">
      <upgrade>
        <task xsi:type="restart-task" timeout="upgrade.parameter.master.restart.long"/>
      </upgrade>
    </component>
  </service>

  upgrade.parameter.slave.restart.short = 300
  upgrade.parameter.slave.restart.long = 900
  upgrade.parameter.master.restart.short = 1500
  upgrade.parameter.master.restart.long = 1800
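The two forms of the timeout attribute proposed above (a literal number of seconds, or the name of a parameter) could be resolved along these lines. This is only a sketch under stated assumptions: the `<processing>` wrapper element, which exists here solely to declare the `xsi` namespace so the fragment parses, and the function names are hypothetical, and the parameter values are the ones listed in this proposal rather than anything read from a real Ambari installation.

```python
import xml.etree.ElementTree as ET

# Values copied from the proposal; a real implementation would load them
# from a properties file instead of hard-coding them.
UPGRADE_PARAMETERS = {
    "upgrade.parameter.slave.restart.short": 300,
    "upgrade.parameter.slave.restart.long": 900,
    "upgrade.parameter.master.restart.short": 1500,
    "upgrade.parameter.master.restart.long": 1800,
}

# ElementTree expands the xsi:type attribute to this qualified name.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Hypothetical wrapper element so that the xsi prefix is declared.
SAMPLE = """
<processing xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <service name="HDFS">
    <component name="NAMENODE">
      <upgrade>
        <task xsi:type="restart-task" timeout="upgrade.parameter.master.restart.long"/>
      </upgrade>
    </component>
  </service>
</processing>
"""


def resolve_timeout(raw, parameters):
    """Accept either a literal number of seconds or a parameter name."""
    if raw.isdigit():
        return int(raw)
    if raw not in parameters:
        raise ValueError(f"unknown timeout parameter: {raw}")
    return int(parameters[raw])


def restart_timeouts(xml_text, parameters):
    """Map (service, component) to the resolved restart timeout in seconds."""
    root = ET.fromstring(xml_text)
    result = {}
    for service in root.iter("service"):
        for component in service.iter("component"):
            for task in component.iter("task"):
                raw = task.get("timeout")
                if task.get(XSI_TYPE) == "restart-task" and raw:
                    key = (service.get("name"), component.get("name"))
                    result[key] = resolve_timeout(raw, parameters)
    return result


if __name__ == "__main__":
    print(restart_timeouts(SAMPLE, UPGRADE_PARAMETERS))
```

Rejecting an unknown parameter name with an error, rather than silently falling back to a default, is a choice made for this sketch so that a typo in an upgrade pack surfaces early.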