Details
Bug
Status: Closed
Major
Resolution: Fixed
2.1-incubating
None
Important
Description
Certain server and network failures are not detected by the monitor processes. When that happens, a safety-net failure detection mechanism triggers on all Trafodion nodes. This safety-net mechanism is controlled by the environment variable SQ_MON_SYNC_TIMEOUT, currently set to 15 minutes.
This JIRA enhances the node failure mechanism in the Trafodion foundation components, specifically the monitor process, so that it detects a non-responsive node and handles it as a node-down condition. Detection is driven by a configurable timeout that expires before the safety-net failure mechanism described above would trigger.
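
To illustrate the idea only, here is a minimal Python sketch of timeout-based node-down detection. It is not the Trafodion monitor code: the class NodeWatch, its methods, the node names, and the 30-second value are all hypothetical. The only value taken from the description is the 15-minute safety net, and the sketch assumes the configurable timeout must be shorter than it.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# The 15-minute safety net from the description, expressed in seconds.
SAFETY_NET_TIMEOUT_S = 15 * 60


@dataclass
class NodeWatch:
    """Hypothetical helper: tracks when each peer node was last heard from."""

    node_timeout_s: float                          # configurable per-node timeout (assumed name)
    clock: Callable[[], float] = time.monotonic    # injectable clock, useful for testing
    last_seen: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # The per-node timeout only helps if it fires before the safety net.
        if not 0 < self.node_timeout_s < SAFETY_NET_TIMEOUT_S:
            raise ValueError("node timeout must be positive and shorter than the safety net")

    def heard_from(self, node: str) -> None:
        """Record that a node responded just now."""
        self.last_seen[node] = self.clock()

    def down_nodes(self) -> List[str]:
        """Return nodes silent for at least node_timeout_s, in sorted order."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen >= self.node_timeout_s
        )
```

A caller would invoke heard_from whenever a peer responds and poll down_nodes periodically, treating each returned node as a node-down condition well before the safety-net timeout is reached.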