  1. Apache Trafodion (Retired)
  2. TRAFODION-2235

Enhance node failure detection and coordination

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 2.1-incubating
    • 2.1-incubating
    • foundation, installer
    • None
    • Important

    Description

      Certain server and network failures go undetected by the monitor processes. When that happens, a safety-net failure detection mechanism eventually triggers on all Trafodion nodes. This safety-net mechanism is controlled by the environment variable SQ_MON_SYNC_TIMEOUT, currently set to 15 minutes.

      This JIRA enhances the node failure detection mechanism in the Trafodion foundation components, specifically the monitor process. The monitor should detect a non-responsive node and handle it as a node-down condition when a configurable timeout expires, before the safety-net mechanism described above triggers.
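
The intended behavior can be illustrated with a minimal sketch. This is not the Trafodion monitor's code: the class `NodeFailureDetector`, its methods, and the constant `SAFETY_NET_TIMEOUT_S` are hypothetical names chosen for illustration, and the sketch assumes each node's last response time is tracked in seconds. The only details taken from the description are the 15-minute safety net and the requirement that the configurable timeout fire before it.

```python
from dataclasses import dataclass, field

# The 15-minute safety net mentioned in the description, in seconds.
# (Hypothetical constant; how SQ_MON_SYNC_TIMEOUT is actually encoded is not assumed here.)
SAFETY_NET_TIMEOUT_S = 15 * 60


@dataclass
class NodeFailureDetector:
    """Hypothetical detector: marks a node down after node_timeout_s of silence."""

    node_timeout_s: float
    last_seen: dict = field(default_factory=dict)
    down: set = field(default_factory=set)

    def __post_init__(self):
        # The configurable timeout must fire before the safety net does.
        if not 0 < self.node_timeout_s < SAFETY_NET_TIMEOUT_S:
            raise ValueError(
                "node timeout must be positive and shorter than the safety net"
            )

    def record_response(self, node, now):
        """Remember the time (in seconds) at which a node last responded."""
        self.last_seen[node] = now

    def check(self, now):
        """Return nodes that just crossed the timeout and mark them down.

        Nodes already marked down are not reported again; bringing a node
        back up is outside the scope of this sketch.
        """
        newly_down = sorted(
            node
            for node, seen in self.last_seen.items()
            if node not in self.down and now - seen >= self.node_timeout_s
        )
        self.down.update(newly_down)
        return newly_down
```

In this sketch, a caller would invoke `record_response` whenever a node answers and call `check` periodically, treating each returned node as a node-down condition well before the 15-minute safety net would trigger.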

          People

            Assignee: Zalo Correa (zcorrea)
            Reporter: Zalo Correa (zcorrea)