Apache Trafodion (Retired) / TRAFODION-2651

The monitor to monitor process communication cannot handle a network reset


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3
    • Component/s: foundation
    • Labels: None

    Description

      The monitor to monitor socket communication does not have reconnect logic to handle a network reset or transient network errors.
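
As a rough illustration of what such reconnect logic could look like, here is a minimal sketch of a bounded retry loop with exponential backoff. It is not the Trafodion monitor code: the function name `connect_with_retry`, its parameters, and the set of errors treated as transient are all assumptions made for this example.

```python
import time

# Errors treated as transient. This selection is an assumption for the sketch.
TRANSIENT_ERRORS = (ConnectionResetError, ConnectionRefusedError, TimeoutError)


def connect_with_retry(connect, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are exhausted.

    connect: zero-argument callable that returns a connected socket-like object.
    The delay doubles after every failed attempt (exponential backoff).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except TRANSIENT_ERRORS:
            if attempt == attempts:
                raise  # out of retries: the caller decides what happens next
            sleep(delay)
            delay *= 2
```

A caller might wrap a real connection attempt as `connect_with_retry(lambda: socket.create_connection((host, port), timeout=5))`, where `host` and `port` identify the peer. Only after the retries are exhausted would the error propagate to whatever node-down handling the caller uses.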

      Analysis:

      • During a ~20 second network reset window, the open sockets detect no errors.
        o The open sockets are dead, but the TCP/IP stack gives no indication that a socket is in an error condition.
      • Once the network is restored, the ZooKeeper client library reports a CONNECTIONLOSS.
        o However, its reconnect logic reestablishes the connection with the quorum.
      • At EPOLL expiration time, the EPOLL logic reports “Not heard from peer=n” and treats the peer as Node Down.
        o The node down logic deletes the corresponding znode (CZClient::WatchNodeDelete()).
        o All monitor processes continually check for expired znodes for each node in the cluster, including their own znode.
          ▪ An expired znode is handled as a down node.
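
The sequence in the analysis can be pictured with a small last-heard tracker: a dead socket raises no error, so the only signal is that a peer has been silent past a timeout. The sketch below is illustrative only. The class `PeerLiveness`, its methods, and the 15 second timeout are hypothetical and are not taken from the monitor source.

```python
class PeerLiveness:
    """Track when each peer was last heard from and report silent peers."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence tolerated (assumed value)
        self.last_heard = {}     # peer id -> timestamp of the last message

    def heard_from(self, peer, now):
        """Record that a message from peer arrived at time now."""
        self.last_heard[peer] = now

    def silent_peers(self, now):
        """Return the peers whose silence has reached the timeout, sorted by id."""
        return sorted(peer for peer, seen in self.last_heard.items()
                      if now - seen >= self.timeout)


tracker = PeerLiveness(timeout=15)
tracker.heard_from(1, now=0)
tracker.heard_from(2, now=0)
# A network reset now cuts off peer 1. No traffic arrives from it, and no
# socket error is raised either, so nothing changes in the tracker.
tracker.heard_from(2, now=10)        # peer 2 is unaffected in this scenario
print(tracker.silent_peers(now=10))  # [] because nothing has expired yet
print(tracker.silent_peers(now=16))  # [1] because peer 1 was silent for 16 s
```

In this toy scenario a reset lasting about 20 seconds outlasts the assumed timeout, so peer 1 is reported as silent even though its process never failed. Without a reconnect step between "silent" and "down", that report leads straight to node-down handling.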

            People

              Assignee: Zalo Correa (zcorrea)
              Reporter: Zalo Correa (zcorrea)