UIMA-1298

A shared remote CM hangs when one of its clients runs out of memory

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3AS
    • Fix Version/s: 2.3.1AS
    • Component/s: Async Scaleout
    • Labels:
      None

      Description

      Twice I observed that when one client aggregate of a shared remote CM crashed with an out-of-memory exception, the service stopped responding to the other client's requests. No errors were found in the service log. The client was not using the service at the time of the crash. Requests stacked up on the input queue ... almost as if the service was blocked on an empty pool, or ...? Killing a client (Ctrl-C) did not cause the hang. Weird!

        Activity

        Jerry Cwiklik added a comment -

        This indicates that the client that crashed did not send requests to free its CASes. Most likely all of the CM's CASes have been drained and the service is hung on an empty pool. How do you suggest fixing this? Using a timer to time out freeCas requests seems difficult to get right.

        Marshall Schor added a comment -

        Defer beyond the 2.3.0 release.

        Jerry Cwiklik added a comment -

        Consider spinning up a thread in the CM service that periodically checks outgoing connections to detect failed clients. Sending a Ping message to each client could be used to detect stale connections and dead clients.
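
The periodic check proposed above could look roughly like the following. This is only a sketch under stated assumptions: `ClientLivenessMonitor`, the `ping` predicate, and the `onDeadClient` callback are hypothetical names, not part of UIMA-AS, and the actual Ping transport is left abstract.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical liveness monitor; none of these names come from UIMA-AS. */
class ClientLivenessMonitor {
    private final Set<String> clients = ConcurrentHashMap.newKeySet();
    private final Predicate<String> ping;        // assumed: returns true if the client answered
    private final Consumer<String> onDeadClient; // assumed: e.g. free that client's CASes

    ClientLivenessMonitor(Predicate<String> ping, Consumer<String> onDeadClient) {
        this.ping = ping;
        this.onDeadClient = onDeadClient;
    }

    void register(String clientId) {
        clients.add(clientId);
    }

    /** One sweep: ping every known client, then drop and report those that do not answer. */
    void checkOnce() {
        for (String id : clients) {
            boolean alive;
            try {
                alive = ping.test(id);
            } catch (RuntimeException e) {
                alive = false; // a failed ping is treated as a dead client
            }
            // remove() returns true only once, so each dead client is reported once
            if (!alive && clients.remove(id)) {
                onDeadClient.accept(id);
            }
        }
    }

    /** Runs checkOnce() on a daemon thread at a fixed period. */
    ScheduledExecutorService start(long periodMillis) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "cm-client-ping");
            t.setDaemon(true);
            return t;
        });
        ses.scheduleAtFixedRate(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // swallow so that one failing callback does not cancel the schedule
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return ses;
    }
}
```

Choosing the ping period and deciding how long to wait for an answer raises the same question as the freeCas timer: the sketch leaves both to the caller.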

        Jerry Cwiklik added a comment -

        Deferred to the next release

        Jerry Cwiklik added a comment -

        Changing the scope of the problem to the scenario where the client of the CM terminates abruptly. In this case, detect the failure while sending a produced (child) CAS to the client and force the release of ALL outstanding CASes. Also, since the client is no longer reachable, the CM should be stopped and the input CAS released as well. CasIterator.release() should be called to stop the CM.

        Solving the hung CM client case is more difficult. Using a timer to release CASes is error-prone since there is no easy way to determine what the timeout should be.

        Jerry Cwiklik added a comment -

        Modified to stop the CM from producing new CASes if a send() fails while trying to deliver a CAS to a client. All outstanding CASes (those that have been sent to a client but not explicitly freed) will be released. The parent CAS will be unlocked and subsequently released to the service's CAS pool. This fix only addresses the scenario where the client's reply queue is removed.
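
To illustrate the behavior described above, here is a minimal model of it, not the actual UIMA-AS change. `ChildCasDelivery`, `Sender`, and the integer CAS ids are all hypothetical; the CM's CAS pool is modeled as a `Semaphore`, and the `stopSegmenter` callback stands in for the point where `CasIterator.release()` would be called.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.Semaphore;

/** Hypothetical model of child-CAS delivery; not UIMA-AS code. */
class ChildCasDelivery {
    /** Assumed stand-in for the transport; throws when the client's reply queue is gone. */
    interface Sender {
        void send(int casId) throws Exception;
    }

    private final Semaphore pool;                               // permits = free CASes in the CM's pool
    private final Set<Integer> outstanding = new HashSet<>();   // sent to a client, not yet freed
    private volatile boolean stopped;

    ChildCasDelivery(int poolSize) {
        pool = new Semaphore(poolSize);
    }

    /** Normal path: the client explicitly frees one delivered CAS. */
    synchronized void free(int casId) {
        if (outstanding.remove(Integer.valueOf(casId))) {
            pool.release();
        }
    }

    /** Delivers children until done or until a send fails; returns the number delivered. */
    int deliverAll(Iterator<Integer> children, Sender sender, Runnable stopSegmenter) {
        int delivered = 0;
        while (!stopped && children.hasNext()) {
            try {
                pool.acquire(); // blocks on an empty pool: the hang from the report
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            int casId = children.next();
            synchronized (this) {
                outstanding.add(casId);
            }
            try {
                sender.send(casId);
                delivered++;
            } catch (Exception e) {
                abort(stopSegmenter); // client unreachable: stop and reclaim everything
            }
        }
        return delivered;
    }

    /** Stops production and returns every outstanding CAS to the pool. */
    private synchronized void abort(Runnable stopSegmenter) {
        stopped = true;
        stopSegmenter.run(); // in UIMA terms, where CasIterator.release() would go
        pool.release(outstanding.size());
        outstanding.clear();
    }

    int freeCases() {
        return pool.availablePermits();
    }
}
```

As in the fix itself, this only helps when the send actually fails. A client that is hung but still has a reply queue never triggers `abort`, so the pool can still drain.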


          People

          • Assignee: Jerry Cwiklik
          • Reporter: Burn Lewis
