UIMA-5060

DUCC Orchestrator (OR) "warm" restart issues

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0-Ducc
    • Fix Version/s: 2.2.0-Ducc
    • Component/s: DUCC
    • Labels:
      None

      Description

      Address issues pertaining to the ability to shut down and restart the OR component without loss of active jobs, services, or APs (aka Managed Reservations).

        Activity

        lou.degenaro Lou DeGenaro added a comment -

        code is delivered.

        lou.degenaro Lou DeGenaro added a comment -
        • remove sequenceNumberStateAbbreviated from orchestrator-state.json, since it is obsolete
        lou.degenaro Lou DeGenaro added a comment - - edited
        • for restoration of next service sequence number, use the greater of seqno in state/sm.properties and registry data from database comprising services.
        • log a WARNing in sm.log if registry data is used.
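The restoration rule above can be sketched as follows. This is an illustration in Python, not DUCC's Java code: the function name `restore_next_seqno`, its parameters, and the logger name are all hypothetical, and whether the restored value needs to be incremented before use depends on a convention the comment does not state.

```python
import logging

# Stand-in for sm.log; the logger name is an assumption for illustration.
logger = logging.getLogger("sm")


def restore_next_seqno(properties_seqno, registry_ids):
    """Return the greater of the persisted seqno and the highest id in the registry.

    properties_seqno: value read from the state properties file (hypothetical input).
    registry_ids: ids of services found in the database registry (hypothetical input).
    """
    # An empty registry contributes nothing, so the properties value wins.
    registry_max = max(registry_ids, default=properties_seqno)
    if registry_max > properties_seqno:
        # The registry is ahead of the properties file: use it and warn.
        logger.warning(
            "seqno taken from registry data: %d (properties file had %d)",
            registry_max,
            properties_seqno,
        )
        return registry_max
    return properties_seqno
```

The warning fires only when the database is ahead of the properties file, which is the case that indicates the properties file was stale or lost.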
        lou.degenaro Lou DeGenaro added a comment -
        • add to DUCC Book a chapter in the Admin Guide about the state directory, describing the sub-directories and files therein
        lou.degenaro Lou DeGenaro added a comment -
        • for restoration of next job/reservation sequence number, use the greater of seqno in state/orchestrator.properties and historical data from database comprising jobs and reservations.
        • log a WARNing in or.log if historical data is used.
        lou.degenaro Lou DeGenaro added a comment -
        • reduce OR dependencies on CommonConfiguration
        • employ DuccPropertiesResolver
        lou.degenaro Lou DeGenaro added a comment -
        • Fix leak of JD entries in OR's map of processId-to-jobId.
        • Make ProcessToJobMap its own class and ensure that code using it does not work from a stale copy.
        • Code refactoring for simplification and clarity in ProcessAccounting.
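A minimal sketch of the idea behind the map changes, written in Python rather than DUCC's Java. Only the class name comes from the comment; the methods, the locking scheme, and the id formats are assumptions. The point is that callers go through one shared object instead of holding copies, and that every entry for a finished job, including the JD's, is dropped together.

```python
import threading


class ProcessToJobMap:
    """Hypothetical stand-in for a shared processId-to-jobId map."""

    def __init__(self):
        self._lock = threading.Lock()
        self._map = {}

    def put(self, process_id, job_id):
        with self._lock:
            self._map[process_id] = job_id

    def get(self, process_id):
        # Always read through the shared instance, never a cached copy.
        with self._lock:
            return self._map.get(process_id)

    def remove_job(self, job_id):
        """Drop every entry that points at job_id; return how many were removed."""
        with self._lock:
            stale = [pid for pid, jid in self._map.items() if jid == job_id]
            for pid in stale:
                del self._map[pid]
            return len(stale)

    def size(self):
        with self._lock:
            return len(self._map)
```

Removing by job id rather than by process id is one way to avoid leaving a JD entry behind when only the JP entries are cleaned up.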
        lou.degenaro Lou DeGenaro added a comment -

        Enhance stop_ducc and, for symmetry, start_ducc to support option -c head, where head represents { or, rm, pm, sm, ws, db, broker }. That is, all top-level DUCC daemons except agents.
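The alias expansion described here could look roughly like the sketch below. It is not taken from the stop_ducc or start_ducc scripts: the function `expand_components` and the constant name are hypothetical, and only the component list itself comes from the comment.

```python
# Component list as given in the comment; the constant name is illustrative.
HEAD_COMPONENTS = ("or", "rm", "pm", "sm", "ws", "db", "broker")


def expand_components(requested):
    """Expand the 'head' alias into individual daemons, preserving order and dropping repeats."""
    expanded = []
    for name in requested:
        members = HEAD_COMPONENTS if name == "head" else (name,)
        for member in members:
            if member not in expanded:
                expanded.append(member)
    return expanded
```

Under this sketch, `expand_components(["head"])` yields the seven head-node daemons and never includes agents, since agents are not part of the alias.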

        lou.degenaro Lou DeGenaro added a comment -

        Previously OR supported --cold and --warm starts, where the former caused all active work to be moved to the Completed state (no recovery) and the latter caused partial recovery (active Jobs forced to Completed, while active Reservations recovered). We abandon introducing a third --hot option that would implement a full recovery and instead re-purpose --warm to do so. Partial recovery is deemed unnecessary and somewhat arbitrary.
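The two start modes after this change can be summarized in a short sketch. The data model (a list of dictionaries with a `state` field) and the function name `apply_start_mode` are assumptions made for illustration; the actual OR recovery code is Java and is not shown in this issue.

```python
def apply_start_mode(active_work, mode):
    """Return the work list as the OR would see it after a restart.

    cold: no recovery, every active item is moved to Completed.
    warm: full recovery, every active item keeps its current state.
    """
    if mode == "cold":
        return [dict(item, state="Completed") for item in active_work]
    if mode == "warm":
        return [dict(item) for item in active_work]
    raise ValueError(f"unknown start mode: {mode!r}")
```

There is deliberately no branch that treats Jobs and Reservations differently, which mirrors the decision to drop partial recovery.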

        lou.degenaro Lou DeGenaro added a comment -

        Communication-related exceptions were due to misconfiguration of user-defined test services, which used port 1000 instead of 1099 for the JMX port.

        lou.degenaro Lou DeGenaro added a comment -

        Note: when OR is down for an extended period (e.g. > 1 minute) it shows as down on DuccMon, as do SM and RM, since their publications are based on receipt of the OR publication.

        lou.degenaro Lou DeGenaro added a comment -

        3. Jobs survival

        • submitted job comprising 101 sleeper work items
        • waited for job to enter Initializing state
        • stopped OR & waited 5 minutes (initialization is 2 minutes)
        • started OR hot & observed that Job continued to next state Assigned
        • stopped & started OR several times while 4 JPs were Initializing & Running
        • noticed that WS Processes tab had incomplete information for a short while after OR was re-started
        • Job ran to Completed state successfully
        • user logs are normal
        • noticed that several daemons had communication-related exceptions
        lou.degenaro Lou DeGenaro added a comment -

        2. Services survival

        • created UIMA-AS FixedSleep AE services
        • made one autostart with 2 instances & started
        • made one manual with 1 instance & started
        • issued "stop_ducc -c or" & "start_ducc -c or --hot" several times
        • ensured services were still up
        • stopped both services successfully
        • started both services successfully
        lou.degenaro Lou DeGenaro added a comment - - edited

        TESTING with head node + 4 worker nodes

        1. AP survival

        • submitted: viaducc /bin/sleep 300
        • issued "stop_ducc -c or" & "start_ducc -c or --hot" several times
        • ensured Orchestrator was down when sleep interval completed
        • issued "start_ducc -c or --hot"
        • checked that AP completed successfully

        Managed Reservation 14 submitted.
        id:14 location:28491@host67
        id:14 state:WaitingForResources
        host350.domain.net
        id:14 remote:21198@host350.domain.net
        id:14 state:Running
        id:14 state:Completed
        id:14 rationale:code=0
        id:14 rc:0

        lou.degenaro Lou DeGenaro added a comment -

        Job Drivers (JDs) should not log "raw" exceptions when unable to communicate with the Orchestrator (OR). Instead, add to log: "Status reporting stopped. Condition may be temporary." and when condition is cleared (if ever) add to log: "Status reporting resumed.".
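The logging behaviour requested above amounts to logging on state transitions rather than on every failure. A minimal sketch, assuming a hypothetical `StatusReportLog` helper that is told after each attempt whether the report reached the OR; only the two message strings come from the comment.

```python
class StatusReportLog:
    """Log once when reporting stops and once when it resumes (hypothetical helper)."""

    STOPPED = "Status reporting stopped. Condition may be temporary."
    RESUMED = "Status reporting resumed."

    def __init__(self, log):
        self._log = log  # any callable that accepts a message string
        self._stopped = False

    def record(self, delivered):
        """Call after each report attempt; delivered is True if the OR was reached."""
        if not delivered and not self._stopped:
            self._stopped = True
            self._log(self.STOPPED)
        elif delivered and self._stopped:
            self._stopped = False
            self._log(self.RESUMED)
```

Repeated failures while the OR is down produce a single line instead of one stack trace per attempt, and the resumed message appears only if the condition actually clears.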


          People

          • Assignee:
            lou.degenaro Lou DeGenaro
            Reporter:
            lou.degenaro Lou DeGenaro