Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0-Ducc
    • Component/s: DUCC
    • Labels:
      None

      Description

      DUCC should provide the capability to move the head node (comprising the broker, database, Orchestrator, PM, RM, SM, and WS) and have the agents switch over seamlessly, without service disruption.

      In this "static" failover implementation, the agents are pre-configured with a list of potential head nodes. The ducc.properties file gains a new key, ducc.head.failover, whose value is a comma-separated list of failover nodes:

      ducc.head.failover = node1, node2, node3...

      At boot time, the agents are configured for broker failover to this set of nodes.

      If ducc.head.failover is not specified, failover is not supported for the installation (i.e., there is no seamless transition of running agents to an alternate broker head node).

      If ducc.head.failover is specified, then the node specified for ducc.head must appear in this list.
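
The two rules above (the key is optional, and ducc.head must appear in the list when the key is present) can be sketched as a small check. This is only an illustration: the function names and the dict-based view of the properties are assumptions, not the code delivered for this issue.

```python
def parse_failover_nodes(value):
    """Split a comma-separated ducc.head.failover value into node names."""
    return [node.strip() for node in value.split(",") if node.strip()]


def check_head_failover(properties):
    """Return (ok, message) for the ducc.head / ducc.head.failover pair.

    `properties` is a hypothetical dict of already-loaded property values.
    """
    head = properties.get("ducc.head", "").strip()
    raw = properties.get("ducc.head.failover")
    if raw is None or not raw.strip():
        # Key absent: failover is simply not supported, which is still valid.
        return True, "failover not configured"
    nodes = parse_failover_nodes(raw)
    if head not in nodes:
        return False, "ducc.head '%s' is not listed in ducc.head.failover" % head
    return True, "failover nodes: " + ", ".join(nodes)
```

For example, a configuration with ducc.head = node4 and ducc.head.failover = node1, node2 would be rejected, while omitting ducc.head.failover entirely is accepted.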

        Activity

        lou.degenaro Lou DeGenaro added a comment -

        check_ducc --c should allow a head node that is not in any node pool for failover.

        lou.degenaro Lou DeGenaro added a comment -

        check_ducc --c should ensure that the current head node and each potential target (new head) node are:
        a) both in the same node pool, or
        b) both not in any node pool.

        ducc_util.py - add new verify_head_failover_configuration()
        check_ducc - employ verify_head_failover_configuration(). If the nodes are not compatible, complain.

        code is delivered.
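
The compatibility rule (same node pool, or neither node in any pool) can be sketched as follows. This is not the delivered verify_head_failover_configuration(); the function names and the `pool_of` mapping are hypothetical stand-ins for however the node pool of a node is actually looked up.

```python
def pools_compatible(head_pool, target_pool):
    """True when both nodes share a pool, or when neither is in any pool.

    A pool of None means "not in any node pool", so None == None covers
    case (b) and equal pool names cover case (a).
    """
    return head_pool == target_pool


def incompatible_failover_nodes(head, candidates, pool_of):
    """Return the candidate nodes whose pool is incompatible with the head's.

    pool_of is a hypothetical dict mapping node name -> pool name; nodes
    absent from it are treated as being in no pool.
    """
    head_pool = pool_of.get(head)
    return [node for node in candidates
            if not pools_compatible(head_pool, pool_of.get(node))]
```

An empty result means every potential head node is compatible with the current one; a non-empty result lists the nodes to complain about.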

        lou.degenaro Lou DeGenaro added a comment -

        move_ducc should ensure that the current head node and the target (new head) node are:
        a) both in the same node pool, or
        b) both not in any node pool.

        NodeConfiguration.java - add new -m <node> flag to request the node pool for the given node
        ducc_util.py - add new get_nodepool(node) utility that invokes the NodeConfiguration class to fetch the node pool for a given node
        move_ducc - employ get_nodepool(node) for the head and target nodes. If they are not compatible, complain and prevent the move.

        code is delivered.

        lou.degenaro Lou DeGenaro added a comment -

        To perform a move:

        1. issue "stop_ducc --all" on the present ducc.head node
        2. issue "move_ducc" on the new ducc.head node
        3. issue "start_ducc" on the new ducc.head node

        Note: "stop_ducc -c head" can be used in step 1 instead, in an attempt to let existing work continue uninterrupted. However, the ducc_monitor(s) will not switch over to the new WS location, so the WS may kill already-running monitored jobs. This issue will be addressed in a later code delivery.

        lou.degenaro Lou DeGenaro added a comment -

        bash-4.1$ ./ducc_runtime/admin/move_ducc --help
        Usage: move_ducc [options]

        Options:
          -h, --help     show this help message and exit
          -d, --debug    display debugging messages
          -o, --offline  indicate current DUCC head node is offline, Note: USE THIS
                         OPTION WITH EXTREME CAUTION else risk corrupting database
          -q, --quiet    do not display informational messages

        Run this command on the host which will become the new DUCC head node.

        Prerequisites:
        1. the current ducc.head node in site.ducc.properties is up
        2. the head daemons (broker, database, or, pm, rm, sm, ws) on the ducc.head node are down (e.g. stop_ducc -c head)

        Operation:
        To the extent possible, the cluster will be checked to see if it is safe to edit the site.ducc.properties file, and if so then a backup of the original file is made then the requisite changes are made to realize the head node move.
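
The "back up, then edit" part of the operation described above might look roughly like the sketch below. It is an assumption-laden illustration, not the move_ducc implementation: the function name, the backup file naming, and the line-by-line rewrite are all hypothetical, and the cluster safety checks are omitted.

```python
import re
import shutil
import time


def move_head(properties_path, new_head):
    """Back up the properties file, then point ducc.head at new_head.

    Returns the path of the backup copy. The backup suffix is an assumption.
    """
    backup_path = "%s.%s.bak" % (properties_path, time.strftime("%Y%m%d-%H%M%S"))
    shutil.copy2(properties_path, backup_path)  # backup first, edit second

    with open(properties_path) as f:
        lines = f.readlines()

    # Matches "ducc.head =" but not "ducc.head.failover =".
    key = re.compile(r"^\s*ducc\.head\s*=")
    replaced = False
    for i, line in enumerate(lines):
        if key.match(line):
            lines[i] = "ducc.head = %s\n" % new_head
            replaced = True
    if not replaced:
        # No existing ducc.head entry: append one on its own line.
        if lines and not lines[-1].endswith("\n"):
            lines[-1] += "\n"
        lines.append("ducc.head = %s\n" % new_head)

    with open(properties_path, "w") as f:
        f.writelines(lines)
    return backup_path
```

Keeping the untouched original alongside the edited file means a failed move can be undone by copying the backup back into place.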

        lou.degenaro Lou DeGenaro added a comment -
        • AbstractDuccComponent
          > employ new method composeBrokerFailoverUrl when a proper ducc.head.failover is specified in ducc.properties
          > ducc.broker.url comprises a set of two or more nodes to which DUCC daemons will attempt failover when unable to communicate with the primary node
        • NodeAgent
          > record ducc.broker.url to log when starting
        • OrchestratorComponent
          > record ducc.broker.url to log when starting
        • DuccTransportConfiguration
          > employ ducc.broker.url calculated by AbstractDuccComponent
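
As a rough illustration of what a composed failover ducc.broker.url could look like, the sketch below builds an ActiveMQ failover transport URI from a node list. The real composeBrokerFailoverUrl is a Java method on AbstractDuccComponent whose details are not shown in this issue; the Python function name, the tcp protocol, and the port 61617 used here are assumptions.

```python
def compose_broker_failover_url(nodes, port=61617, protocol="tcp"):
    """Build a broker URL, using ActiveMQ failover syntax for 2+ nodes.

    port and protocol are assumed defaults, not values taken from the issue.
    """
    urls = ["%s://%s:%d" % (protocol, node, port) for node in nodes]
    if len(urls) < 2:
        # A single node needs no failover wrapper.
        return urls[0] if urls else ""
    return "failover:(%s)" % ",".join(urls)
```

With nodes node1 and node2 this yields failover:(tcp://node1:61617,tcp://node2:61617), which a client tries in turn when it cannot reach its current broker.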
        lou.degenaro Lou DeGenaro added a comment -
        • ducc_util.py: improved detection of ssh issues
          > check ducc.properties for optional ducc.head.failover
          > ensure ducc.head is listed in ducc.head.failover
          > test viability of ssh to failover nodes
        • add to default.ducc.properties
          > ducc.head.failover = ${ducc.head}

          People

          • Assignee:
            lou.degenaro Lou DeGenaro
            Reporter:
            lou.degenaro Lou DeGenaro