Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 0.5.0
    • Fix Version/s: 0.5.0
    • Labels:
      None

      Description

      Use case:

      • at startup or during runtime, S4 nodes are notified that new applications are available. The code for these applications is fetched from a remote repository, installed on the S4 nodes, and the applications are started automatically.

      How does it work?

      • Zookeeper is used for coordination: when a new app is available, a new znode is created under /s4-cluster-name/apps/app1
      • S4 nodes are notified of this new znode, which contains the s4r URI as metadata
      • S4 nodes can then fetch the s4r, copy it to a local directory and start it (see the sketch below)
      • we also need a facility to register the app in Zookeeper, along with the required metadata
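
      A minimal sketch of that coordination using the plain ZooKeeper Java API (class names, paths and the session timeout are illustrative assumptions, not the actual implementation):

      import org.apache.zookeeper.CreateMode;
      import org.apache.zookeeper.WatchedEvent;
      import org.apache.zookeeper.Watcher;
      import org.apache.zookeeper.ZooDefs;
      import org.apache.zookeeper.ZooKeeper;

      public class DeploymentCoordinationSketch implements Watcher {

          private final ZooKeeper zk;
          private final String clusterName;

          public DeploymentCoordinationSketch(String zkConnect, String clusterName) throws Exception {
              this.zk = new ZooKeeper(zkConnect, 10000, this);
              this.clusterName = clusterName;
          }

          // Publisher side: announce a new app by creating a znode whose data is the s4r URI.
          public void publishApp(String appName, String s4rUri) throws Exception {
              zk.create("/" + clusterName + "/apps/" + appName, s4rUri.getBytes("UTF-8"),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
          }

          // Node side: watch the apps znode so that new children trigger a notification.
          public void watchApps() throws Exception {
              zk.getChildren("/" + clusterName + "/apps", true);
          }

          @Override
          public void process(WatchedEvent event) {
              if (event.getType() == Event.EventType.NodeChildrenChanged) {
                  // re-read the children, read the s4r URI stored in each new child znode,
                  // fetch the s4r, copy it to a local directory and start the application
              }
          }
      }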

      For a first milestone, I suggest:

      • a simple file system based repository (can be a distributed file system)
      • deployment only, no unloading

      Later we can add extensions:

      • We could provide various repository clients, depending on the protocol specified, and on the level of trust of the repository (although for a first version, we would just provide a simple mechanism and assume a trustworthy environment).
      • more metadata in the application znode, in order to control the state of the app, a time to start/stop, a number of nodes, node requirements, etc.

      I'll start working on a first implementation, and I'm eager to receive suggestions.

      1. S4-24.patch
        135 kB
        Matthieu Morel

        Issue Links

          Activity

          Matthieu Morel created issue -
          Matthieu Morel made changes -
          Field Original Value New Value
          Parent S4-10 [ 12526968 ]
          Issue Type New Feature [ 2 ] Sub-task [ 7 ]
          Matthieu Morel made changes -
          Assignee Matthieu Morel [ mmorel ]
          Leo Neumeyer made changes -
          Link This issue relates to S4-4 [ S4-4 ]
          Matthieu Morel added a comment -

          I implemented the first milestone as described above, in https://github.com/matthieumorel/s4-piper/tree/deployment-manager

          Most of the logic of the deployment manager is factored into the DistributedDeploymentManager class.

          I also included a test in test/java/org/apache/s4/deploy/TestAutomaticDeployment that:

          • uses a simple application packaged with the gradle packaging script from https://github.com/leoneu/s4-piper-app
          • starts an S4 node with a custom module that includes Zookeeper task assignment and the distributed deployment manager
          • creates a znode for the new app, referencing the packaged application file
          • checks that the S4 node detects the new app and correctly starts it (the flow is sketched below).
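
          Schematically, the check looks something like the sketch below (class names, paths and the startup marker are illustrative assumptions, not the actual test code):

          import static org.junit.Assert.assertTrue;

          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;
          import org.junit.Test;

          public class AutomaticDeploymentFlowSketch {

              @Test
              public void newAppIsDeployedAndStarted() throws Exception {
                  // assumes an S4 node configured with the distributed deployment manager is already running
                  ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, null);

                  // announce the packaged test app by creating its znode, with the s4r URI as data
                  String appZnode = "/s4-test-cluster/apps/simple-app";
                  zk.create(appZnode, "file:///tmp/apps/simple-app.s4r".getBytes("UTF-8"),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

                  // wait for some marker (here: a hypothetical status znode) showing the app was started
                  assertTrue(waitForZnode(zk, appZnode + "/status", 30000));
              }

              private boolean waitForZnode(ZooKeeper zk, String path, long timeoutMs) throws Exception {
                  long deadline = System.currentTimeMillis() + timeoutMs;
                  while (System.currentTimeMillis() < deadline) {
                      if (zk.exists(path, false) != null) {
                          return true;
                      }
                      Thread.sleep(500);
                  }
                  return false;
              }
          }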

          Next steps will be:

          • validate the approach
          • improve the declaration of nodes and applications (currently, the only metadata is a URI)
          • provide tooling for publishing a new application
          • test with multiple nodes
          • synchronize among multiple nodes?
          • support multiple protocols for fetching s4r archives
          Matthieu Morel added a comment -

          After getting valuable feedback from Kishore, I added some improvements to the deployment manager.

          I also added support for fetching S4Rs through HTTP (implemented with Netty).

          Updates are published here: https://github.com/matthieumorel/s4-piper/tree/deployment-manager

          Leo, note that I kept an intermediate step through a temporary file when fetching S4Rs. Passing around file streams is possible, but it complicates the code needed to maintain and close those streams. In addition, this module only loads applications and has no impact on their runtime performance, so I don't see a strong need to skip that intermediate step, at least for now (see the sketch below).
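
          For illustration only, the temporary-file pattern looks roughly like this (plain java.net instead of the actual Netty-based transfer, and made-up names; the point is just that the file is only exposed to the node once the download has completed):

          import java.io.InputStream;
          import java.net.URL;
          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.nio.file.StandardCopyOption;

          public class S4rFetchSketch {

              // Download the s4r into a temporary file in the apps directory, then rename it into place.
              public Path fetchToLocalDir(String s4rUrl, Path appsDir) throws Exception {
                  Path tmp = Files.createTempFile(appsDir, "s4r-download-", ".part");
                  try (InputStream in = new URL(s4rUrl).openStream()) {
                      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                  }
                  String fileName = s4rUrl.substring(s4rUrl.lastIndexOf('/') + 1);
                  // the rename is cheap and only happens after the whole archive has been fetched
                  return Files.move(tmp, appsDir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
              }
          }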

          Leo Neumeyer added a comment -

          Matthieu, Kishore:

          Is there any scenario under which a node can get out of sync? That is, not having exactly the same set of applications loaded. The most important requirement, I think, is to make sure all failures are atomic so we make it impossible for nodes to have different code. The nice thing about the symmetric approach is that code in all nodes is always the same. The dynamic loading is the only weak point that can break symmetry so we should make sure it is super robust. I assume ZK can always know that all the apps were successfully loaded, and if not, consider the node failed.

          great stuff!
          -leo

          Matthieu Morel added a comment -

          Yes, that's a possible case, which is why we'll have to synchronize among multiple nodes.

          The idea would be that (see the sketch below):
          1. when a new app is available, all nodes must first load this app
          2. once all nodes have loaded the app (and it has been verified in some way), the init and start routines can proceed. If some nodes cannot load the app before a given timeout, we should roll back and unload the app.
          3. during runtime, if a node fails, the failover procedure should bring a warm node into the cluster without re-fetching the application. This implies that standby nodes must also load the apps, and that the synchronization among nodes and apps should somehow include those standby nodes as well.
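
          A rough sketch of the synchronization step, using ephemeral znodes as per-node "loaded" markers (the znode layout and the notion of an expected node count are assumptions for illustration, not an agreed design):

          import java.util.List;

          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          public class AppLoadBarrierSketch {

              // Each node reports that it has loaded the app by creating an ephemeral child znode
              // (the persistent ".../loaded" parent is assumed to exist).
              public void reportLoaded(ZooKeeper zk, String appZnode, String nodeId) throws Exception {
                  zk.create(appZnode + "/loaded/" + nodeId, new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
              }

              // Wait until all expected nodes have reported; only then should init/start proceed.
              // On timeout the caller would roll back and unload the app.
              public boolean allNodesLoaded(ZooKeeper zk, String appZnode, int expectedNodes, long timeoutMs)
                      throws Exception {
                  long deadline = System.currentTimeMillis() + timeoutMs;
                  while (System.currentTimeMillis() < deadline) {
                      List<String> loaded = zk.getChildren(appZnode + "/loaded", false);
                      if (loaded.size() >= expectedNodes) {
                          return true;
                      }
                      Thread.sleep(500);
                  }
                  return false;
              }
          }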

          Leo Neumeyer added a comment -

          Nice!

          Once we start supporting an elastic cluster we will have to add functionality to mirror the state of the loaded apps in a new node. This shouldn't be a problem because the node can always be initialized using information from the ZK tables.

          kishore gopalakrishna added a comment -

          Nice, Matthieu, I will try it this week. I forgot to mention undeploy. Currently we compare the apps in ZK with the apps that are deployed in order to find new apps. We should also find the apps that are deployed but no longer in ZK and undeploy them (see the sketch below). Not a big deal, but we need it.

          We can also put metadata indicating that the app is loaded on the ZK ephemeral node. That way we can check whether a dependent app is loaded on all the nodes.
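
          That comparison is just a set difference in each direction; a tiny sketch (method names are illustrative):

          import java.util.HashSet;
          import java.util.Set;

          public class AppDiffSketch {

              // Apps registered in ZK but not yet deployed on this node: deploy them.
              public Set<String> appsToDeploy(Set<String> appsInZk, Set<String> deployedApps) {
                  Set<String> result = new HashSet<String>(appsInZk);
                  result.removeAll(deployedApps);
                  return result;
              }

              // Apps deployed on this node but no longer registered in ZK: undeploy them.
              public Set<String> appsToUndeploy(Set<String> appsInZk, Set<String> deployedApps) {
                  Set<String> result = new HashSet<String>(deployedApps);
                  result.removeAll(appsInZk);
                  return result;
              }
          }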

          Matthieu Morel made changes -
          Link This issue depends on S4-4 [ S4-4 ]
          Matthieu Morel added a comment -

          Applications are loaded using the dynamic application deployment mechanism.

          Matthieu Morel made changes -
          Attachment S4-24.patch [ 12509569 ]
          Matthieu Morel added a comment -

          Here is a patch with a complete implementation.

          It includes:

          • deployment of packaged applications from:
            • local apps dir for the node
            • remote file system
            • remote web server (through http)

          The app loading order is:

          1. an S4 node loads locally available apps,
          2. then downloads remote apps,
          3. then loads the remote apps

          Downloading and then loading remote apps is triggered by Zookeeper: you add a znode with the remote location of the S4R under /<cluster>/apps/<my new app>. Deployment can be triggered at any time (see the sketch below for how the source is selected).
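
          The choice between these sources boils down to dispatching on the URI scheme stored in the app znode; a sketch of that dispatch (names are illustrative, not the actual classes in the patch):

          import java.io.File;
          import java.net.URI;

          public class S4rSourceDispatchSketch {

              public void fetchS4r(URI s4rUri, File localAppsDir) throws Exception {
                  String scheme = s4rUri.getScheme() == null ? "file" : s4rUri.getScheme();
                  if ("file".equals(scheme)) {
                      // local apps dir, or a (possibly distributed) file system mounted on the node
                      copyFromFileSystem(s4rUri, localAppsDir);
                  } else if ("http".equals(scheme)) {
                      // remote web server
                      downloadOverHttp(s4rUri, localAppsDir);
                  } else {
                      throw new IllegalArgumentException("Unsupported s4r location: " + s4rUri);
                  }
              }

              private void copyFromFileSystem(URI uri, File dir) {
                  // copy the s4r file into the local apps directory
              }

              private void downloadOverHttp(URI uri, File dir) {
                  // HTTP download through a temporary file, as in the earlier sketch
              }
          }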

          In addition, I also completely automated the building of the test apps. Test apps now live in the /test-apps/ directory; each test app lies in its own directory and is automatically built and packaged using a Gradle script (adapted from Leo's initial version, but without requiring pre-installed S4 platform jars).

          Currently those scripts are duplicated across the directories; we could find a way to factor them out.

          In order to clarify this contribution (there is quite a bit of code, and I had to provide many changes to keep up with API changes in the piper branch), I rebased my changes and squashed them into a single commit.

          You must apply it like this:

          git am S4-24.patch --ignore-whitespace
          Leo Neumeyer added a comment -

          Matthieu, this looks great. I have a question: how do you guarantee that all nodes have exactly the same apps deployed? By the same apps I mean exactly the same binary. We need to make sure it is impossible to have inconsistencies between nodes. This requires that each node acknowledge that the app was successfully deployed, and only then should the server be instructed to start the app. We should put as many hurdles as needed before starting an app. It's always better to fail the launch than to launch with inconsistencies.

          Can you explain the potential inconsistency in deployNewApps(Set<String> newApps) ?

          Still going through this, more comments later.

          -leo

          Matthieu Morel added a comment -

          How do you guarantee that all nodes have exactly the same apps deployed?

          This would be part of the extensions to the deployment mechanism, as proposed initially (see the description). I don't see it as a blocking issue; the idea of S4-24 was really to get something working first, then improve.
          To implement this feature, we'll need extra coordination and integrity checks of the binaries (MD5, for instance); a sketch of such a check follows.
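
          The check itself is straightforward; a sketch, assuming the expected MD5 digest is published alongside the s4r URI (which is not part of the current patch):

          import java.io.InputStream;
          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.security.DigestInputStream;
          import java.security.MessageDigest;

          public class S4rIntegritySketch {

              // Compute the MD5 of the downloaded s4r and compare it with the published digest.
              public boolean verify(Path s4rFile, String expectedMd5Hex) throws Exception {
                  MessageDigest md5 = MessageDigest.getInstance("MD5");
                  byte[] buffer = new byte[8192];
                  try (InputStream in = new DigestInputStream(Files.newInputStream(s4rFile), md5)) {
                      while (in.read(buffer) != -1) {
                          // reading through the DigestInputStream updates the digest as a side effect
                      }
                  }
                  StringBuilder hex = new StringBuilder();
                  for (byte b : md5.digest()) {
                      hex.append(String.format("%02x", b));
                  }
                  return hex.toString().equalsIgnoreCase(expectedMd5Hex);
              }
          }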

          Can you explain the potential inconsistency in deployNewApps(Set<String> newApps) ?

          Indeed, that seems to have slipped through. I should rather rethrow the interruption and cancel the deployment process. I'll update the patch.

          Matthieu Morel added a comment -

          updated and pushed to the S4-24 branch https://git-wip-us.apache.org/repos/asf?p=incubator-s4.git;a=log;h=refs/heads/S4-24

          I'll merge that code into the piper tree later so we can move forward, unless there are objections.

          Matthieu Morel added a comment -

          merged into piper branch:
          https://git-wip-us.apache.org/repos/asf?p=incubator-s4.git;a=commitdiff;h=5e38aa2202a1955416d9f9adf3590fb86d40cb4a

          For next milestone, we'll want:

          • validation of apps
          • synchronization among cluster nodes before application startup
          • abort in case of error, timeout, etc.
          Matthieu Morel made changes -
          Link This issue is related to S4-77 [ S4-77 ]
          Matthieu Morel added a comment -

          We have a working version in the piper branch. I am therefore resolving this issue so that further improvements can now be discussed in S4-77

          Matthieu Morel made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Implemented [ 10 ]
          Tony Stevenson made changes -
          Workflow jira [ 12641865 ] no-reopen-closed, patch-avail [ 12711371 ]
          Gavin made changes -
          Link This issue depends on S4-4 [ S4-4 ]
          Gavin made changes -
          Link This issue depends upon S4-4 [ S4-4 ]

            People

            • Assignee: Matthieu Morel
            • Reporter: Matthieu Morel
            • Votes: 0
            • Watchers: 2
