Affects Version/s: Discovery API 1.0.0
Fix Version/s: None
The Sling Discovery API introduces the abstraction of a topology which contains (Sling) clusters and instances, and supports liveliness detection, leader election within a cluster and property propagation between the instances. As a default and reference implementation, a resource-based, out-of-the-box implementation was created (org.apache.sling.discovery.impl).
Pros and cons of the discovery.impl
Although the discovery.impl supports everything required in discovery.api, it has a few limitations. Here's a list of pros and cons:
No additional software required (leverages repository for intra-cluster communication/storage and HTTP-REST calls for cross-cluster communication)
Very small footprint
Perfectly suited for a single cluster or instance, and for small, rather stable hub-based topologies
Config-/deployment-limitations (aka embedded-limitation): connections between clusters are peer-to-peer and explicit. To span a topology, a number of instances must know (or be made known to) each other; changes in the topology typically require config adjustments to guarantee high availability of the discovery service
Except if a natural "hub cluster" exists that can serve as connection point for all "satellite clusters"
Other than that, it is less suited for large and/or dynamic topologies
Change propagation (for topology parts reported via connectors) is non-atomic and slow, hop-by-hop based
No guarantee on the order of TopologyEvents sent in individual instances, i.e. different instances might see changes in the topology (TopologyEvents) in different orders, but eventually the topology is guaranteed to be consistent
Robustness of discovery.impl wrt storm situations depends on robustness of underlying cluster (not a real negative but discovery.impl might in theory unveil repository bugs which would otherwise not have been a problem)
Rather new, little-tested code which might have issues with edge cases wrt network problems
Although partitioning-support is not a requirement per se, similar edge cases might exist wrt network delays/timing/crashes
Reusing a suitable 3rd party library
One idea for providing an additional implementation option for the discovery.api is to use a suitable 3rd party library.
The following is a list of requirements a 3rd party library must support:
liveliness detection: detect whether an instance is up and running
stable leader election within a cluster: stable describes the fact that a leader will remain leader until it leaves/crashes and no new, joining instance shall take over while a leader exists
stable instance ordering: the list of instances within a cluster is ordered and stable; new, joining instances are put at the end of the list
property propagation: propagate the properties provided within one instance to everybody in the topology. There are no timing requirements bound to this; the intention is not to use it for messaging but to announce config parameters to the topology
support large, dynamic clusters: configuration of the new discovery implementation should be easy and support frequent changes in the (large) topology
no single point of failure: there should be no single point of failure in the setup
embedded or dedicated: this might be a hot topic. Embedding a library has the advantage of not having to install anything additional. A dedicated service, on the other hand, requires additional handling in deployment. Embedding implies a peer-to-peer setup: nodes communicate peer-to-peer rather than via a centralized service. This IMHO is a negative for large topologies, which would typically span data centers. Hence a dedicated service could be seen as an advantage in the end.
due to the need for cross data-center deployments, the transport protocol must be TCP (or HTTP for that matter)
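To make the first three requirements more concrete, here is a minimal, self-contained Java sketch of how liveliness detection, stable instance ordering and stable leader election could fit together. It is only an illustration under stated assumptions: the class StableClusterView, its methods and the heartbeat/timeout model are hypothetical and not part of discovery.api, discovery.impl or any 3rd party library. It ignores distribution entirely and only shows the intended semantics within one view of a cluster.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical illustration only (not part of discovery.api or discovery.impl):
 * a single, local view of one cluster.
 */
public class StableClusterView {

    // Assumption: an instance counts as alive while its last heartbeat
    // is not older than this timeout.
    private final long timeoutMillis;

    // LinkedHashMap iterates in insertion order, and updating the value of an
    // existing key does not change its position. This gives the stable
    // ordering: known instances keep their slot, new ones go to the end.
    private final Map<String, Long> lastSeen = new LinkedHashMap<>();

    public StableClusterView(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records a heartbeat; an unknown instance joins at the end of the list. */
    public void heartbeat(String instanceId, long nowMillis) {
        lastSeen.put(instanceId, nowMillis);
    }

    /** Liveliness detection: drops instances whose heartbeat is too old. */
    public void expire(long nowMillis) {
        lastSeen.values().removeIf(seen -> nowMillis - seen > timeoutMillis);
    }

    /** The ordered list of live instances, oldest member first. */
    public List<String> instances() {
        return new ArrayList<>(lastSeen.keySet());
    }

    /**
     * Stable leader election: the leader is the first instance in the list.
     * Since joiners are appended, a joining instance cannot take over
     * while the current leader is still alive.
     */
    public Optional<String> leader() {
        return lastSeen.keySet().stream().findFirst();
    }
}
```

In this sketch an instance that times out and later comes back is treated as a new, joining instance and is appended at the end, so it does not reclaim leadership from whoever took over in the meantime.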