If I understand correctly, both certificate management (adding and removing certs) and the reconfig() API for changing quorum membership have to be fault-tolerant.
Would you mind clarifying what you mean by "fault-tolerant" here? Can you give an example of how a fault would break my patch?
Whether it is CA(s) with CRLs or a list of self-signed certs, my point is that the way an admin manages this information should also be fault-tolerant. Beyond that, it should work easily with the most likely next thing an admin would do, i.e. issue a reconfig() command to add, remove, or modify quorum peer configuration.
It would be nice to provide a way to manage reconfiguration of quorum peers when SSL is enabled, under the same weak assumptions that reconfig() needs when SSL is not enabled.
Providing a truststore and asking admins to manage it on their own for the entire quorum means this operation is not fault-tolerant, i.e. we expect them to first bring all members of the quorum to a consistent SSL config state and only then issue the reconfig() command.
The set of quorum IP addresses dictates which connections are currently allowed, and it has to be managed properly to ensure safety. Extending this idea, the set of SSL certs (self-signed or CA-signed) also dictates which connections are currently allowed. Hence, if one treats the Pair<IP set, SSL set> as the config and provides that to the reconfig() API, it should work. That is what my patch does for self-signed certs, and we could provide similar functionality for the CA cert case.
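To make the Pair<IP set, SSL set> idea concrete, here is a minimal Java sketch. It is only an illustration under assumptions: the class `QuorumSslConfig`, its fields, and its `reconfig` and `allows` methods are hypothetical names, not the patch's code and not ZooKeeper's reconfig() API. It shows the peer addresses and the trusted cert fingerprints being replaced together as one versioned config, so a connection is accepted only if both sets allow it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration: the quorum config as a pair (peer set, cert set). */
final class QuorumSslConfig {
    final long version;                  // bumped on every reconfiguration
    final Set<String> peerAddresses;     // e.g. "10.0.0.1:3888"
    final Set<String> certFingerprints;  // trusted cert fingerprints (placeholder strings here)

    QuorumSslConfig(long version, Set<String> peerAddresses, Set<String> certFingerprints) {
        this.version = version;
        this.peerAddresses = Collections.unmodifiableSet(new TreeSet<>(peerAddresses));
        this.certFingerprints = Collections.unmodifiableSet(new TreeSet<>(certFingerprints));
    }

    /** Returns a new config in which peers and certs change in one step. */
    QuorumSslConfig reconfig(Set<String> addPeers, Set<String> removePeers,
                             Set<String> addCerts, Set<String> removeCerts) {
        Set<String> peers = new TreeSet<>(peerAddresses);
        peers.addAll(addPeers);
        peers.removeAll(removePeers);
        Set<String> certs = new TreeSet<>(certFingerprints);
        certs.addAll(addCerts);
        certs.removeAll(removeCerts);
        return new QuorumSslConfig(version + 1, peers, certs);
    }

    /** A connection is allowed only if both the address and the cert are in the config. */
    boolean allows(String peerAddress, String certFingerprint) {
        return peerAddresses.contains(peerAddress) && certFingerprints.contains(certFingerprint);
    }
}

final class QuorumSslConfigDemo {
    public static void main(String[] args) {
        QuorumSslConfig v1 = new QuorumSslConfig(1,
                new TreeSet<>(Arrays.asList("10.0.0.1:3888")),
                new TreeSet<>(Arrays.asList("fp-a")));
        // Add a new peer and its cert in the same reconfiguration.
        QuorumSslConfig v2 = v1.reconfig(
                Collections.singleton("10.0.0.2:3888"), Collections.<String>emptySet(),
                Collections.singleton("fp-b"), Collections.<String>emptySet());
        System.out.println(v2.version + " " + v2.allows("10.0.0.2:3888", "fp-b"));
    }
}
```

In a real deployment the new pair would be proposed through reconfig() itself, so it would inherit whatever agreement and safety guarantees reconfig() already provides; the sketch only models the data being agreed on.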
Hence there is no new problem to solve here. We piggyback on the reconfig() API and provide a single API to manage this, and we get the fault tolerance and safety that reconfig() provides for this configuration for free.
Please consider the above comments and let me know what you think. I was not saying that your patch breaks fault tolerance; my point was that we should provide fault tolerance and safety for reconfiguration of the SSL configuration, be it self-signed or CA-based. There are use cases where a CA-cert-based cluster deployment might not be possible, so it would be nice to see ZooKeeper support both options while maintaining ease of use and providing the same guarantees that reconfig() does.
This is how I feel as well. I'm sure we can pretty quickly come up with a list of deficiencies in the current design, but I don't think anything is severe enough to justify a rewrite right now.
There are bugs like ZOOKEEPER-2164 and ZOOKEEPER-1678 to consider along with ZOOKEEPER-901. Either Netty or NIO will work, but Netty will make SSL easier to implement.
Doing this in phases is better; getting SSL sockets to work with reconfig() support is a great first step. The Netty patch I have also provides this support only for FLE, not ZAB. I did not find it easy to abstract the socket calls away from the ZAB code.