I don't think you can assume a normal distribution for latency. In practice it looks more Zipfian, or maybe bimodal because of cache misses. Also, a 5% error on a 95th percentile is pretty huge; IIUC, that means it could actually be reporting anything between the 90th and 100th percentile. The paper  by the same authors as your link discusses sampling for high percentiles.
I found , which I think is well-suited for our use case, since it can do approximate quantiles over a sliding time window. The space and time bounds seem to be O(reasonable log factors). Somehow mashing up  to use  would be ideal, but doing just  is probably okay too.
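To make the sliding-window idea concrete, here's a naive sketch (names and structure are mine, not from the linked paper): it keeps raw timestamped samples and evicts anything older than the window before answering a quantile query. The sketch papers get their log-factor space bounds by replacing the raw samples with compact quantile summaries; this exact version is just to illustrate the interface we'd want.

```python
import time
from collections import deque


class WindowedPercentile:
    """Naive sliding-window percentile estimator (illustrative only).

    Stores raw (timestamp, value) pairs and evicts samples older than
    the window. A real sketch would replace the raw samples with
    mergeable quantile summaries to get sublinear space.
    """

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock  # injectable for testing
        self.samples = deque()  # (timestamp, value), in arrival order

    def record(self, value):
        now = self.clock()
        self.samples.append((now, value))
        self._expire(now)

    def percentile(self, q):
        """q in [0, 1]; quantile of samples within the window, or None."""
        self._expire(self.clock())
        if not self.samples:
            return None
        values = sorted(v for _, v in self.samples)
        idx = min(int(q * len(values)), len(values) - 1)
        return values[idx]

    def _expire(self, now):
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()
```

Obviously this is O(n log n) per query and O(n) space, so it's only a stand-in for the interface; the point of the sketch is exactly to avoid holding the raw window.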