Yep, I've been looping a custom version of the HDFS-nothing-safe test that, among other things, only does adds (no deletes).
Update: when I reverted my custom changes to the chaos test (so that it also did deletes), I got a high number of shard-out-of-sync errors... seemingly even more than before, so I've been trying to track those down. What I saw were issues that did not look related to PeerSync... documents were missing from a shard that had replicated from the leader while buffering updates, yet I saw those same missing documents come in and get buffered, which points to transaction log buffering or replay issues.
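To make the expected behavior concrete, here is a toy model of the buffer-then-replay sequence described above. It is only a sketch under my own assumptions, not Solr code: the `Replica` class, its methods, and `missing_docs` are hypothetical names, and it models adds only. The point is the invariant: a document that arrives while the replica is buffering must be in the index after replay, even if the snapshot copied from the leader did not contain it.

```python
class Replica:
    """Hypothetical toy replica: buffers adds while it copies a snapshot."""

    def __init__(self):
        self.index = set()      # doc ids currently searchable on the replica
        self.buffer = []        # doc ids received while buffering
        self.buffering = False

    def start_buffering(self):
        self.buffering = True

    def receive(self, doc_id):
        # While buffering, updates are only logged, not applied to the index.
        if self.buffering:
            self.buffer.append(doc_id)
        else:
            self.index.add(doc_id)

    def install_snapshot(self, snapshot):
        # Replication replaces the local index with the leader's copy.
        self.index = set(snapshot)

    def replay_buffered(self):
        # Apply everything that was buffered, then leave buffering mode.
        for doc_id in self.buffer:
            self.index.add(doc_id)
        self.buffer.clear()
        self.buffering = False


def missing_docs(leader_index, replica):
    """Return the doc ids the leader has but the replica lacks, sorted."""
    return sorted(set(leader_index) - replica.index)


if __name__ == "__main__":
    replica = Replica()
    replica.start_buffering()
    replica.receive(3)                  # doc 3 arrives during replication
    replica.install_snapshot({1, 2})    # snapshot was taken before doc 3
    print(missing_docs({1, 2, 3}, replica))   # doc 3 is only in the buffer
    replica.replay_buffered()
    print(missing_docs({1, 2, 3}, replica))   # nothing should be missing now
```

If a document shows up in the buffer but is still missing after replay, the loss happened in buffering or replay rather than in the sync step, which is the pattern I was seeing.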
Then I realized that I had tested "adds only" before committing, and tested the normal test after committing and doing a "git pull". In between those two runs came SOLR-8575, which was a fix to the HDFS tlog!
I've been looping the test for a number of hours with those changes reverted, and I haven't seen a shards-out-of-sync failure so far. I've also done a quick review of SOLR-8575, but didn't see anything obviously incorrect. The changes in that issue may just be uncovering another bug (due to timing) rather than causing one... too early to tell.
I've also been running the non-HDFS version of the test for over a day, and have seen no inconsistent-shard failures there either.