In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures.
When a machine name contains "-", it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue:
deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss)
You can see that replication use those weird paths constructed from deadRegionServers to check a file existence
This happened in the recent test failure in http://18.104.22.168/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false
After 10 times retries, replication source gave up and move on to the next file. Data loss happens.
Since lots of EC2 machine names contain "-" including our Jenkin servers, this is a high impact issue.