Affects Version/s: 0.90.0
Fix Version/s: None
We ran some tests on our cluster and got back reports of WrongRegionException on some rows. After looking at the data, we saw that we have "gaps" between regions, like this:
Fact: 28 regions were reported as having an empty HRegionInfo and were then deleted from the .META. table.
Fact: we recovered our data entirely, without any issues, by running the script that restores .META. from the table contents (bin/add_table.rb).
Fact: on our regionservers, we have three days with no logs. To the best of our knowledge, the machines were not rebooted and the processes were running. During these three days, the only entry in the master logs, repeated every second, is a .META. scan:
In the master logs, we see a pretty normal evolution: region r0 is split into r1 and r2. Now r1 exists and is good, but r2 no longer exists in .META. because it was reported as having an empty HRegionInfo. The only weird thing in the master logs is that the message about updating the region in meta comes up twice:
Attached you will find the entire forensics work, with explanations, in a text file.
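For anyone who wants to check their own table for the kind of "gap" described above, here is a minimal sketch in Python. It assumes the (start key, end key) pairs of all regions of one table have already been extracted from .META. by some other means; it does not talk to HBase itself. The function name find_gaps and the sample keys are hypothetical. An empty key stands for the start or end of the table, as in HBase.

```python
from typing import List, Tuple

# (start_key, end_key); b"" as start means "beginning of table",
# b"" as end means "end of table".
Region = Tuple[bytes, bytes]


def find_gaps(regions: List[Region]) -> List[Tuple[bytes, bytes]]:
    """Return (gap_start, gap_end) key ranges not covered by any region."""
    gaps = []
    covered_to = b""  # every key below this one is covered so far
    for start, end in sorted(regions):
        if covered_to is None:
            break  # an earlier region already runs to the end of the table
        if start > covered_to:
            gaps.append((covered_to, start))
        if end == b"":
            covered_to = None  # open-ended region: rest of the table is covered
        elif end > covered_to:
            covered_to = end
    if covered_to is not None:
        gaps.append((covered_to, b""))  # nothing covers the tail of the table
    return gaps


if __name__ == "__main__":
    # Hypothetical keys: the region that should cover [d, f) is missing.
    sample = [(b"", b"b"), (b"b", b"d"), (b"f", b"")]
    for gap_start, gap_end in find_gaps(sample):
        print("gap between", gap_start, "and", gap_end)
```

A row whose key falls inside a reported range has no region to serve it, which would be consistent with the WrongRegionException we saw.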
Our entire cluster was in a really weird state. All the regionservers are missing logs for three days, and to the best of our knowledge they were running, and in this time the master has ONLY .META. scan messages, every second, reporting 6 regionservers live, out of 7 total.
Also, during this time, we get filesystem closed messages on a regionserver with one of the missing regions. This is after the gap in the logs.
How we suppose the data in .META. was lost
1. Race conditions in ServerManager / RegionManager. In our logs, we have about 3 or 4 CMEs (ConcurrentModificationException) in these classes (see the attached file)
2. Data loss in HDFS. On a regionserver, we get filesystem closed messages
3. Data could not be read from HDFS (highly unlikely, there are no weird data read messages)
4. Race condition leading to loss of the HRegionInfo from memory, and then persisted as empty.