Affects Version/s: None
Fix Version/s: None
In our online cluster, we found that a daughter region's reference file can point to a nonexistent hfile. When such a region is moved by the balancer, the region open fails with a FileNotFoundException and a lot of errors are thrown.
How the problem happens:
1. Region R1 is on server S1 and goes through a compaction: say storefile sf1 is compacted into another file at time t1.
2. S1 enters a long full GC (about 470s in our case) at t2 (t1 + 300s).
3. At t2 + 180s the RS's ZK session expires, so the master considers the RS dead, takes R1 offline from S1, and reassigns R1 to S2.
4. S2 finds that R1 is too large and requests a split. R1 is split into R2 and R3, and both hold a reference to sf1.
5. S1 finishes the full GC at t2 + 470s. Before it reports to the master, CompactedHFilesDischarger removes the compacted file sf1 from R1 (from S1's point of view, R1 is still online on S1).
6. As a result, R2 and R3 hold references to a storefile that no longer exists, which leads to the error we came across.
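The timeline above can be replayed with a toy model. This is only a sketch for illustration, not HBase code: the names `store_files`, `references`, `split_on_s2` and `discharge_on_s1` are hypothetical, and `sf2` is an assumed name for the compaction output.

```python
# Toy model of the race. Not HBase code; all names are hypothetical.
T1 = 0                       # sf1 is compacted away on S1 (seconds)
T2 = T1 + 300                # full GC starts on S1
SESSION_EXPIRED = T2 + 180   # master declares S1 dead and reassigns R1 to S2
GC_END = T2 + 470            # S1 resumes, still believing it hosts R1

# Time S2 has to open and split R1 before S1 wakes up.
split_window = GC_END - SESSION_EXPIRED

store_files = {"sf1", "sf2"}  # R1 store dir; sf2 is the assumed compaction output
references = {}               # daughter region -> hfile it references


def split_on_s2():
    # Per step 4, both daughters end up referencing sf1.
    for daughter in ("R2", "R3"):
        references[daughter] = "sf1"


def discharge_on_s1():
    # The stale S1 removes the compacted file without checking ownership.
    store_files.discard("sf1")


split_on_s2()       # happens inside the window
discharge_on_s1()   # happens at GC_END, before S1 reports to the master

# Daughters whose referenced hfile is gone from the store dir.
dangling = sorted(r for r, f in references.items() if f not in store_files)
print(split_window, dangling)
```

With the numbers from our case, S2 has 290 seconds to open and split R1 before S1 resumes, which is more than enough.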
Possible solutions:
1. Write a WAL marker before removing the hfile from the store.
SSH deletes the dead RS's log dir, so writing the WAL marker fails on a server that the master has already declared dead, and the removal does not proceed.
However, this is not absolutely reliable, because the RS can enter a full GC after writing the marker. There is no way to perform these two actions atomically.
It is not 100% reliable, but it is simple.
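A rough sketch of solution 1 and of its remaining gap. The `RegionServer` class, `LogDirDeleted`, `ssh_cleanup` and the `pause` hook are all hypothetical stand-ins; a real WAL write and a real GC pause are only modelled by a flag and a callback.

```python
class LogDirDeleted(Exception):
    """Raised when the WAL dir was already removed by SSH (hypothetical)."""


class RegionServer:
    def __init__(self):
        self.log_dir_exists = True
        self.store_files = {"sf1"}

    def write_wal_marker(self):
        # Stand-in for a WAL append: it fails once the log dir is gone.
        if not self.log_dir_exists:
            raise LogDirDeleted("WAL dir removed by SSH")

    def remove_compacted_file(self, name, pause=None):
        self.write_wal_marker()          # fence: must succeed first
        if pause is not None:
            pause()                      # e.g. a full GC between the two steps
        self.store_files.discard(name)   # the actual removal


def ssh_cleanup(rs):
    # Stand-in for SSH deleting the dead server's log dir.
    rs.log_dir_exists = False


# Case A: SSH ran before the removal, so the marker write fences it off.
a = RegionServer()
ssh_cleanup(a)
try:
    a.remove_compacted_file("sf1")
    fenced = False
except LogDirDeleted:
    fenced = True

# Case B: the pause (and SSH) happens after the marker was written,
# so the stale server still removes the file.
b = RegionServer()
b.remove_compacted_file("sf1", pause=lambda: ssh_cleanup(b))
race_lost = "sf1" not in b.store_files
```

Case A is the situation from this issue and is handled. Case B is the non-atomic window mentioned above.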
2. A possibly reliable solution.
When removing an hfile from the store dir, first move it to a special RS-level dir, and then move it to the archive dir.
SSH deletes that RS-level dir, so on a dead server the removal of compacted files fails at the first step. This is reliable but complicated.
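A minimal sketch of the two-step move, using the local file system as a stand-in for HDFS (rename semantics on HDFS may differ, so treat this only as an illustration). The directory names and the `remove_compacted_file` helper are hypothetical.

```python
import os
import tempfile


def remove_compacted_file(store_dir, staging_dir, archive_dir, name):
    # Step 1: move into the RS-level staging dir. Fails if SSH deleted it.
    staged = os.path.join(staging_dir, name)
    os.rename(os.path.join(store_dir, name), staged)
    # Step 2: move from the staging dir to the archive dir.
    os.rename(staged, os.path.join(archive_dir, name))


root = tempfile.mkdtemp()
store = os.path.join(root, "store")
staging = os.path.join(root, "rs-staging")
archive = os.path.join(root, "archive")
for d in (store, staging, archive):
    os.mkdir(d)
open(os.path.join(store, "sf1"), "w").close()

# SSH deletes the dead server's staging dir before the stale RS resumes.
os.rmdir(staging)

try:
    remove_compacted_file(store, staging, archive, "sf1")
    removed = True
except FileNotFoundError:
    removed = False

# sf1 stays in the store dir, so the daughters' references remain valid.
still_present = os.path.exists(os.path.join(store, "sf1"))
```

Here the fence and the removal are the same rename, so there is no gap between checking and acting at the first step.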