One of our customers was running SyncTable from a 1.2-based cluster, where the SyncTable map tasks were opening scanners on a 2.4-based cluster in order to compare the two clusters. A few of the map tasks failed with a DoNotRetryException caused by a FileNotFoundException that propagated all the way up to the client:
We can see in the RS logs that the above file had only recently been created as the result of a memstore flush, and that a compaction was triggered shortly afterwards:
I believe this is an unlucky scenario where the compaction discharger moved the compacted-away files while the StoreFileScanner was being created, but before the refCounter on the file reader had been updated. We couldn't reproduce this on a real cluster, but I could emulate it with a unit test that artificially induces a delay in the StoreFileScanner creation for scans that are not for compactions. One possible fix is to update the reader refCounter as soon as we get the files for the StoreFileScanner instances we are creating.
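
To make the suspected interleaving easier to follow, here is a minimal, single-threaded sketch of the race and of the proposed fix. It is not HBase code: `FakeStoreFile`, `discharge` and `scannerOpens` are hypothetical names made up for illustration, and the "delay" is modelled by simply running the discharger between file selection and scanner opening. It only assumes that the discharger archives a compacted-away file when its reference count is zero.

```java
import java.io.FileNotFoundException;
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountRaceSketch {

    /** Hypothetical stand-in for a store file plus its reader; not an HBase class. */
    static final class FakeStoreFile {
        final AtomicInteger refCount = new AtomicInteger();
        boolean compactedAway;
        boolean archived;

        void open() throws FileNotFoundException {
            if (archived) {
                throw new FileNotFoundException("file was moved to the archive");
            }
        }
    }

    /** Hypothetical discharger: archives compacted-away files that nobody references. */
    static void discharge(FakeStoreFile file) {
        if (file.compactedAway && file.refCount.get() == 0) {
            file.archived = true;
        }
    }

    /**
     * Simulates one scanner creation with a compaction and a discharger run
     * landing in the window between selecting the file and opening it.
     * Returns true if the scanner could open the file.
     */
    public static boolean scannerOpens(boolean refAtSelection) {
        FakeStoreFile file = new FakeStoreFile();

        // Step 1: the file is selected for the scan.
        if (refAtSelection) {
            file.refCount.incrementAndGet(); // proposed fix: take the reference here
        }

        // Step 2: the delay window. The compaction finishes and the discharger runs.
        file.compactedAway = true;
        discharge(file);

        // Step 3: the scanner is actually created.
        if (!refAtSelection) {
            file.refCount.incrementAndGet(); // too late: the file may already be gone
        }
        try {
            file.open();
            return true;
        } catch (FileNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("ref taken at scanner creation: " + scannerOpens(false));
        System.out.println("ref taken at file selection:   " + scannerOpens(true));
    }
}
```

In this model the late increment lets the discharger archive the file and the open fails, while taking the reference at selection time keeps the file in place until the scanner is done with it.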