Unused imports in FsDatasetImpl and FsVolumeImpl
Do we still need to rename getExecutor to getCacheExecutor in FsVolumeImpl?
Well, the name of the variable is cacheExecutor; shouldn't the getter be getCacheExecutor?
State#isUncaching() is unused
Could use a core pool size of 0 for uncachingExecutor; I don't think it's that latency-sensitive.
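For illustration, a minimal sketch of a zero-core-pool executor in plain java.util.concurrent (the names here are hypothetical, not the actual FsDatasetCache code). Since modern ThreadPoolExecutor versions will still spin up a worker for a queued task even when corePoolSize is 0, no thread sits idle between uncache requests:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class UncachingExecutorSketch {
    public static void main(String[] args) throws Exception {
        // Core pool size 0: no threads kept alive while idle, so a rarely
        // used executor costs nothing between tasks.
        ThreadPoolExecutor uncachingExecutor = new ThreadPoolExecutor(
                0,                        // corePoolSize: keep no idle threads
                1,                        // maximumPoolSize
                60L, TimeUnit.SECONDS,    // idle workers die after 60s
                new LinkedBlockingQueue<Runnable>());
        // Even with corePoolSize 0, execute() starts a worker for this task.
        uncachingExecutor.execute(() -> System.out.println("uncache task ran"));
        uncachingExecutor.shutdown();
        uncachingExecutor.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("pool size after shutdown: "
                + uncachingExecutor.getPoolSize());
    }
}
```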
usedBytes javadoc: "more things to cache that we can't actually do because of" is an awkward turn of phrase; maybe say "assign more blocks than we can actually cache because of" instead.
MappableBlock#load javadoc: visibleLeng parameter should be renamed to length. The return value is now also a MappableBlock, not a boolean.
Key: rename id to blockId for clarity? Or add a bit of javadoc.
Naming the HashMap replicaMap is confusing since there's already a datanode ReplicaMap class. Maybe mappableBlockMap instead?
Caching can fail if the underlying block is invalidated in between getting the block's filename and running the CacheTask. It'd be nice to distinguish this race from a real error for when we do metrics (and also quash the exception).
I just added a catch block for the FileNotFoundException that both getBlockInputStream and getMetaDataInputStream can throw. I still think we want to log this exception, but at INFO rather than WARN. We will retry sending the DNA_CACHE command (once 5366 is committed), so hitting this narrow race while a block is being moved is just a temporary setback.
If we get a DNA_CACHE for a block that is currently being uncached, shouldn't we try to cancel the uncache and re-cache it? The NN will resend the command, but it'd be better to not have to wait for that.
We don't know how far along the uncaching process is. We can't cancel it if we already called munmap. We could allow cancellation of pending uncaches by splitting UNCACHING into UNCACHING_SCHEDULED and UNCACHING_IN_PROGRESS, and only allowing cancellation on the former. This might be a good improvement to make as part of 5182. But for now, the uncaching process is really quick, so let's keep it simple.
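A rough sketch of how the suggested state split might look; the enum values and helper below are hypothetical (not the actual FsDatasetCache states), just to show where cancellation would remain legal:

```java
public class UncacheStates {
    // Hypothetical split of the single UNCACHING state: a pending uncache
    // (Runnable still queued in the executor) could be cancelled, but one
    // that has already started may have called munmap and cannot be.
    enum State {
        CACHING,
        CACHING_CANCELLED,
        CACHED,
        UNCACHING_SCHEDULED,   // queued; cancellation still possible
        UNCACHING_IN_PROGRESS  // munmap may have happened; too late
    }

    // Cancellation is only safe before the uncache task actually starts.
    static boolean canCancelUncache(State s) {
        return s == State.UNCACHING_SCHEDULED;
    }

    public static void main(String[] args) {
        System.out.println(canCancelUncache(State.UNCACHING_SCHEDULED));
        System.out.println(canCancelUncache(State.UNCACHING_IN_PROGRESS));
    }
}
```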
Could this be written with value.state == State.CACHING_CANCELLED instead? Would be clearer, and I believe equivalent since uncacheBlock won't set the state to UNCACHING if it's CACHING or CACHING_CANCELLED.
well, if value is null, you don't want to be dereferencing that, right?
Even better would be interrupting a CachingTask on uncache since it'll save us I/O and CPU.
That kind of interruption logic gets complex quickly. I'd rather save that for a potential performance improvement JIRA later down the line. I also think that if we're thrashing (cancelling caching requests right and left) the real fix might be on the NameNode anyway...
Could we combine CACHING_CANCELLED into UNCACHING? It seems like CachingTask could check for UNCACHING in that if statement at the end and uncache, same sort of change for uncacheBlock.
I would rather not do that, since right now we can look at entries in the map and instantly know that anything in state UNCACHING has an associated Runnable scheduled in the Executor. cancelled is not really the same thing as uncaching since in the former case, there is actually nothing to do!
I think using a switch/case on prevValue.state in uncacheBlock would be clearer.
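As a sketch of the suggestion, with a simplified hypothetical State enum and return strings standing in for the real side effects (the actual uncacheBlock mutates the map entry and schedules tasks):

```java
public class UncacheBlockSwitch {
    enum State { CACHING, CACHING_CANCELLED, CACHED, UNCACHING }

    // Hypothetical shape of the suggested switch; one arm per previous state
    // makes the handling of each case explicit at a glance.
    static String uncacheBlock(State prevState) {
        switch (prevState) {
        case CACHING:
            // Caching still in flight: flag it so the CachingTask
            // cleans up when it notices the cancellation.
            return "mark CACHING_CANCELLED";
        case CACHED:
            // Fully cached: schedule an uncaching task on the executor.
            return "set UNCACHING and schedule uncache task";
        case CACHING_CANCELLED:
        case UNCACHING:
            // Already being torn down; nothing more to do.
            return "no-op";
        default:
            throw new IllegalStateException("unknown state " + prevState);
        }
    }

    public static void main(String[] args) {
        System.out.println(uncacheBlock(State.CACHED));
        System.out.println(uncacheBlock(State.CACHING));
        System.out.println(uncacheBlock(State.UNCACHING));
    }
}
```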
6,000,000 milliseconds seems like a very long test timeout. Can we change them to, say, 60,000?
The general idea is to do anything that can time out inside GenericTestUtils#waitFor blocks, since waitFor gives useful backtraces and messages on timeout, unlike the generic test timeouts. I wanted to avoid the scenario where the test-level timeouts kick in, but out of paranoia I set the overall test timeout to 10 minutes in case something else unexpectedly hung. I wanted to avoid the issues we've had with zombie tests in Jenkins causing heisenfailures.
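To illustrate the pattern, here is a self-contained stand-in with the same shape as GenericTestUtils#waitFor (check supplier, poll interval, timeout); it is not Hadoop's actual implementation, just the polling idea:

```java
import java.util.function.Supplier;

public class WaitForSketch {
    // Minimal stand-in for GenericTestUtils.waitFor: poll a condition and
    // fail with a descriptive message on timeout, rather than relying on a
    // blunt test-level timeout that gives no context.
    static void waitFor(Supplier<Boolean> check, long intervalMs, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!check.get()) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError(
                    "Timed out after " + timeoutMs + "ms waiting for condition");
            }
            Thread.sleep(intervalMs);
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // Condition becomes true after roughly 50ms.
        waitFor(() -> System.currentTimeMillis() - start > 50, 10, 60000);
        System.out.println("condition met");
    }
}
```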
Are these new log prints for sanity checking? Maybe we can just remove them.
it's more so you can see what's going on in the sea of log messages; otherwise it becomes hard to debug.
Some of the comments seem to refer to a previous patch version that used a countdown latch.
It's unclear what this is testing beyond caching and then uncaching a bunch of blocks. Can we check for log prints to see that it's actually cancelling as expected? Any other ideas for definitively hitting cancellation?
we could add callback hooks to more points in the system and set up a bunch of countdown latches (or similar), but that seems like overkill here. I have looked at the logs and I do see cancellation; you'd have to hit a GC or something to avoid it in this test, I think.