Sanjay, thanks for your comments! I need to look more at
HDFS-2832, but I think we've got some nice overlap. In particular, I agree that cache would be just another DN Storage.
Block reports will indicate the storage type.
I'm okay with this, but our initial design proposes separate heartbeats, since cache reports might need to tick on a different interval. You might want frequent cache reports if datanodes are doing their own LRU eviction, but you'd want to adaptively throttle them back when the NN is under load, since cache report processing can be expensive.
Separate heartbeats per storage could definitely be added later, so consider this a later-stage optimization.
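To make the throttling idea concrete, here's a minimal sketch of an adaptive cache-report interval. The class name and the NN load hint piggybacked on the heartbeat response are assumptions for illustration, not existing DataNode code:

{code:java}
/**
 * Sketch only: back off cache reports when the NN signals load,
 * and recover quickly so DN-side LRU state stays fresh.
 */
public class CacheReportScheduler {
  private final long baseIntervalMs;  // quick reports for fresh LRU info
  private final long maxIntervalMs;   // ceiling when the NN is busy
  private long currentIntervalMs;

  public CacheReportScheduler(long baseIntervalMs, long maxIntervalMs) {
    this.baseIntervalMs = baseIntervalMs;
    this.maxIntervalMs = maxIntervalMs;
    this.currentIntervalMs = baseIntervalMs;
  }

  /** Hypothetical load hint carried on the heartbeat response. */
  public void onHeartbeatResponse(boolean nnUnderLoad) {
    if (nnUnderLoad) {
      // Exponential backoff: an overloaded NN sees fewer cache reports.
      currentIntervalMs = Math.min(currentIntervalMs * 2, maxIntervalMs);
    } else {
      // Recover immediately so cache state doesn't go stale.
      currentIntervalMs = baseIntervalMs;
    }
  }

  /** Delay before the next cache report. */
  public long nextDelayMs() {
    return currentIntervalMs;
  }
}
{code}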
NN will order the replica locations based on closeness and speed;
This is tricky because it depends on the network topology and workload. I don't want my single cached replica to get hammered by the entire cluster, but going to in-rack memory might still beat local disk. I figure clients should be able to provide a configurable ordering policy to their DFSClients.
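For concreteness, here's a minimal sketch of what a pluggable policy could look like on the client side. ReplicaLocation, LocationType, and ReplicaOrderingPolicy are made-up names for illustration, not existing DFSClient APIs:

{code:java}
import java.util.List;

// Illustrative types only; not existing HDFS classes.
enum LocationType { MEMORY, DISK }

class ReplicaLocation {
  final String datanode;    // e.g. "dn1/rack3"
  final LocationType type;  // cached in memory vs. on disk
  ReplicaLocation(String datanode, LocationType type) {
    this.datanode = datanode;
    this.type = type;
  }
}

/** Client-configurable hook for reordering the NN-provided locations. */
interface ReplicaOrderingPolicy {
  /** Reorder candidate locations in place, best-first for this client. */
  void order(List<ReplicaLocation> locations);
}
{code}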
I think we also still need the isCached flag for scheduling. Hypothetically, MR might always want to place tasks on a memory replica over a disk replica, so we could sort memory replicas first, then disk replicas. However, this squishes the existing topology-based ordering used by DFSClients, and all our DFSClients end up hammering the cached replica at read time.
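A sketch of that memory-first sort, using the hypothetical types above. One thing worth noting: a stable sort keeps the NN's topology ordering intact within each group, which softens (though doesn't eliminate) the hammering problem:

{code:java}
import java.util.Comparator;
import java.util.List;

/** Memory-first ordering: cached replicas ahead of disk replicas. */
class MemoryFirstPolicy implements ReplicaOrderingPolicy {
  @Override
  public void order(List<ReplicaLocation> locations) {
    // List.sort is stable, so the topology-based order the NN produced
    // is preserved within the memory group and within the disk group.
    locations.sort(Comparator.comparingInt(
        loc -> loc.type == LocationType.MEMORY ? 0 : 1));
  }
}
{code}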
Note that even without a smarter DFSClient, we can get a lot of benefit just by making schedulers place tasks for memory locality, since our big win is going to be local memory reads. Colin's working on this in
NN will not count RAM replicas towards the normal replica count - this is one area where the RAM replicas are treated differently.
This can support a usage model where the RAM replicas are at all or only some of the disk replica locations.
+1, let's design for a future where a cached replica might not be backed by a disk replica. As Colin notes above, memory HSM is not easy, but the code should be flexible.
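The accounting side of "don't count RAM replicas" is easy to sketch. The enum and helper below are illustrative only; the real storage typing from HDFS-2832 and the NN's block manager data structures may look quite different:

{code:java}
import java.util.List;

// Illustrative only; not the actual HDFS-2832 storage types.
enum StorageKind { DISK, RAM }

class StoredReplica {
  final StorageKind kind;
  StoredReplica(StorageKind kind) { this.kind = kind; }
}

class ReplicationAccounting {
  /** RAM replicas never count toward the configured replication factor. */
  static int durableReplicaCount(List<StoredReplica> replicas) {
    int count = 0;
    for (StoredReplica r : replicas) {
      if (r.kind != StorageKind.RAM) {
        count++;
      }
    }
    return count;
  }

  /** Losing a RAM replica alone should never trigger re-replication. */
  static boolean isUnderReplicated(List<StoredReplica> replicas,
                                   int replicationFactor) {
    return durableReplicaCount(replicas) < replicationFactor;
  }
}
{code}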