The case of a local short-circuit read holding the open file is interesting... does this pin the memory until the (possibly misbehaving) client process closes the socket / FD?
Single replicas? Why would one want to triple-replicate discardable memory? One should at least have the option to keep only a single local copy in HDFS.
If we cannot prevent random-access writes to DDM (we could presumably limit this in the client API), then I don't think we can checksum or replicate until a file is closed. My gut is that delaying both until close is the right call...
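To make the checksum-at-close idea concrete, here's a minimal sketch in plain JDK (java.util.zip, no HDFS types): bytes are accumulated through a CheckedOutputStream, but the checksum is only read out and trusted once the writer closes the stream. The class and method names are illustrative, not anything in the HDFS codebase.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

// Sketch of checksum-on-close: the CRC is accumulated as bytes stream
// through, but only published after close. A random-access rewrite of
// earlier bytes would invalidate the running CRC, which is exactly why
// the checksum can't be finalized before the file is closed.
public final class ChecksumOnClose {
    public static long writeAndChecksum(byte[] data) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        CheckedOutputStream out = new CheckedOutputStream(sink, new CRC32());
        out.write(data);   // writes flow through, updating the CRC
        out.close();       // checksum is only trusted after close
        return out.getChecksum().getValue();
    }
}
```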
How are discarded or lost (node fails) blocks / files handled? Do the names remain in the NN and get reported by FSCK and other operations? We want to be sure this doesn't add work for operators.
Can we make these files transient, like ZK ephemeral nodes?
Once one assumes discardable files don't need replication, one can think about allocating only an arena name (think: directory) in the NN and creating the individual files only at the DN, limiting NN interaction. This would be a lot faster. (You could still have remote access via .../<ARENA>/<DN-NAME>/<name>-style URLs.) This would vastly reduce NN interactions, which is probably good for both latency and scalability. You could then imagine using this mechanism for MR / Tez / Spark shuffle files, which has been a long-term project goal... Maybe we should break this idea out into another JIRA? Happy to chat if folks want to flesh this out.
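A rough sketch of the naming scheme I have in mind, assuming a hypothetical DDM root directory — only the arena path would be registered with the NN, while each DN mints file names under its own prefix locally, so a remote reader can still address a file by full path. All names here (the root, arena, and DN labels) are illustrative, not existing HDFS API.

```java
// Hypothetical arena naming scheme: the NN knows only the arena
// directory; everything below it is created and resolved DN-side.
public final class ArenaPaths {
    // Illustrative root for discardable distributed memory files.
    private static final String DDM_ROOT = "/ddm";

    // The one path registered in the NameNode when an arena is created.
    public static String arenaPath(String arena) {
        return DDM_ROOT + "/" + arena;
    }

    // Full path a remote client would use: the NN resolves only the
    // arena prefix; the named DN resolves the remainder.
    public static String filePath(String arena, String dnName, String file) {
        return arenaPath(arena) + "/" + dnName + "/" + file;
    }
}
```

The point of the split is that per-file creates never touch the NN, so shuffle-style workloads producing many short-lived files pay NN cost once per arena rather than once per file.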
Involving YARN in HDFS resource management is interestingly circular. Is this needed? One would want the right abstraction to allow other solutions to be applied in YARN-less deployments.