Thanks, Colin, for the summary. I'd like to address some of your concerns and questions:
security / correctness concerns: it's easy to make a mistake that could bring down the NameNode or entire FS
I agree that's the cost; developers must be careful and ensure the quality of the code.
non-generality to systems using s3 or another FS in addition to HDFS
Yes, it's only applicable to HDFS. I guess snapshots are only applicable to HDFS too (I could be wrong here, as I haven't read the snapshot code), so it shouldn't cause much confusion.
issues with federation (which NN does the cleanup? How do you decide?)
Each NN only takes care of cleaning up the files/directories in its own namespace. If we consider TTL as an attribute attached to files/directories, there is not much difference between a federated and a non-federated configuration.
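To make the "TTL as an attribute" idea concrete, here is a minimal sketch in plain Java. It is not HDFS code: `TtlNamespace`, `Meta`, `add`, and `expiredPaths` are hypothetical names made up for illustration, and measuring the TTL from the modification time is an assumption, not something decided in this proposal. Each instance stands in for one NN's namespace, so a scan only ever sees paths that NN owns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the namespace owned by a single NameNode. */
class TtlNamespace {
    /** Hypothetical per-path metadata: modification time and optional TTL, in ms. */
    static final class Meta {
        final long mtimeMs;
        final Long ttlMs; // null means no TTL is set on this path

        Meta(long mtimeMs, Long ttlMs) {
            this.mtimeMs = mtimeMs;
            this.ttlMs = ttlMs;
        }
    }

    private final Map<String, Meta> entries = new LinkedHashMap<>();

    /** Registers a path; an invalid TTL is rejected immediately, back to the caller. */
    void add(String path, long mtimeMs, Long ttlMs) {
        if (ttlMs != null && ttlMs <= 0) {
            throw new IllegalArgumentException("TTL must be positive: " + ttlMs);
        }
        entries.put(path, new Meta(mtimeMs, ttlMs));
    }

    /** Returns the paths in THIS namespace whose TTL has elapsed at nowMs. */
    List<String> expiredPaths(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Meta> e : entries.entrySet()) {
            Meta meta = e.getValue();
            // Assumption: age is measured from the modification time.
            if (meta.ttlMs != null && nowMs - meta.mtimeMs >= meta.ttlMs) {
                expired.add(e.getKey());
            }
        }
        return expired;
    }
}
```

With two instances, one per namespace, the question "which NN does the cleanup?" is answered by construction: each instance only reports the expired paths it holds itself.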
complexities surrounding our client-side Trash implementation and our server-side snapshots
There is not much difference whether it is implemented inside or outside the NN.
configuration burden on sysadmins
We need to think about the total cost of ownership. Implementing it inside the NN certainly increases HDFS's own configuration burden, but implementing it in a separate system just moves that burden from HDFS to a new system, which in general would have a higher total cost.
inability to change the cleanup code without restarting the NameNode
Yes, that's the cost, but it should be minor. Users might change TTL policies frequently according to their requirements, but the cleanup code itself shouldn't change frequently (unless the implementation is crappy).
HA concerns (need to avoid split-brain or lost updates)
That's a good question. We haven't thought this through yet. It seems the cleanup code should only run on the active NN, since the standby doesn't have the latest updates and can't initiate edits.
It shouldn't introduce split-brain, as it doesn't change the NN's core flow, but it should be implemented carefully anyway.
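A rough sketch of the "only run on the active NN" rule, again in plain Java rather than real NameNode code. `ActiveOnlyCleaner`, its `HaState` enum, and the `deleted` list are hypothetical, and the state supplier stands in for whatever the NN would actually use to report its HA state. The point is only that the check happens before the pass and again before every delete, so a failover in the middle of a pass stops further deletions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical cleanup driver that refuses to act unless the node is active. */
class ActiveOnlyCleaner {
    /** Hypothetical HA state, for illustration only. */
    enum HaState { ACTIVE, STANDBY }

    private final Supplier<HaState> stateSource;
    // Stands in for the delete edits an active NN would issue.
    private final List<String> deleted = new ArrayList<>();

    ActiveOnlyCleaner(Supplier<HaState> stateSource) {
        this.stateSource = stateSource;
    }

    /**
     * Runs one cleanup pass over the given expired paths.
     * Returns false if the node was not active at the start or lost
     * the active role partway through; true if the whole pass ran.
     */
    boolean runOnce(List<String> expiredPaths) {
        if (stateSource.get() != HaState.ACTIVE) {
            return false; // a standby must not initiate edits
        }
        for (String path : expiredPaths) {
            // Re-check before each delete so a failover stops the pass.
            if (stateSource.get() != HaState.ACTIVE) {
                return false;
            }
            deleted.add(path);
        }
        return true;
    }

    List<String> deleted() {
        return deleted;
    }
}
```

This only covers the "who runs the cleanup" half of the HA question; how the resulting deletes are fenced and replayed would still be up to the NN's existing edit-log handling.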
error handling (where do users find out about errors?)
I can't think of any runtime errors (at the cleanup stage) that end users would need to be notified about. It should be the sysadmin who cares about errors at this stage, and he/she can find them in the logs. For errors that occur when users set a TTL through the command line or APIs, the users should be notified directly.
semantics: disappearing or time-limited files is an unfamiliar API, not like the traditional FS APIs we usually implement
First, there is not much difference whether it is implemented inside or outside the NN. Moreover, as long as users have a requirement for TTL-based cleanup, it shouldn't be difficult for them to accept such an API.
Making this pluggable doesn't fix any of those problems, and it adds some more:
The motivation isn't to fix possible problems with implementing the TTL policy on the server side; it's to separate the mechanism from specific jobs. It provides an elegant way to implement such an extension to the NN and makes the common part of such extensions reusable.
The only points I've seen raised in favor of doing this in the NameNode are:...
IMHO, the major points for doing this in NN are:
- it's more natural for end users: they interact with HDFS directly in most cases, so they shouldn't have to resort to another system just for their TTL requirements.
- lower maintenance cost (possibly lower implementation cost too, but that depends on the current state of the NN).
To the second point, HBase doesn't use coprocessors for cleanup jobs... it uses them for things like secondary indices, a much better-defined problem.
The HBase coprocessor is just an analogy... Possibly not a good one, but I can't think of a better one right now. HBase could use coprocessors for cleanup jobs. HBase's default cleanup policies are "Number of Versions" and "TTL", which are configured per Column Family. If you have a special requirement to clean up cells based on their content, for example using the value of a specific column as the "Number of Versions" to keep, you could do it with a coprocessor. You could certainly do the same thing in an MR job. I'm not saying that using coprocessors is good practice in general, but for some use cases it might be.
A bit of background about what we are doing: both Zesheng and I are from Xiaomi, a fast-growing mobile internet company in China. We are on a team that supports the company's data infrastructure using the open-source Hadoop ecosystem, and our role might be similar to that of some teams at Facebook. We make improvements to the open-source software based on the requirements of our products and would like to contribute those improvements back to the community. We have contributed quite a few patches to the HBase community, and two members of our team, Liang Xie and Honghua Feng, became HBase committers recently. We are improving HDFS at the same time and are also happy to collaborate with the community.
As for this specific feature proposal, an NN-side TTL implementation and a general NN extension mechanism, its feasibility isn't very clear to us yet, as it's just an idea so far. We'd like to spend some time investigating its feasibility further. If it is feasible, it's still our preferred approach. If we encounter insurmountable technical challenges, we will certainly give it up. So how about keeping this JIRA issue open for now (we might open another JIRA issue to track the general NN extension mechanism), and we will get back after we have done the investigation? Whichever approach we eventually choose, we always appreciate your help in working out the solution.