Jason Lowe, thanks for the great feedback. You pointed out some additional needed details in the design.
I'd also like to clarify a couple of things; hopefully that will set the stage for answering some of your questions in the process. I see now that these points may not have been as clear as I thought they were. Sorry this is quite long.
First, the only consumer of the reader locks (.in_use files) is the cleaner. No other component (neither clients nor the node manager) uses or checks these files. Furthermore, the only time the cleaner looks at these reader locks and the timestamps in their names is to prevent a race, as described in step W2 in the doc.
To determine whether a particular cached entry is stale, the cleaner first looks at the modification time of the directory that contains the cached file (more on the directory modification time later). Then the cleaner writes the cleaner lock to prevent clients from taking action, and proceeds to delete the cached entry. However, there is a window between the moment the cleaner determines the entry is stale and the moment it writes the cleaner lock. A client may come in at just the right time and start using the entry between those two points. For this reason, before proceeding to delete, the cleaner needs to double-check whether there has been any "recent" attempt to use the file, where "recent" means genuinely recent (a few seconds ago at most). The moment the cleaner detects a recent reader lock, it recognizes this race and skips the directory.
So the primary reason that these reader locks (and the associated timestamps) are needed is to prevent this race.
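As a concrete illustration, here is a minimal sketch of that race check, simulated on a local filesystem in Python. The real cleaner runs against HDFS, and the file names and thresholds here are hypothetical:

```python
import os
import time

def try_clean(entry_dir, staleness_secs, recent_secs):
    """Return True if the cached entry is safe to delete (sketch only)."""
    # Step 1: the entry looks stale if its containing directory has not
    # been modified for longer than the staleness threshold.
    if time.time() - os.path.getmtime(entry_dir) < staleness_secs:
        return False  # entry was used recently; skip it

    # Step 2: write the cleaner lock so clients back off.
    cleaner_lock = os.path.join(entry_dir, ".cleaner_lock")
    open(cleaner_lock, "w").close()

    # Step 3: double-check for a *recent* reader lock that snuck in
    # between steps 1 and 2. Reader locks are named '<timestamp>.in_use'.
    now = time.time()
    for name in os.listdir(entry_dir):
        if name.endswith(".in_use"):
            ts = float(name[: -len(".in_use")])
            if now - ts < recent_secs:
                os.remove(cleaner_lock)
                return False  # race detected; skip this directory

    # Safe to delete the cached entry (actual deletion elided).
    return True
```

The key point is that the reader-lock timestamps are consulted only in step 3, and only to catch a client that arrived inside the step 1/step 2 window.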
Another way of thinking about this: we could easily have come up with a design that does not need this timestamp at all, by relying on the directory modification time alone. The cleaner could check the directory modification time, write the cleaner lock, and then re-check the directory modification time to see whether there was any attempt to use the entry in between. In fact, I'm thinking I may want to modify the design to take that approach; the timestamp seems a bit confusing in terms of what it is used for. I'll probably make that change.
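A sketch of that mtime-only variant, again as a local-filesystem simulation. Two assumptions for illustration: the `after_lock` hook is just a test seam standing in for a client racing in, and the cleaner lock is placed outside the entry directory so that writing it does not itself bump the directory's modification time:

```python
import os
import time

def try_clean_mtime_only(entry_dir, staleness_secs, after_lock=lambda: None):
    """Safe-to-delete check using only the directory modification time
    (no timestamps encoded in reader-lock names). Sketch only."""
    mtime_before = os.path.getmtime(entry_dir)
    if time.time() - mtime_before < staleness_secs:
        return False  # entry was used recently

    # Write the cleaner lock. It lives *next to* the entry directory in
    # this sketch so creating it does not change the directory mtime.
    cleaner_lock = entry_dir.rstrip(os.sep) + ".cleaner_lock"
    open(cleaner_lock, "w").close()
    after_lock()  # a racing client would drop its reader lock here

    # Double-check: any client that raced in created a reader lock
    # inside the directory, which bumped its modification time.
    if os.path.getmtime(entry_dir) != mtime_before:
        os.remove(cleaner_lock)
        return False  # race detected; skip

    return True
```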
Now on to why I use the modification time of the directory that contains the cached entry (that is, the modification time of the directory on HDFS). I am using it to detect whether a new client came in and started using the cached entry. Since a client is required to drop a reader lock file into that directory, any use of the cached entry updates the modification time of the containing directory. It can therefore serve as a proxy for when the cached entry was last used. The timestamp is updated one more time when the client removes its reader lock at the end.
I originally considered using another file (like .last_used) to keep track of the last used time of cached entries. However, we were concerned about the impact of adding another file for each cached entry, and thus putting more pressure on the name node. Thus, we settled on looking at the modification time of the containing directory in lieu of that.
The only consumer of this information is also the cleaner.
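To make the proxy concrete, here is a small local-filesystem demonstration (HDFS bumps a directory's modification time on file creation and deletion in the same way; the lock file name is hypothetical):

```python
import os
import tempfile

def last_used_proxy_demo():
    """Show that dropping and removing a reader-lock file bumps the
    containing directory's mtime, so no separate .last_used file is needed."""
    d = tempfile.mkdtemp()
    os.utime(d, (0, 0))                        # pretend the entry is ancient
    before = os.path.getmtime(d)

    lock = os.path.join(d, "client-1.in_use")  # hypothetical lock name
    open(lock, "w").close()                    # client drops its reader lock
    after_create = os.path.getmtime(d)

    os.utime(d, (0, 0))
    os.remove(lock)                            # client releases at the end
    after_remove = os.path.getmtime(d)
    return before, after_create, after_remove
```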
With this, I'll add my comments to your points/questions.
I'm thinking of the general case of permissions - just because the job client has access to the local files during job submission does not mean the user wants all those files available to anyone with cluster access. It's probably less of an issue in practice if this is limited to just jars, but it's definitely an issue if this is expanded to other distcache file types (e.g.: data files for something like a map-side join).
Agreed. I need to add clarifying details here. If you set the boolean flag, the intent is that it covers only the job jar and the libjars, but it does not apply to other files (for example, -files or -archives are excluded from this). Let me know if you think that's reasonable. I'll add that clarification.
How is the case of orphaned temporary files any different than the orphaned read lock case? I would think the issue of staleness would apply there as well. If a temporary file is over a day old, it's highly likely to be orphaned. Nobody wants to wait a day to upload a distcache entry to HDFS, as it implies it would be on the same order of time to localize it later.
I see it's not explicitly stated, but checkAndUpload() may return the temp file instead of the jar file under some scenarios. This pertains to the error handling I mentioned in my previous comment. So it is possible that jobs may use these temp files directly. This is rather unlikely, but it is possible in theory. In that case, if the temp file were prematurely deleted, it could lead to localization failures.
Having said that, I think it may be possible to modify the logic of checkAndUpload() slightly so that it always returns the intended jar. It would make the algorithm a bit more involved (as it would entail retrying the rename), but I might be able to make that change. If that can be done, then any closed temp file can be considered safe for clean-up.
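For instance, the fallback could look something like this sketch. It is a local stand-in using hard links, since `os.link`, much like an HDFS rename that refuses to overwrite, fails when the destination already exists; the function name mirrors checkAndUpload() but the code is purely illustrative:

```python
import os

def check_and_upload(temp_path, final_path):
    """Always return the intended (final) jar path, never the temp file."""
    if os.path.exists(final_path):
        os.remove(temp_path)  # another client already published the jar
        return final_path
    try:
        # Atomically publish our copy. os.link raises FileExistsError if
        # the destination appeared in the meantime, analogous to an HDFS
        # rename failing because the target exists.
        os.link(temp_path, final_path)
        os.remove(temp_path)
        return final_path
    except FileExistsError:
        os.remove(temp_path)  # lost the race after the existence check
        return final_path
```

With this shape, callers never see a temp-file path, so temp files are always safe to clean up once closed.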
Speaking of long-running jobs, an alternative would be to use the YARN application ID (which clients grab just before submitting) as part of the read lock. Then the cleaner can query the ResourceManager to know for certain whether the job is still active.
This is a great point. I think the issue of long-running apps wasn't fully explored in the current version of the design. If there are apps that run beyond the specified staleness value, then the cleaner could erroneously clean up the cached entry, and it may result in localization failures. I think this needs to be addressed.
As you mentioned, probably the best way to solve this cleanly is to add the YARN app id so that the cleaner knows whether the app is active or not.
We originally shied away from using the app id because I wanted to keep the cleaner as minimal as possible, relying only on HDFS. But this may be a compelling reason to introduce the app id after all. I'll tinker with this and see if it works.
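If we go that route, the cleaner's staleness check could look roughly like this sketch. The lock-name format `<appId>.<timestamp>.in_use` and the `active_app_ids` set (which stands in for a ResourceManager query) are both hypothetical:

```python
import time

def is_lock_stale(lock_name, active_app_ids, staleness_secs, now=None):
    """Treat a reader lock as stale only if its application is no longer
    active *and* its timestamp is older than the staleness threshold."""
    app_id, ts, suffix = lock_name.rsplit(".", 2)
    assert suffix == "in_use"
    if app_id in active_app_ids:
        return False  # app still running: never consider the lock stale
    now = time.time() if now is None else now
    return now - float(ts) > staleness_secs
```

This way, long-running apps are protected for their whole lifetime, and the timestamp heuristic only applies to apps the RM no longer knows about.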
That would not be OK. The last job to initiate a reference on a distcache file is not necessarily going to be the last one to relinquish that reference. Job A starts first but is long-running, and job B starts later but is very quick. We do not want to delete job A's reference because it's older than job B. Otherwise job A could easily fail after job B completes if new tasks (think reducers or failed maps) are later launched on nodes that have not localized those distcache entries yet.
You're right that this type of inversion can happen. But I think there are also mitigating circumstances. This goes back to the observation that the only consumer of the reader locks is the cleaner, and that it looks at them only to determine whether the aforementioned race has occurred. So even if the inversion occurs, it has an impact only when combined with that race occurring. I think this could be addressed easily by performing this reader-lock clean-up only on recent cached entries.
At any rate, if we end up introducing the app id, this problem may be solved along the way as well.
Is the directory timestamp that important? We're localizing files (jars in this case), not directories. A stable timestamp of the file being localized is key to preventing unnecessary re-localization, but I don't see why that would be changing.
I hope I answered this question by explaining why we're using the modification time of the containing directory.