Thanks for the ping; got distracted by other things. And thanks again for the detailed
review. My responses are below. I've generated a patch that shows the differences
between v2 and v4, and also the patch, in a state where it still depends on
is there anything blocking
MAPREDUCE-711 that prevents it from being committed?
Also, sorry about the multiple uploads here. I had a very clever bug in there
(caused by not thinking enough while resolving a merge conflict) that deleted
the current working directory, recursively, in one of the tests. (TaskRunner
is hard-coded to delete current working directory, which is ok, since it's
typically a child process; not ok for LocalJobRunner.)
I've run the "relevant" tests; the full tests take a while, so I'm running those
in the background.
$for i in TestMRWithDistributedCache TestMiniMRLocalFS TestMiniMRDFSCaching TestTrackerDistributedCacheManager; do; ant test
Dtestcase=$i > test-out$i && echo "$i good" || echo "$i bad"; done
There is quite a bit of refactoring in this patch, though I find it really useful.
Yep. Having DistributedCache work locally is easy if you refactor the code
a bit, so that's how I went at it.
Please make sure that in the newly added code, lines aren't longer than 80 characters. For e.g, see DistributedCacheManager.newTaskHandle() method.
A handful of "git diff foo..bar | egrep "+++|+ .
" has done the trick,
I think. The tricky bit is always fixing only the lines I've changed,
and not all the lines in a given file, to preserve history and keep
Just a thought, can the classes be better renamed to reflect their usage, something like TrackerDistributedCacheManager and TaskDistributeCacheManager?
I like those names better; thanks. Changed.
DistributedCacheManager and DistributedCacheHandle: Explicitly state in javadoc that it is not a public interface
This class should also have the variable number argument getLocalCache() methods so that the corresponding methods in DistributedCache can be deprecated. Also, each method in DistributedCache should call the correponding method in DistributedCacheManager class.
Don't think I agree here. We can deprecate the getLocalCache methods
in DistributedCache right away. They delegate to each other, and one of them
delegates to TrackerDistributedCacheManager. Ideally, I'd remove these
altogether — Hadoop internally does not use these methods with
this patch, and there's no sensible reason why someone else would,
but since it's public, it's getting deprecated. But it's not
being deprecated with a pointer to use something else; it's getting
deprecated so that you don't use it at all.
isClassPath can be renamed to shouldBePutInClasspath
Renamed to shouldBeAddedToClasspath.
paths can be renamed to pathsToPutInClasspath.
Renamed to pathsToBeAddedToClasspath
Use .equals method at +150 if (cacheFile.type == CacheFile.FileType.ARCHIVE)
I believe that technically it doesn't matter.
The JDK implementation of equals() on java.lang.Enum is final,
and hardcoded to "this==other". This is the only thing that makes
sense, since there's only ever one instance of a given Enum.
I took an inaccurate look at the code base, and == is the more
- Inaccurate! Not a static analysis! Not even close!
doorstop:hadoop-mapreduce(140142)$ack "([a-zA-Z]\.[a-zA-Z]\.equals" src | wc -l
doorstop:hadoop-mapreduce(140143)$ack "([a-zA-Z]\.[a-zA-Z] ==" src | wc -l
If you feel strongly about this, happy to change it, but I think == is
makeCacheFiles: boolean isArchive -> FileType fileType
I think it would be cleaner to return target instead of passing it as an argument.
makeCacheFiles() method should be documented
setup() This method is really useful, avoids a lot of code duplication!
Leave localTaskFile writing business back in TaskRunner itself. I think It is the task's responsiblity, not the DistributeCacheHandle's
Good call; done.
cacheSubdir can better be an argument to setup() method instead of passing it to the constructor.
Good idea; done.
getClassPaths() : Document that it has to be called and useful only when is already invoked.
Done. I've made it throw an exception if it's called erroneously, since I could
see that causing trouble for developers.
TaskTracker.initialize() A new DistributedCacheManager is created every time, so old files will not be deleted by the subsequent purgeCache. Either we should have only one cacheManager for a TaskTracker or DistributedCacheManager.cachedArchives should be static. The same problem exists with the deprecated purgeCache() method in DistributedCache.java
If my understanding is correct,
in the course of normal operations, TaskTracker.initialize() is only called once, by
TaskTracker.run(). The comment there says:
- The server retry loop.
- This while-loop attempts to connect to the JobTracker. It only
- loops when the old TaskTracker has gone bad (its state is
- stale somehow) and we need to reinitialize everything.
At startup time (and whenever the TaskTracker decides to lose all of its state
and reset itself, which is essentially the same), the TaskTracker initializes
the TrackerDistributedCacheManager object. This is also the only time it's
allowed to "clear" the cache, since there are for certain no tasks running
at initialization time that depend on those.
Part of the whole refactoring here is to avoid static members and methods
as much as possible, because they make it quite tricky to test, deprecate,
and generally play nice.
JobClient.java What happens to the code in here? Should this be refactored too/ are we going to do something about it?
Good question I do think that JobClient should be using some methods in the
filecache package, instead of hard-coding a lot of logic. That said, I chose to
stop somewhere to avoid making this patch even harder to review. I think it's
ripe for a future JIRA.
DistributedCache.java getLocalCache() (+190) should be indirected to DistributedCacheManager.getLocalCache(). Otherwise there is a stack overflow error - the method calls itself as of now.
Oy, good catch.
I think most of the getter/setter methods are deprecable and moveable to DistributedCacheManager. Or at least we should give some thought about it. For most of them, I do see from the javadoc that they are only for internal use anyways.
I agree. A few of them are used to manage the Configuration object. (In my mind, we're
serializing and de-serializing a set of requirements for the distributed cache
into the text configuration, and doing so a bit haphazardly.) I was very tempted
to remove all the ones that are only meant to be internal, but Tom advised me that
I need to keep them deprecated for a version.
Again, I think moving those methods into a more private place is a good task to do
along with changing how JobClient calls into this stuff.
DistributedCacheHandle.makeClassLoader use File.toURI.toURL() instead of File.toURL() directly. File.toURL() is deprecated.
LocalJobRunner.java TaskRunner.setupWorkDir needs to be fixed to have a workDir arg to help the code in LocalJobRunner. Compilation(ant binary) breaks now.
TODO: Test pipes with LocalJobRunner and make sure that it works, i.e. verify
I don't use Pipes typically, and it doesn't seem to compile on my Mac.
I'll try it on a Linux machine, but if it's easy/handy for you, it'd be great
to verify that bug.
TestDistributedCacheManager * Code related to setFileStamps in JobClient.java (+636 to +656) and testManagerFlow() (+71 to +74) can be refactored into an internal-only method in DistributedCacheManager.
I extracted the method and moved it to TrackerDistributedCacheManager. Now that
that has "Tracker" in the name, it's a little misplaced, but it's not too bad.
Minor: assertNonNull(+90) check can be moved after +91 and use localCacheFiles for the null check
TestMRWithDistributedCache Just documenting a TODO: Will need to fix this class's javadoc once
MAPREDUCE-711 is in.
Fix the comment at +84. It is actually two archives.
Please put a comment for the class saying that it is NOT A FAST test, keeping in view the recent efforts to have a fast test target.
Another point. We should consolidate TestMRWithDistributedCache, TestMiniMRLocalFS and TestMiniMRDFSCaching. That's a lot of code in three different files testing mostly common code. May be another JIRA issue if you feel so.
I agree. I didn't find the relevant tests right away, and there are some bits
covered by all of them. I'd prefer a separate JIRA.
That pretty much covers the points you brought up. Take another look?