|
I think we should also be able to do this via the web ui, which is very convenient.
There should be a way to make it not blacklisted any more. It should be persistent across jobtracker restarts. It probably should be decommissioning instead of blacklisting. It should probably start rerunning all of the running and stored tasks (including the map outputs that are stored there).
I think calling this blacklisting will lead to more confusion. As Owen suggested, we can call it decommissioning/recommissioning of trackers, which would essentially mean that irrespective of what state the tracker is in, the jobtracker is asked to decommission (rerun + ignore) or recommission (add back) it. So the commands would be bin/hadoop jobtracker -decommission tracker1,tracker2.... and bin/hadoop jobtracker -recommission tracker1,tracker2.....
All the running tasks (and completed maps) that were launched on that machine will be killed and rerun. We can reuse the lost-tracker code for doing this. Maybe a thread should be started on demand (similar to the cleanup queue thread) for a decommissioning request. These trackers will also be added to the ignore list (i.e. the jobtracker issues a 'shutdown' upon contact). So a decommission request is equivalent to lost-tracker + add-to-ignore-list. Upon a recommission, the trackers will be removed from the ignore list. This can be done inline.
From the web UI, a simple checkbox against each tracker can be provided, along with an action named 'Decommission' (similar to the actions for jobs on jobtracker.jsp). On the trackers page, we can provide another section for decommissioned trackers, with a checkbox for recommissioning them. Note : Thoughts?
One more thing I forgot to add is that the jobtracker already reads the hosts file and the exclude file, but just once. There is no refresh facility for it. I think we can add that to MR too. So here is the sequence of things :
Do we have a good security story for actions taken through the web UI? Absent that, I'd suggest we don't enable this there.
Being able to modify the excludes file and hup the server is probably good enough for an operator.
Had an offline discussion with Devaraj, and we think it makes sense to provide a default location for mapred.hosts.exclude. The purpose of doing this is to provide persistence. The default file would be something like ${hadoop.log.dir}/history/hosts.exclude. By default the jobtracker persists the decommission/recommission host info in this file.
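To make the persistence idea concrete, here is a minimal sketch of an exclude list that survives a restart. It is not the code from the patch: the class name PersistentExcludeList, its methods, and the plain-text one-host-per-line file format are all assumptions made for illustration, and the path passed in merely stands in for whatever mapred.hosts.exclude points at.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not the JobTracker's actual implementation.
class PersistentExcludeList {
    private final Path file;                          // stand-in for the exclude file location
    private final Set<String> excluded = new TreeSet<>();

    // Load any hosts persisted by a previous run (assumed format: one host per line).
    PersistentExcludeList(Path file) {
        this.file = file;
        if (Files.exists(file)) {
            try {
                for (String line : Files.readAllLines(file)) {
                    String host = line.trim();
                    if (!host.isEmpty()) {
                        excluded.add(host);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Decommission: add the hosts to the list and persist only if something changed.
    synchronized void decommission(Collection<String> hosts) {
        if (excluded.addAll(hosts)) {
            save();
        }
    }

    // Recommission: remove the hosts from the list and persist only if something changed.
    synchronized void recommission(Collection<String> hosts) {
        if (excluded.removeAll(hosts)) {
            save();
        }
    }

    synchronized boolean isExcluded(String host) {
        return excluded.contains(host);
    }

    // Rewrite the whole file; the list is expected to be small.
    private void save() {
        try {
            Files.write(file, excluded);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Constructing a second PersistentExcludeList on the same path plays the role of a jobtracker restart: the hosts decommissioned earlier are still reported as excluded.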
@Eric Attaching a patch implementing the above discussed approach. Result of test patch
[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 9 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] -1 release audit. The applied patch generated 472 release audit warnings (more than the trunk's current 469 warnings).
It is not clear why the release audit warnings are there. This patch has been tested on a local box and further testing is in progress. Will upload a new patch with the warnings fixed and with test cases. Also added a new parameter, mapred.permissions.supergroup, to allow admins to specify a supergroup. Either the user running the jobtracker or a user in the supergroup can issue admin commands.
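The admin rule stated above can be sketched as a single predicate. This is only an illustration of the rule, not the patch's code: the class AdminCheck, the method name, and the way the caller's groups are passed in are hypothetical, and the supergroup argument stands in for the value of mapred.permissions.supergroup.

```java
import java.util.Collection;

// Hypothetical sketch of the rule; names are not taken from the patch.
class AdminCheck {
    /**
     * A caller may issue admin commands if it is the user running the
     * jobtracker, or if it belongs to the configured supergroup.
     */
    static boolean canIssueAdminCommand(String caller,
                                        Collection<String> callerGroups,
                                        String jobTrackerOwner,
                                        String supergroup) {
        if (caller.equals(jobTrackerOwner)) {
            return true;
        }
        // A null supergroup means none was configured in this sketch.
        return supergroup != null && callerGroups.contains(supergroup);
    }
}
```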
Attaching a patch with the test case. Result of test-patch
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 15 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
Testing in progress. While testing the patch, we found that manually changing the excludes file (maintained by the jobtracker) results in a checksum error. So, for now, we think keeping it as it is (i.e. using Java to write/read) makes more sense. Also here are some of the test bugs :
Will upload a new patch soon.
>While testing the patch, we found that manually changing the excludes file (maintained by the jobtracker) results in checksum error.
For the NameNode, this is allowed. The JT should do the same. I think we are going through an expensive process of reinventing the wheel here. We should think about solving this sort of issue once, by maintaining such lists in a pluggable source of configuration and supporting the ability to "hup" the service.
We should then implement config in LDAP / SQL / or some other service via plugins, and then we can modify these configurations in an environment with lots of tools to support this stuff. Adding ad hoc commands and odd side files that will be lost if we need to swap hardware is awkward.
Attaching a patch fixing some bugs. Result of test-patch
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 15 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings
Running ant test now. Will post the results of ant test and cluster run. Attaching a patch that does what HDFS does. Testing the patch.
Result of test patch.
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 15 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
Running ant test
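As a rough illustration of what a refresh in the HDFS style involves, the sketch below compares the exclude list before and after the file is re-read: hosts that newly appear are candidates for decommissioning, and hosts that disappear are candidates for recommissioning. The class NodeRefreshSketch and its method names are hypothetical and are not taken from the patch.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of diffing the exclude list on refresh.
class NodeRefreshSketch {
    // Hosts present in the re-read exclude file but not in the old list:
    // these would be decommissioned (tasks rerun, tracker ignored).
    static Set<String> toDecommission(Set<String> oldExcludes, Set<String> newExcludes) {
        Set<String> result = new TreeSet<>(newExcludes);
        result.removeAll(oldExcludes);
        return result;
    }

    // Hosts in the old list that no longer appear in the exclude file:
    // these would be recommissioned (allowed to rejoin).
    static Set<String> toRecommission(Set<String> oldExcludes, Set<String> newExcludes) {
        Set<String> result = new TreeSet<>(oldExcludes);
        result.removeAll(newExcludes);
        return result;
    }
}
```

Computing both differences from copies leaves the input sets untouched, so the caller can swap in the new list only after both actions have been scheduled.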
+1. I've opened HADOOP-5772 for the same. Attaching a patch that tries to provide a refresh facility similar to HDFS. Result of test-patch
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 21 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
Ant test passed on my box. Attaching a patch incorporating Devaraj's offline comment. Result of test-patch
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 15 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
Ant tests passed on my box. Attaching a patch incorporating Devaraj's offline comments.
Result of test-patch
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 9 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
Running ant test now. I just committed this. Thanks, Amar! (Please add a release note describing the way to run the command for decommissioning TTs)
Integrated in Hadoop-trunk #833 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/833/)
. Adding the file src/test/mapred/org/apache/hadoop/mapred/TestNodeRefresh.java that got missed earlier. . Adds a way to decommission TaskTrackers while the JobTracker is running. Contributed by Amar Kamat.
We should do a svn delete of this file.
I feel checkSuperuserPrivilege() should be used for simply checking superuser privilege (without permission switch which is just in HDFS for now) and checkAccess() for making a guarded call to checkSuperuserPrivilege(). The reason for doing this was to keep both the MR and HDFS consistent wrt superuser checks.
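The split described in this comment might look roughly like the following. It is a sketch under assumptions, not the HDFS or MR code: the class SuperuserGuard, its fields, and the use of SecurityException are hypothetical, and the permissionsEnabled flag stands in for the permission switch mentioned above.

```java
// Hypothetical sketch of the two-method split; not the actual Hadoop classes.
class SuperuserGuard {
    private final boolean permissionsEnabled;  // stand-in for the permission switch
    private final String superuser;

    SuperuserGuard(boolean permissionsEnabled, String superuser) {
        this.permissionsEnabled = permissionsEnabled;
        this.superuser = superuser;
    }

    // Unconditional check: fails unless the caller is the superuser.
    void checkSuperuserPrivilege(String caller) {
        if (!superuser.equals(caller)) {
            throw new SecurityException(
                "Access denied for user " + caller + ": superuser privilege is required");
        }
    }

    // Guarded check: only enforces the superuser rule when permissions are switched on.
    void checkAccess(String caller) {
        if (permissionsEnabled) {
            checkSuperuserPrivilege(caller);
        }
    }
}
```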
Then why not use the name "checkSuperuserPrivilege" for superuser checks in both HDFS and MR? "checkAccess" does not seem to mean "check superuser". Also, "checkAccess" seems to be confusing in HDFS. In FSNamesystem, there are other methods, checkPathAccess(..), checkParentAccess(..) and checkAncestorAccess(..), which have nothing to do with superuser.
Integrated in Hadoop-trunk #834 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/834/
. Removing the empty file src/hdfs/org/apache/hadoop/hdfs/server/namenode/PermissionChecker.java. Nicholas,
If checkAccess() adds to the confusion, then we had better revert the renaming. I filed
Example patch for 0.20 not to be committed.
Example patch not to be committed.
Attaching a new patch for branch 0.20 merging the 2 patches. Note that this is an example patch for 20 and not to be committed.
|
Can we use this feature for blocking (blacklisting) and decommissioning TaskTrackers?