|
[
Permlink
| « Hide
]
Nigel Daley added a comment - 22/Sep/08 04:59 PM
Amar, can we get this done for 0.19?
My intention behind this was to safeguard the framework against unexpected changes in jobhistory. Some features heavily use jobhistory to get things done. Any incompatible change to jobhistory might break the framework or render these features useless. Here are few things related to jobhistory that are assumed
1) jobhistory filename 2) jobhistory file location(s) 3) versioning .. handled in 4) jobhistory format 5) jobhistory line structure 6) jobhistory line content i.e keys and values 7) Features like job-recovery expect few fields to be present in a line. Any change to do with removal of these fields or changing their content/representation should be detected. That's quite a few things to keep track of. Instead of trying to track them only via a test case, we can centralize these requirements in something like a schema definition, which can then be used by jobhistory clients to validate a history file before its usage. Thoughts?
I am thinking more in terms of a precautionary measure. Instead of creating a history file and then testing whether it adheres to some schema, it would be nice to catch the incompatible change early and let the developer know the effect of the changes made. Otherwise the only way to figure the incompatibility would be see why some of the tests failed and infer the cause. Someone trying to change the history might know about its effect on job-recovery or history-viewer.
Here is a proposal for the new testcase for job history:
Run a job and then parse the job history file and validate the following: Thoughts ?
This has to done in two locations : mapred.job.hitsory.location and mapred.job.history.user.location
I guess this will be done by the parser itself. If anyline doesn't match recType {KEY =" VALUE "}* pattern, the parser will fail.
This will not be the case always. Incase of speculative tasks, TaskAttemptIDs will be killed after the TaskID finishes. ok. Will validate the file name and location for both user log file and framework log file.
Currently parser would ignore the rest of the line if the pattern KEY="VALUE" is not found somewhere on the line and will not give any errors. In this testcase we validate if the whole line matches the expected regex. Missed mentioning in earlier comment. For this testcase, we disable speculative execution and see if all the attemptIDs are finished before TaskID finishes. Discussed offline with Amar and came up with the following approach for testing job history:
Run a job that will be succeeded and validate its history file content Run jobs with for the given values of hadoop.job.history.user.location as (1)null(default case), (2)"none", and (3)some dir like "/tmp". Run jobs that will be (1) succeeded (2) failed (3) killed. Thoughts ? Attaching patch with the above approach. Patch also has few code changes to TestJobHistoryVersion such that more issues related to new versions of history files are caught.
Please review the patch and provide your comments. As per Amar's comments, done some cleanup of code, moved runJob(), runJobKill(), runJobFail() from TestJobKillAndFail to UtilsForTests. Added runJobSucceed() to UtilsForTests. Modified other dependant tests also. Added validation checks for the task Attempt level Keys TRACKER_NAME, HOSTNAME, HTTP_PORT and COUNTERS.
Attaching patch for review. Please provide your comments. Looks good. Just one minor comment. Can you also check if count for failed/killed maps/reducers are correct. I think you can merge this check with failed/killed job. Thoughts?
Hi Amar,
FAILED_MAPS, FAILED_REDUCES are logged only from logFinished() — i.e. for a successful job. Will be checking FINISHED_MAPS, FINISHED_REDUCES, FAILED_MAPS, FAILED_REDUCES for the successful job for which all other job level keys are validated. Is the following code being added to validateJobLevelKJeys() look fine ? (looks like too much dependant on the fact that validateJobLevelKeys() is called only after UtilsForTests.runJobSucceed() and that internally calls UtilsForTests.runJob(). Hmm may be we can send expected number of finished_maps, finished_reduces, failed_maps, failed_reduces as parameters ???) if (status.equals("SUCCESS")) { // Since UtilsForTests.runJob() sets number of maps to be run as 1 and // number of reduces to be run as 0, we expect the following values // for the case of runJobSucceed() String fin_maps = values.get(Keys.FINISHED_MAPS); assertTrue("Unexpected number of finished maps in history file", fin_maps == 1); String fin_reduces = values.get(Keys.FINISHED_REDUCES); assertTrue("Unexpected number of finished reduces in history file", fin_reduces == 0); String failed_maps = values.get(Keys.FAILED_MAPS); assertTrue("Unexpected number of failed maps in history file", failed_maps == 0); String failed_reduces = values.get(Keys.FAILED_REDUCES); assertTrue("Unexpected number of finished reduces in history file", failed_reduces == 0); } -Ravi some comments:
1. Any reason for using LocalFileSystem for the tests?
I thought the tests should be done with DFS. 2. should we have a test with history location on dfs? 3. I think the test is still missing validation for some fieds such as SPLITS, because earlier we have seen bugs where history stopped logging splits. 4. I propose we should validate all the fields of atleast one map task and one reduce task. Since RunningJob handle and JobClient (you can create one) handle are available, we could get the expected values for all these and validate the values in history. 5. If the test case is looking too big, we can refactor the tests into different files. thoughts? 1 & 2. Local file system seems to be good enough.
3. Will add validation of format of SPLITS and TOTAL_MAPS, TOTAL_REDUCES, FINISHED_MAPS, FINISHED_REDUCES, FAILED_MAPS, FAILED_REDUCES. 4. Will do validation of all keys, values of history file by comparing them with actual values obtained from JT. Added code for validation of all keys, values of history file by comparing them with actual values obtained from JT.
Attaching the patch. Please review and provide your comments. Unit tests passed on my local machine.
ant test-patch gave: [exec] +1 overall. Some comments:
1) The Listener should check the format of the TaskID, TaskAttemptID, JobID strings (just doing id.forName()) is sufficient). 2) We should also catch the case where END statement comes before the START (if this happens in a real jobhistory file, the recovery manager will panic). 3) We should check for duplicates (like multiple START or END lines for the same subject) and catch them. 4) It would be nice to capture checks for the sequence of history lines when a TT running this job's tasks is lost.
5. Also check if the time is sane. START_TIME = 0 doesnt make any sense (see Thanks Devaraj for your comments. Done the validation of history files obtained from TestLostTracker, TestJobTrackerRestart, TestJobTrackerRestartWithLostTracker and identified issues in history files
Attaching patch after incorporating code changes based on Devaraj's & Amar's comments. This patch will work only with the patches of above 4 JIRAs. Please review the patch and provide your comments. Attaching new patch fixing a minor bug.
I just committed this. Thanks, Ravi!
Integrated in Hadoop-trunk #766 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/766/
Attaching patch for branch 0.20. Needs to be committed after committing patches of 5276 & 5278. Unit tests passed on my machine.
I committed this to the 0.20 branch.
Integrated in Hadoop-trunk #778 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/778/
. Moving the comment about 4191 to the section under 0.20. Editorial pass over all release notes prior to publication of 0.21.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||