Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: test
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Changes in job history might break the history parser, which in turn might break features like jobtracker recovery and the history viewer. There should be a testcase that catches these incompatible changes early and informs about the expected change.

      This patch validates job history files so that issues related to job history are caught when changes to JobHistory are made.
      It validates the history file name, file location, and parsability; validates the user log location for various configuration settings; and validates the job status for jobs that succeed, fail, or are killed.
      It also validates the format of the history file contents, and validates the contents themselves by comparing them with the actual values obtained from the job tracker.

      1. h20_4191.patch
        67 kB
        Ravi Gummadi
      2. HADOOP-4191.v4.1.patch
        66 kB
        Ravi Gummadi
      3. HADOOP-4191.v4.patch
        65 kB
        Ravi Gummadi
      4. HADOOP-4191.v3.patch
        59 kB
        Ravi Gummadi
      5. HADOOP-4191.v2.patch
        44 kB
        Ravi Gummadi
      6. HADOOP-4191.patch
        26 kB
        Ravi Gummadi


          Activity

          Nigel Daley added a comment -

          Amar, can we get this done for 0.19?

          Amar Kamat added a comment -

          My intention behind this was to safeguard the framework against unexpected changes in jobhistory. Some features heavily use jobhistory to get things done, and any incompatible change to jobhistory might break the framework or render these features useless. Here are a few things related to jobhistory that are assumed:
          1) jobhistory filename
          2) jobhistory file location(s)
          3) versioning ... handled in HADOOP-4190
          4) jobhistory format
          5) jobhistory line structure
          6) jobhistory line content, i.e. keys and values
          7) Features like job-recovery expect certain fields to be present in a line. Any change involving removal of these fields or changes to their content/representation should be detected.

          Vinod Kumar Vavilapalli added a comment -

          That's quite a few things to keep track of. Instead of trying to track them only via a test case, we could centralize these requirements in something like a schema definition, which jobhistory clients could then use to validate a history file before using it. Thoughts?

          Amar Kamat added a comment -

          I am thinking more in terms of a precautionary measure. Instead of creating a history file and then testing whether it adheres to some schema, it would be nice to catch the incompatible change early and let the developer know the effect of the changes made. Otherwise the only way to figure out the incompatibility would be to see why some of the tests failed and infer the cause. Someone trying to change the history might not know about its effect on job-recovery or history-viewer.

          Ravi Gummadi added a comment -

          Here is a proposal for the new testcase for job history:

          Run a job and then parse the job history file and validate the following:
          (1) history file name and location are validated.
          (2) Each line's format is checked against a regex.
          (3) Format of all TIMEs is checked against a regex.
          (4) Check whether the VALUE in the history file for a particular KEY is valid, specifically for KEYs like JOB_STATUS, TASK_TYPE, TASK_STATUS, JOB_PRIORITY.
          (5) Check that changes to properties like JOB_PRIORITY are logged to the history file.
          (6) We maintain a Map<String, List<String>> of TaskIDs to TaskAttemptIDs and see if all the attemptIDs are finished before the TaskID finishes.
          We match the finish of each TaskAttemptID and TaskID against its start to make sure that all TaskAttempts, Tasks, and Jobs started are indeed finished and that the history log lines for them are in proper order (see the sketch after this comment).

          Thoughts?
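
          A minimal sketch of checks (2) and (6), assuming the pre-0.21 text format in which each history line is a record type followed by KEY="VALUE" pairs; the class and method names are illustrative, not taken from the patch:

          import java.util.*;
          import java.util.regex.*;

          public class HistoryLineChecks {
              // Check (2): a line is a record type followed by KEY="VALUE" pairs,
              // e.g.  Task TASKID="task_..." TASK_TYPE="MAP" START_TIME="..." .
              private static final Pattern LINE =
                  Pattern.compile("\\w+( +\\w+=\"[^\"]*\")+ *\\.?");

              static boolean isWellFormed(String line) {
                  return LINE.matcher(line).matches();
              }

              // Check (6): every attempt of a task must finish before the task does.
              // Feed events in file order; taskFinished() fails if an attempt is open.
              private final Map<String, List<String>> openAttempts = new HashMap<>();

              void attemptStarted(String taskId, String attemptId) {
                  openAttempts.computeIfAbsent(taskId, k -> new ArrayList<>()).add(attemptId);
              }

              void attemptFinished(String taskId, String attemptId) {
                  List<String> open = openAttempts.get(taskId);
                  if (open == null || !open.remove(attemptId)) {
                      throw new IllegalStateException("FINISH without START: " + attemptId);
                  }
              }

              void taskFinished(String taskId) {
                  List<String> open = openAttempts.get(taskId);
                  if (open != null && !open.isEmpty()) {
                      throw new IllegalStateException(taskId + " finished with open attempts: " + open);
                  }
              }
          }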

          Amareshwari Sriramadasu added a comment -

          (1) history file name and location are validated.

          This has to be done in two locations: mapred.job.history.location and mapred.job.history.user.location

          (2) Each line's format is checked against a regex.

          I guess this will be done by the parser itself. If any line doesn't match the recType (KEY="VALUE" )* pattern, the parser will fail.

          (6) We maintain a Map<String, List<String>> of TaskIDs to TaskAttemptIDs and see if all the attemptIDs are finished before the TaskID finishes.

          This will not always be the case. In the case of speculative tasks, TaskAttemptIDs will be killed after the TaskID finishes.

          Ravi Gummadi added a comment -

          OK. Will validate the file name and location for both the user log file and the framework log file.

          Currently the parser ignores the rest of the line if the pattern KEY="VALUE" is not found somewhere on the line, and does not report any errors. In this testcase we validate that the whole line matches the expected regex (the difference is sketched below).

          I missed mentioning this in the earlier comment: for this testcase, we disable speculative execution and check that all the attemptIDs are finished before the TaskID finishes.
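
          The difference between the parser's behavior and the test's whole-line check, as a small self-contained sketch (the patterns are illustrative, not the parser's actual regex):

          import java.util.regex.*;

          public class ParseVsValidate {
              private static final Pattern PAIR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

              public static void main(String[] args) {
                  String line =
                      "Task TASKID=\"task_0001_m_000000\" garbage-here TASK_TYPE=\"MAP\"";

                  // What the parser effectively does: pick out KEY="VALUE" pairs
                  // wherever they occur, silently skipping anything in between.
                  Matcher m = PAIR.matcher(line);
                  while (m.find()) {
                      System.out.println(m.group(1) + " -> " + m.group(2));
                  }

                  // What the testcase does instead: require the whole line to match,
                  // so stray text like "garbage-here" fails the check.
                  boolean wellFormed = line.matches("\\w+( +\\w+=\"[^\"]*\")+ *\\.?");
                  System.out.println("well-formed: " + wellFormed);  // prints false
              }
          }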

          Ravi Gummadi added a comment -

          Discussed offline with Amar and came up with the following approach for testing job history:

          Run a job that will succeed and validate its history file content:
          (1) the history file exists and is in the correct location
          (2) verify that the history file is parsable
          (3) validate the contents of the history file:
          (a) validate job-level keys and values
          (b) validate task-level keys and values
          (c) validate attempt-level keys and values
          (d) check that all the TaskAttempts and Tasks started are finished

          Run jobs with the following values of hadoop.job.history.user.location: (1) null (the default case), (2) "none", and (3) some dir like "/tmp" (see the sketch after this comment).
          Validate the user history file location in each case.

          Run jobs that (1) succeed, (2) fail, and (3) are killed.
          Validate the job status read from the history file in each case.

          Thoughts?
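
          A sketch of the parameterized runs for hadoop.job.history.user.location; runJobSucceed() is the UtilsForTests helper added later in this issue, but its signature and the validation helper here are assumed, not copied from the patch:

          import org.apache.hadoop.mapred.JobConf;
          import org.apache.hadoop.mapred.RunningJob;

          public class UserLocationRuns {
              // null = default location, "none" = user history disabled,
              // otherwise an explicit directory such as "/tmp".
              static final String[] USER_LOCATIONS = { null, "none", "/tmp" };

              void runAndValidate() throws Exception {
                  for (String loc : USER_LOCATIONS) {
                      JobConf conf = new JobConf();
                      if (loc != null) {
                          conf.set("hadoop.job.history.user.location", loc);
                      }
                      RunningJob job = UtilsForTests.runJobSucceed(conf);  // assumed signature
                      validateUserHistoryFileLocation(job, loc);           // hypothetical helper
                  }
              }

              void validateUserHistoryFileLocation(RunningJob job, String loc) {
                  // Expect: a history file under the default location for null,
                  // no user history file for "none", and a file under loc otherwise.
              }
          }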

          Ravi Gummadi added a comment -

          Attaching a patch with the above approach. The patch also has a few code changes to TestJobHistoryVersion so that more issues related to new versions of history files are caught.

          Please review the patch and provide your comments.

          Ravi Gummadi added a comment -

          As per Amar's comments, I did some code cleanup: moved runJob(), runJobKill(), and runJobFail() from TestJobKillAndFail to UtilsForTests, added runJobSucceed() to UtilsForTests, and modified the other dependent tests. Also added validation checks for the task-attempt-level keys TRACKER_NAME, HOSTNAME, HTTP_PORT, and COUNTERS.

          Attaching patch for review. Please provide your comments.

          Amar Kamat added a comment -

          Looks good. Just one minor comment: can you also check whether the counts of failed/killed maps/reducers are correct? I think you can merge this check with the failed/killed job. Thoughts?

          Ravi Gummadi added a comment -

          Hi Amar,
          FAILED_MAPS and FAILED_REDUCES are logged only from logFinished(), i.e. for a successful job.
          I will be checking FINISHED_MAPS, FINISHED_REDUCES, FAILED_MAPS, and FAILED_REDUCES for the successful job for which all other job-level keys are validated.
          Does the following code, to be added to validateJobLevelKeys(), look fine? (It looks too dependent on the fact that validateJobLevelKeys() is called only after UtilsForTests.runJobSucceed(), which internally calls UtilsForTests.runJob(). Hmm, maybe we can pass the expected numbers of finished_maps, finished_reduces, failed_maps, and failed_reduces as parameters? A sketch of that variant follows this comment.)

          if (status.equals("SUCCESS")) {
            // Since UtilsForTests.runJob() sets the number of maps to be run as 1 and
            // the number of reduces to be run as 0, we expect the following values
            // for the case of runJobSucceed().
            String fin_maps = values.get(Keys.FINISHED_MAPS);
            assertEquals("Unexpected number of finished maps in history file",
                         1, Integer.parseInt(fin_maps));
            String fin_reduces = values.get(Keys.FINISHED_REDUCES);
            assertEquals("Unexpected number of finished reduces in history file",
                         0, Integer.parseInt(fin_reduces));
            String failed_maps = values.get(Keys.FAILED_MAPS);
            assertEquals("Unexpected number of failed maps in history file",
                         0, Integer.parseInt(failed_maps));
            String failed_reduces = values.get(Keys.FAILED_REDUCES);
            assertEquals("Unexpected number of failed reduces in history file",
                         0, Integer.parseInt(failed_reduces));
          }

          -Ravi
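
          The parameterized variant floated above might look like this (assertEquals is JUnit's; the method name is illustrative):

          // Pass the expected counts in, instead of hard-coding the assumption
          // that runJobSucceed() runs 1 map and 0 reduces.
          private void validateMapReduceCounts(Map<Keys, String> values,
              int expFinishedMaps, int expFinishedReduces,
              int expFailedMaps, int expFailedReduces) {
            assertEquals("Unexpected number of finished maps in history file",
                expFinishedMaps, Integer.parseInt(values.get(Keys.FINISHED_MAPS)));
            assertEquals("Unexpected number of finished reduces in history file",
                expFinishedReduces, Integer.parseInt(values.get(Keys.FINISHED_REDUCES)));
            assertEquals("Unexpected number of failed maps in history file",
                expFailedMaps, Integer.parseInt(values.get(Keys.FAILED_MAPS)));
            assertEquals("Unexpected number of failed reduces in history file",
                expFailedReduces, Integer.parseInt(values.get(Keys.FAILED_REDUCES)));
          }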

          Amareshwari Sriramadasu added a comment -

          some comments:
          1. Any reason for using LocalFileSystem for the tests?

          mr = new MiniMRCluster(2, "file:///", 3);

          I thought the tests should be done with DFS.

          2. Should we have a test with the history location on DFS?

          3. I think the test is still missing validation for some fields, such as SPLITS, because earlier we have seen bugs where history stopped logging splits.

          4. I propose we should validate all the fields of at least one map task and one reduce task.
          For map task attempt, we should validate TASK_ATTEMPT_ID, TASKID, START_TIME, FINISH_TIME, TRACKER_NAME, HTTP_PORT, TASK_STATUS, HOSTNAME, STATE_STRING, COUNTERS, ERROR (in case of failure)
          For map task, we should validate TASKID, START_TIME, FINISH_TIME, TASK_STATUS, SPLITS, COUNTERS.
          For reduce task attempt, we should validate TASK_ATTEMPT_ID, TASKID, START_TIME, SHUFFLE_FINISHED, SORT_FINISHED, FINISH_TIME, TRACKER_NAME, HTTP_PORT, TASK_STATUS, HOSTNAME, STATE_STRING, COUNTERS, ERROR (in case of failure).

          Since the RunningJob handle and a JobClient handle (you can create one) are available, we could get the expected values for all of these and validate the values in history (see the sketch after this comment).
          Thoughts?

          5. If the test case is looking too big, we can refactor the tests into different files. Thoughts?
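
          A sketch of the comparison proposed in item 4; the TaskReport accessors used here are assumed to be available via the JobClient handle, and the map of parsed history values is hypothetical:

          import java.io.IOException;
          import java.util.Map;
          import org.apache.hadoop.mapred.*;

          // For each map task, compare the times recorded in the history file
          // against the live values reported by the JobClient.
          void validateMapTaskTimes(JobClient client, JobID id,
              Map<String, Map<JobHistory.Keys, String>> historyTaskValues)
              throws IOException {
            for (TaskReport report : client.getMapTaskReports(id)) {
              Map<JobHistory.Keys, String> values =
                  historyTaskValues.get(report.getTaskID().toString());
              assertEquals(report.getStartTime(),
                  Long.parseLong(values.get(JobHistory.Keys.START_TIME)));
              assertEquals(report.getFinishTime(),
                  Long.parseLong(values.get(JobHistory.Keys.FINISH_TIME)));
            }
          }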

          Ravi Gummadi added a comment -

          1 & 2. Local file system seems to be good enough.

          3. Will add validation of format of SPLITS and TOTAL_MAPS, TOTAL_REDUCES, FINISHED_MAPS, FINISHED_REDUCES, FAILED_MAPS, FAILED_REDUCES.

          4. Will do validation of all keys, values of history file by comparing them with actual values obtained from JT.

          Ravi Gummadi added a comment -

          Added code to validate all keys and values of the history file by comparing them with the actual values obtained from the JT.

          Attaching the patch. Please review and provide your comments.

          Amareshwari Sriramadasu added a comment -

          patch looks fine

          Ravi Gummadi added a comment -

          Unit tests passed on my local machine.

          ant test-patch gave:

          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 18 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          Devaraj Das added a comment -

          Some comments:
          1) The Listener should check the format of the TaskID, TaskAttemptID, and JobID strings (just doing id.forName() is sufficient).
          2) We should also catch the case where an END statement comes before the START (if this happens in a real jobhistory file, the recovery manager will panic).
          3) We should check for duplicates (like multiple START or END lines for the same subject) and catch them (a sketch of checks 1-3 follows this comment).
          4) It would be nice to capture checks for the sequence of history lines when a TT running this job's tasks is lost.
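
          A sketch of checks 1) to 3): TaskAttemptID.forName() is the real parser for attempt-ID strings (it throws IllegalArgumentException on malformed input); the listener callbacks and the bookkeeping are illustrative:

          import java.util.*;
          import org.apache.hadoop.mapred.TaskAttemptID;

          public class StartEndChecks {
              private final Set<String> started = new HashSet<>();
              private final Set<String> ended = new HashSet<>();

              void onStart(String attemptId) {
                  TaskAttemptID.forName(attemptId);  // check 1: format of the ID string
                  if (!started.add(attemptId)) {     // check 3: duplicate START
                      throw new IllegalStateException("duplicate START for " + attemptId);
                  }
              }

              void onEnd(String attemptId) {
                  if (!started.contains(attemptId)) {  // check 2: END before START
                      throw new IllegalStateException("END before START for " + attemptId);
                  }
                  if (!ended.add(attemptId)) {         // check 3: duplicate END
                      throw new IllegalStateException("duplicate END for " + attemptId);
                  }
              }
          }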

          Amar Kamat added a comment -

          Some comments:

          5. Also check that the time is sane. START_TIME = 0 doesn't make any sense (see HADOOP-5276). Also make sure that FINISH_TIME > START_TIME for each attempt, task, and job (sketched below).
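
          Check 5 as a small sketch (the method name is illustrative):

          // Sanity checks on logged times: START_TIME must be a real timestamp
          // (HADOOP-5276 saw START_TIME = 0), and FINISH_TIME must come after it.
          static void validateTimes(String subject, long startTime, long finishTime) {
              if (startTime <= 0) {
                  throw new AssertionError(subject + ": suspicious START_TIME " + startTime);
              }
              if (finishTime <= startTime) {
                  throw new AssertionError(subject + ": FINISH_TIME " + finishTime
                      + " does not follow START_TIME " + startTime);
              }
          }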

          Ravi Gummadi added a comment -

          Thanks, Devaraj, for your comments. I validated the history files obtained from TestLostTracker, TestJobTrackerRestart, and TestJobTrackerRestartWithLostTracker, and identified issues in history files: HADOOP-5272, HADOOP-5276, HADOOP-5278, HADOOP-5282. Thanks, Amar, for fixing these new issues quickly. The testcase is also modified.

          Attaching a patch incorporating code changes based on Devaraj's & Amar's comments. This patch will work only with the patches of the above 4 JIRAs.

          Please review the patch and provide your comments.

          Ravi Gummadi added a comment -

          Attaching new patch fixing a minor bug.

          Devaraj Das added a comment -

          I just committed this. Thanks, Ravi!

          Hudson added a comment -

          Integrated in Hadoop-trunk #766 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/766/)
          Ravi Gummadi added a comment -

          Attaching a patch for branch 0.20. It needs to be committed after the patches for HADOOP-5276 & HADOOP-5278 are committed. Unit tests passed on my machine.

          Devaraj Das added a comment -

          I committed this to the 0.20 branch.

          Hudson added a comment -

          Integrated in Hadoop-trunk #778 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/778/). Moving the comment about 4191 to the section under 0.20.

          Robert Chansler added a comment -

          Editorial pass over all release notes prior to publication of 0.21.


            People

            • Assignee: Ravi Gummadi
            • Reporter: Amar Kamat
            • Votes: 0
            • Watchers: 1
