|
> we don't have appends [ ... ]
We will though. So, if you don't use HDFS, it might still be worth trying to use the FileSystem API to talk to the local FS, so that a subsequent move to HDFS would be trivial, a matter of configuration. Perhaps we should start supporting append in the FileSystem API and in the local implementation now, to permit this? Doug that makes sense. However, the concern there is this that for each task update, we need to do a sync to dfs (sync because we want to be sure the info is there in the dfs). That might be expensive when TIPs complete at a high rate, no? Also, even in this case, we most likely still need to do the edits log kind of merge since the file will just contain a bunch of updates.
There is one problem with this approach of periodic merges. We lose the information for completed tasks that completes in the interval between the last merge and the time when the JobTracker crashes. So in the next restart the JobTracker would try to reexecute these tasks but their outputs would already be present on the dfs and conflicts will happen when save output is invoked for these new attempts (this problem becomes a bit more complicated with the side files in the picture). We probably should have a strategy (job configurable?) to handle such cases on a per path basis - OVERWRITE (if an output path already exists) or ACCEPT (accept what is already there). Assuming idempotent tasks we probably should have the default as ACCEPT... bq There is one problem with this approach of periodic merges. We lose the information for completed tasks that completes in the interval between the last merge and the time when the JobTracker crashes. Are you suggest that you intend to use persisted job status as the basis for the job tracker to started from where it crashed? I think we need to work out what is the intended behavior to re-start a job tracker, and what needs to be persisted to support that behavior.
Agree. This jira is more about being able to save all Job states and being able to create images that can later be used to recreate the job data structures. Actually restarting a job would be the next step (a separate jira probably). This issue is about a framework to enable job restarts.
Agree and the proposal here doesn't intend to tweak 1876. It could leverage some APIs from there (just a thought), but at this point it seems like having edits log kind of a mechanism should work well.. Here is a proposal
1) A job can be one of the following states - queued, running, completed. Queued and running jobs needs to survive JT restarts and hence will be represented as queued and running folders under backup directory. The reason we need to do this is basically its cleaner to distinguish queued and running jobs. We need to see if we want this feature to be configurable (in terms of enabling backup mode and specifying backup location) or always check the queued/running folders under mapred.system.dir to auto-detect if the JT got restarted. For now this backup directory will be the mapred.system.dir. 2) On job submission, the job first gets queued up (goes in the job structures and also gets persisted on the FS as queued). For now we will use a common structure to hold both the queued and running jobs. In future (say after the new scheduler comes in) we might need to have separate queues/lists for the same. 3) A job jumps from queued state to running state only when the JobInitThread selects it for initialization/running [ consider initialized/expanded job is a running job]. As of now all the jobs will transit from queued to running immediately. But in future the decision of which job to initialize will be pretty complex/involved. 4) Running jobs need following information for restarts : 4.1) TIP info : What all TIPs are there in the job and what is their locality info.
This could be rebuilt from job.xml which is in *mapred.system.dir*. Hence on JT restart, we should be careful
while clearing the mapred system directory. Currently the JobTracker will switch to RESTART mode if there
is some stale data in the queued/running backup folders. If we decide to keep the backup feature configurable then the
JT will also check if its enabled.
4.2) Completed task statuses : (Sameer's suggestion) TaskStatus info can be obtained from the TTs.
More details are stated below (see SYNC ALGO)
5) SYNC ALGO : Algo for sync up the JT and TTs : 5.1) On Demand Sync :
Have SYNC operation for the TaskTrackers. Following are the ways to achieve on-demand sync
5.1.1) Greedy :
a) TT sends an old heartbeat to a new restarted JT. The JT on restart check the backup folders and detects
if its in restart mode or not.
b) Once the JT in restart mode receives a heartbeat which is not the *first* contact, it considers that the
TT is from its previous incarnation and sends a SYNC command.
c) TT receives a SYNC operation, adds the completed statuses of running jobs to the current heartbeat
and SYNCs up with the JT making this contact as *initial*.
d) JT receives the updated heartbeat as a new contact, updates the internal structures.
e) JT replies with new tasks if asked.
5.1.2) Waited :
Similar to 6.1.1 but doesn't give out tasks immediately. Waits for some time and then serves out the tasks.
The question to answer is how much to wait? How to detect that all the TTs have SYNCed?
For 5.1.1, the rate at which the TTs SYNCs with the JT will be faster and hence the
overhead should not be much. Also we could avoid scheduling tasks on SYNC operation. Thoughts?
5.2) Other options?
6) Problems/Issues : II) Consider the following scenario : 1. JT schedules task1, task2 to TT1
2. JT schedules task3, task4 to TT2
3. TT1 informs JT about task1/task2 completion
4. JT restarts
5. JT receives SYNC from TT1.
6. JT syncs up and schedules task3 to TT1
7. TT1 starts task3 and this might interfere with the side effect files of task3 on TT2.
8. In the worst case task3 could be running on TT1 and JT schedules task3 on TT1 in which case the local folders will also
clash.
One way to overcome this is to include identifiers to distinguish between the task attempts across JT restarts. We can use JT's timestamp as an identifier. III) The logic for detecting lost TT should not rely on missing data structures but use some kind of book keeping. We can now use 'missing data structures logic' for detecting when the TT should SYNC. Note that detecting a TT as lost (missing TT details) if different from declaring it as lost (10min gap in heartbeat). So for now we should Thoughts?
These are two different cases where To clarify a bit: we considered two different approaches for providing persistence in the JobTracker. The first one was to persist completed task information to a log file, similar to the method in the NameNode edits log. The second one is what Amar has described above, where TaskTrackers send 'Task Reports' to the restarted JT on demand. We feel the second approach is preferable, as it scales better.
However, the second approach does have a couple of issues:
Please do share your thoughts on whether these points make sense. There is a deficiency with the SYNC algo mentioned above. The task completion events in the JT will now be reordered. The copy of task completion events that the TT has is now garbled. Consider the following cases with the SYNC algo in place
1) A TT has reducers in SHUFFLE phase : Here the TT will have a partial copy of the previous version of the task completion events and hence will require the latest copy. 2) A TT has all the reducers in REDUCE phase before the JT restart and new reduce tasks assigned to it after the restart : Here the TT will have a complete copy of the previous version of the task completion events and hence it looks like it might not require the latest copy of completion events. But consider a case where some maps were lost and hence their output is not available. Since the ordering of the task completion events is different and some events in the JT might belong to re-executed maps, there is no good way to inform the TTs that a particular map was lost and it should use the new task completion event. Currently (with trunk) this is not a issue because there will be just one copy of the task completion events in the lifetime of the job. Hence the TT will always have the correct ordering of the completion events. In case of re-executions the new event will always be at the end. One simple solution would be to force the re-build of completion events copy at the TT on SYNC action. Attaching a review-patch implementing the above discussed design. Testing and optimizations are in progress. There is one known issue with the design and hence the patch is incomplete.
Consider the following case [Notations : JT@ti means JT (re)started at time t1, Ti@TTj means Task Ti completed on Tracker TTj] The design prefers the old map output location and silently ignores the new task completion event. In such a case R1 has missed the new event and will keep re-trying at the old location. Even though R1 will report fetch failures, it will be a no-op since JT@t2 doesnt know about M1@TT1. JT@t2 thinks that M1 is complete while R1@TT2 will wait for map-completion-event of M1 and hence the job will be stuck. Note that this also true if TT1 joins back after M1@TT3 completes where JT@t2 will delete the output of M1@TT1. Following is the change that might help to overcome this problem. Let the reducers fetch data for the same map from multiple sources (i.e R1 will keep fetching data from M1@TT1 and also from M1@TT3). The one that finishes first will invalidate the other. One optimization that can be done is that the reducer can continue fetching from the old output (since the timestamp is always there) and switch to the new event once there is a failure from the old event (i.e keep M1@TT3 as a backup and keep fetching from M1@TT1 until that fails after which switch to M1@TT3). Thoughts? Attaching a patch for review. Following are the changes
1) The bug discussed here 2) The job directories for restored jobs are checked for completeness before adding to the queue. A job directory is considered complete when it has job.xml, job.jar and job.split. 3) There was one corner case where the jobtracker dies with a job as completed (job dir missing) before communicating to the tasktracker i.e task trackers still have the task statuses for the completed job. The way this is handled is that the jobtracker on receiving an update request for a missing job will ask all the TTs to clear this job's details. 4) Restart mode turned off : The restart mode is turned off after some time. This is useful as we dont want the JT to entertain latecomers. The JT comes out of restart mode using the following equation 5) The web ui now shows the restart information. It shows whether the JT is still recovering and the time it has taken to recover. Issues taken care of : Known issues :
4) Some job level parameters can be changed dynamically after the job is submitted e.g job priority etc. Since it is not persisted, there is no way for the JT to load it upon restart. Also consider the following case where the user asks the JT to kill a job and the JT crashes before deleting the persisted structures. Upon restart the JT will continue to run the supposed-to-be-dead job. Hence all such external modifications (events) to the submitted job should be logged/persisted. Had an offline discussion with Devaraj. Attaching a patch that includes his comments.
1) Remove some security checks since HADOOP-3578 should take care of that. 2) Remove the queued concept since HADOOP-3444 should take care of that. As mentioned earlier, the job runtime will be useless once the JT restarts as the job could be running for more time than the time displayed. There are some ways in which this problem can be solved but looks like it might require some more work. We think that the problem of job-runtime, job-priority and job-kill/task-kill can be handled in a separate issue. With this patch we will show the new job runtime as 0 indicating that the job runtime is unknown. Some initial comments:
1) Remove the unnecessary comments from JobTracker.java 2) Rename the "restarted" field as "recovering" 3) hasJobTrackerRestarted/Recovered API 4) Remove the comment: "//TODO wait for all the incomplete(previously running) jobs to be ready" from offerService 5) Put back the call to completedJobStatusStore.store in finalizeJob 6) The method cleanupJob seems unnecessary. What is already done w.r.t cleanup will continue to work. 7) The implementation of wasRecovered and hasRecovered should not make a back call to the JobTracker 8) Synchronization for tasksInited in initTasks is redundant. Do a notify instead of notifyAll in the following line. 9) In the interval between the JT death and restart the reducers might fail to fetch map outputs from some tasktrackers (due to faulty map nodes, etc.), but it has no one to send the notifications to. The reducers might end up killing themselves after a couple of retries. 10) The construction of TaskTrackerStatus should be reverted to how it was done earlier (cloneAndResetRunningTaskStatuses called inline with the constructor invocation) 11) In TaskTracker.transmitHeartBeat you should call cloneAndResetRunningJobTaskStatuses rather than cloneAndResetRunningTaskStatuses 12) Pls move the SYNC action handling to the offerService method 13) shouldResetEventsIndex could be cleared upon the first access as opposed to doing it in the heartbeat processing 14) Instead of the additional RPC in Umbilical, you can add an arg in the getMapCompletionEvents to know whether to reset or not 15) Factor out common code from cloneAndResetRunningJobTaskStatuses/cloneAndResetRunningTaskStatuses This requires a testcase also
Attaching a patch that has the following changes
1) Devaraj's comments 2) Test case incorporated. It does the following 2.1) Start DFS, MR 2.2) Submit a job 2.3) Kill/Close the JT 2.4) Restart the JT 2.5) Check if the job got detected and was successful. 3) What is the effect of JT getting killed on the DFS i.e what happens to the incomplete DFS operation started by the JT before getting killed. Also what happens after the restart when the JT tries to access the same set of files. This requires investigation but as of now I dont see any noticeable effects. 4) Since the child task needs to reset it offset into the map-task-completion-events (TT's local copy), I have introduced a new class that encapsulates 4.1) An array of map-task-completion-events (if any) as requested by the child task 4.2) A boolean which decides whether the child should reset its offset. Me, Hemanth and Devaraj had a discussion on this and we feel its better and cleaner to do it this way. 5) Some bug fixes. Known issues (summary): The only problem with this
If we are counting on the TaskTracker's reports to rebuild the state, we should have a safe-mode equivalent where we wait for 2-3 minutes before launching new tasks, otherwise we will trash the cluster as each new TaskTracker reports back. Please also make sure that the TaskTracker does not reset and lose state if it gets an IOException when talking to the JobTracker.
However, rather than have TaskTracker's store additional information about the final task status of each completed task in ram, I think we should reconsider the option of using the JobHistory as a transaction log for each job. For storage on local disk, we probably should support writing a second copy to NFS so that a different node could bring up the JobTracker. In any case, the extra completed task state should be on disk rather than ram. We also need to make sure that the JobHistory is complete and consistent even after the restoration. +1 to Owen's comment.
A pressing need of our cluster is to not interrupt running jobs if the entire cluster has to be restarted. This means that job states have to be persisted in the form of a transaction log. This requirement is all the more beneficial to sites that have long-running job trackers (instead of HOD). However, isn't it better to be able to store state in HDFS? It is true that HDFS stores its transaction log in local files, but with the current focus on improving HDFS read/write latencies, HDFS itself is considering whether to store one copy of the transaction log in HDFS blocks (instead of NFS). In fact, if the JobTracker stores information in a org.apache.hadoop.fs.FileSystem, then a typical customer install could plug in various forms of storage to support the JobTracker transaction log. Using job history seems a reasonable approach. Some concerns though:
We're doing some tests related to the first two points and then can discuss the results. The completed task state in RAM is not introduced in this patch. I would recommend it be addressed in another JIRA, if it is an issue. Attaching a patch that implements JT restart using JobHistory.
Changes : Working : 1) For a new job, the history file is the master-file. For a restarted job, the history is written to the tmp file. 2) Following checks are made for a recovered job 3) Upon restart the master-file is read and default-history-parser is used to parse and recover history records. These records are used to create taskStatus which is replayed in order. Before replaying the JT waits for the jobs to be inited. 4) Once the replay is over, delete the master-file to indicate that the tmp file is more recent. Note that on next restart the tmp file will be used for recovery. 5) Once all the jobs are recovered, turn off the safe mode. JT will now process heartbeats (called as successful re-connect). Also the registration window timer starts. JT waits for tracker-expiry-interval time after last-tracker-re-connect before closing the window. Once the window closes, JT is considered as recovered. This plays an important role in detecting the trackers that went down while the JT was down. Upon recovery, JT re-executes all the tasks that were on the lost trackers. 6) Since the history can have some data missing, there can be a case where the map-completion-event-list at the JT is smaller than the one at the tracker. Hence there is a rollback required upon restart. Once the JT is out of safe mode, it passes this information (map-events-list-size) to the tracker on the successful reconnect. 7) The tasktracker rollbacks few events and asks the child tasks to reset their index to 0. Child tasks fetches all the events back and filters out necessary events for further processing. This is similar to the one discussed in approach #1. 8) Errors in history can cause the parser to fail. We have 9) Currently counters are stringified and written to history. It is not possible to recover the counter back from the string and hence this patch encodes the counter-names so that they can be easily recovered. Note that there is no encoding in the user space. Only the frameworks history file has codes. 10) Once the job finishes the tmp file is renamed to master-file. Similarly the history files in the user directory also follow the same renaming cycle. 11) Job priority is logged on every change and hence its recovered. Issues : 2) Detecting job runtime is still an issue. We are working on it. Todo : Note that the logs/debugging-code/testing-code is still a part of this patch as I am testing it. Regarding the file being zero size, you can work around like this (on trunk):
1. open the file with the FileSysterm.append() API. in a loop unteil it stops geenrating AlreadyBeingCreatedException. Opening a file for "append" tells the namenode to recover a lease, but only if the soft-limit time has expired. The soft-limit timer is 1 minute. The namenode then starts lease recovery and finds out the size of the last block of the file. Nowm if the client re-opens the file, it gets a new block list with updated block size.
The recently uploaded patch makes the block size of the history-file configurable. The parameter to change it is mapred.jobtracker.job.history.block.size (in bytes). For testing it, I started a small job on a single node cluster with history-block-size as 1k. It seems to work fine and the history file is now visible and available. With respect to the zero-size-file, I think we are ok to lose some small amount of information. I think you would not lose any data (even if small). If you have done a sync after a write to a HDFS file and then do not do anything else to that file for 1 hour, the namenode will recover that file and will make all the data that was synced earlier to appear in the file.
Is this time (1hr) configurable? Can it be configured on per file basis? The time os 1 hour is not configurable. This is the hard-limit of a lease recovery and is internal to the NameNode.
Had an offline discussion with Devaraj and Hemanth. Attaching a patch the incorporates their comments which are as follows
1) Avoid safe mode by delaying the start of ipc server. The ipc server starts only after the JobTracker has recovered. This avoids any extra coding at the tasktracker side. 2) Avoid having any registration window as TrackerExpiryThread will take care of the tracker that were lost while the tracker was down. 3) Remove unnecessary changes done to the Task/TaskAttempt logging with respect to passing of counters Things taken care from the todo list Things that need more work/discussion I am currently testing the patch on a larger cluster. I have seen that having a "safe-mode" kind of thing for the JobTracker is really nice to have (maybe not as part of this patch). When safe-mode is enabled, only commands from the adminsitrator are accepted. Many a times, when I upgrade a cluster and restart the job-tracker, user's cron-scripts immediately start submitting jobs to the tracker. Ideally, i would like the administrator to submit a few validate-cluster-kinda jobs before I open the cluster to general users.
I think that the running time should continue to be just finish - start time. The JT bounce is still wall clock time and it will be less confusing to include it than subtracting it out.
Attaching a tested patch with the test case. I need to test it more. There are still some issues with the patch.
1) Reduce progress seems to be 99.* even though the job is complete. Not sure if its related to the patch. 2) Test case shows exception on success. 3. Clean up the patch w.r.t comments, refactoring and optimizations Note that a temporary fix for One more issue that needs to be addressed is trashing . When the jobtracker restarts, it will recover logged tasks and schedule the rest. Some trackers that join early might get a task which is running on a tracker that has not yet joined. Under such a case both the attempts will run in parallel and the task that finishes first will kill the other. The problem with this is that the slots will be wasted. Also this will add to the job runtime if the tasks are long running. Some delay in opening the scheduling window might help. It looks like a minor issue for now and can be handled in a separate issue.
I like Amar's proposal. It helps administration as well. Many-a-times when we restart the jobtracker, we would like all task-trackers to check in before the job-tracker starts running jobs. HDFS has something called "safe-mode" that is automatically triggered when the namenode restarts. A similar thing wold be very helpful for the JobTracker. IT allows an administrator to restart a cluster, and then restart other pieces of software (e.g. etl pipelines) before opening the cluster to run jobs. I would like to open a seperate JIRA for doing this one.
Also, it would be nice to stop the JT from taking new submissions so the administrator can allow the running jobs to complete before upgrading the cluster. Even though we can persist these jobs, if there's a big upgrade or something else, the admin may want to drain all running jobs before stopping the cluster. Attaching a patch that has the following modifications
1) Renamed some of the parameters in JIP to maintain consistency 2) Whether or not the job recovery was successful or not is now displayed on the web-ui and JIP is changed accordingly 3) There was a bug in the earlier patch. In case of an incomplete TIP, the earlier patch ignored the attempts under it. This can be problematic if the attempt under the TIP has failed and the TIP remains in RUNNING state. This can lead to changed task-completion-events. 4) JIP has two failedTask() apis. One of which fails a already completed attempt and logs to the jobhistory. Earlier patch ignored this log. The new patch calls both the failedTask() apis in the way it should be called. 5) Testcase now includes the check for job execution ordering. It checks if the order in which the jobs we executed before the JT got killed is same as that upon restart. 6) Currently, the submit time of a JIP is the time when the constructor is invoked which is not true with restart. Hence earlier patch changed the submit time upon restart but did not log it to the history. So upon the second restart the previous submit time was lost. Added a api to log job submit time. 7) Added hashOf() wherever equals() was overridden to take care of findbugs warning. Tested on 500 nodes and work fine. Will continue testing. Ant test failed on 2 tests namely TestMapRed and TestLocalJobControl while findbugs warning were related to empty-catch-block and make-static-inner-class types. Investigating into it. Dhruba, Pete, your use-cases seem reasonable to me. Can you please open a new JIRA so they wouldn't be lost and add a comment with the number here ?
Dhruba opened Got an offline review from Devaraj. Comments are as follows :
1) isJobName could be named isJobNameValid 2) isJobDirClean could be named isJobDirValid 3) Revert the changes w.r.t Jobtracker state 4) The check for job.hasRestared() is redundant in RecoveryManager.recover() 5) JIP constructor should not do recovery. Refactor it and call from Jobtracker 6) Revert changes to job id 7) Revert changes to history filename and use filters for detecting the filename 8) Upon recovery, Jobtracker should not wait for all the jobs to be initialized. It should check if there is any history (w.r.t attempts) available and only then should wait for the initialization 9) Use job-priority for detecting if there are job level (meta) information available instead of num-attempts found/recovered 10) Counter should be logged at TIP level rather than attempt level 11) Remove Tasktracker-hostname logging and user tracker-name to find out the hostname. 12) Existence of an attempt should not be known to the Jobtracker. Move the logic back to job and tip. Vivek suggested to do Job.init() inside the jobtracker's recovery process and update the scheduler. Attaching a patch that incorporates Devaraj's comments. Testing in progress. Other changes are as follows
1) JobInProgress now has no notion of restarts and hence the web-ui remains unchanged. Its upto JobHistory to take care of restart and manage existing files. 2) RecoveryManager and Listener are cleaned and are now simple. Note that a working solution for
I think it is better to submit the solution of Also, I think it will be good to have a configuration variable to disable job recovery. This is to provide backwards compatibility, and also gives a way to disable this feature if required. There is a small bug in the patch.
JobHistory.java final Pattern historyFilePattern = Pattern.compile(JOBTRACKER_UNIQUE_STRING + "[0-9]+" + "_" + id.toString() + "_" + user + "_" + jobName + "+"); should be JobHistory.java final Pattern historyFilePattern = Pattern.compile(jobtrackerHostname + "_" + "[0-9]+" + "_" + id.toString() + "_" + user + "_" + jobName + "+"); as JOBTRACKER_UNIQUE_STRING itself contains a (new) timestamp and hence old history files wont match the pattern. Will upload a modified patch soon. Attaching a new patch.
Additions : 1) adds the switch (mapred.jobtracker.restart.recover) in hadoop-default.xml to turn on/off the recovery upon restart. By default the switch is switched off 2) some bug fixes 3) the test case now is improved to include
4) This patch now assumes Test :
Known issues : Todo : I'd like to see the history parsing and JobInProgress object creation/update happening inline. In the current patch, the JT memory requirement is doubled during recovery as well as the recovery time is doubled since you do two passes over the history data.
Remove the check for hasRestarted() in the recover method. That will be always true. Attaching a patch that incorporates Devaraj's comments. Following are the changes :
1) Listener is now stateless and inline. Tasks that are created but not finished are killed using ExpiryLaunchingTasks thread. 2) Counters are also logged as part of a successful attempt. This helps in making the listener stateless. 3) A dumb implementation of counter recovery is implemented in this patch and the tests ignore the counters while testing the equality of TaskReports. 4) The test case now test the effect of lost tracker while the jobtracker is down. 5) A job's state changes to SUCCEEDED/FAILED/KILLED only after garbage collect since garbage collect finalizes the recovery process. One comment on the patch.
Approach : The way history renaming is done in this patch is as follows
Problem : With trunk, only 1 dfs access is made while starting the log process for a job. With this patch there will be 4 dfs accesses
I think it makes more sense to create a new job file upon every restart. Before starting the recovery process, delete all the history files related to the job except the oldest file. Note that the history filename has timestamp in it so that detecting the oldest file will now easy. Example : Note that at a given time there will be at the max 2 history files per job. Overall it looks good to me..
A few comments: 1) Make the check for status as RUNNING explicitly in JobTracker.RecoveryManager.JobRecoveryListener.checkAndInit 2) Rename the variable 'cause' in JobHistory.Task.LogFailed failedDueToAttempt 3) Call JobInProgressListener.jobUpdated after the job recovery 4) ReduceTask need not check for copied maps upon restart as copyOutput already does it. Overall, this patch should be tested thoroughly under various conditions like map partially complete, reduces partially complete, the values for the counters being consistent across restarts, etc.
Done
Done
Done
Done
Updated the test case to make sure that the reducer sees map events (~50%) before killing the jobtracker. This will now test the rollback logic as the test case checks if the order in which the events are available at the tasktracker is same as the order available at the (restarted) jobtracker. I will manually test the patch to see the job-level counters, that are reconstructed from the history, match across restarts. I am waiting for Attaching a patch that incorporates Devaraj's comments.
Changes are as follows :
I have manually tested the patch w.r.t counter recovery and update. Some counters are recovered using
Attaching a patch that is updated to trunk. Following are the changes
1) There seems to be a bug in RawLocalFileSystem, see [here|HADOOP-4167]. A simple fix is to synchronize all such accesses to history directory in JobHistory. Clearly this solution wont work if some external process changes/deletes files from jobhistory directory. The attached patch synchronizes all the apis that might get invoked simultaneously. 2) JobHistory.parseHistoryFromFS() allows incomplete log lines to be passed to the listener. This is an issue for RecoveryManager as partial updates might lead to inconsistent state. One way to overcome this is to have a LINE_DELIMITER just to indicate that the line is complete. The only drawback with this approach is that its a backward incompatible change where the new parser (with line-delimiters) wont be able to parse old jobhistory files. Will open a jira to address this. For now I have implemented the LINE_DELIMITER approach. Attaching a patch updated to trunk.
Attaching a patch the fixes some findbugs/javadoc warnings. Following are the changes
1) [Counter|Group|Counters].contentEquals() are now synchronized. 2) Access to JobInProgress.launchTime is now synchronized 3) Fixed javadoc warning in JobHistory.java. test-patch passes the core tests on the local machine with few warnings. The attached patch fixes those warnings. Thanks Amareshwari for testing it for me.
Attaching a new patch that fixes a small bug in the previous patch. With
Ant test failed on the following test
Tested this patch on 100 nodes and restart work fine. +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12390114/HADOOP-3245-v5.36.3-no-log.patch against trunk revision 695604. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 22 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3264/testReport/ This message is automatically generated. I just committed this. Thanks, Amar (this was a long journey indeed)!
With a happy ending, I hope ! smile. Thanks Devaraj for the multiple reviews and improvements. Nice work folks! I am eagerly waiting to deploy 0.19 so that I can get this feature. Thanks guys.
I am finding the new TestJobTrackerRestart* tests take a very long time to run on my machine, upward of 10 minutes...if I'm doing anything else on the system at the same time the tests time out and get killed.
-how long does the test run take for other people on their desktop? if thes test does take a long time, is it possible to somehow group this into the 'slow test' category; otherwise my test cycle has just been extended by 50%. Steve,
The runtime of these tests are mainly because of the number of maps they run which is 50. The reason for running 50 maps is for having sufficient history size so that recovery can be tested. We can probably reduce the wait time to 1sec or so. I will check if I can reduce the test time and keep you guys posted. Integrated in Hadoop-trunk #611 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/611/
Integrated in Hadoop-trunk #618 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/618/
Integrated in Hadoop-trunk #640 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/640/
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
One requirement that we felt should be addressed is: if a job is partially complete when the JobTracker restarts, the user would expect to get all information about the completed tasks of this job transparently.
To address this requirement, we would need to persist information about every completed task. To solve this, we can probably take an approach similar to what is followed by the NameNode edit logs mechanism. We could have a master image file that stores a snapshot of the current state of a running job. When tasks of the job change state, we could store the update immediately to a log. Periodically, we could merge these updates to the master image file.
An alternative approach would be to update the image file periodically, batching updates. However in the interests of scale, and considering there may be frequent updates to tasks, we felt the earlier approach is a better one.
Another point we considered was whether we could store this information to DFS, similar to
HADOOP-1876. However, given that we don't have appends, and also that the a JobTracker restart may make us lose information written to this file, we feel that may not work very well.Comments ?