Attached a new patch that addresses Dick's comments.
1: Should TestSimulator*JobSubmission check to see whether the total "runtime" was reasonable for the Policy?
Currently, each policy is tested as a separate test case, so it would be hard to combine them and compare the virtual runtimes, which appear only in the console output. I did run some basic sanity checks manually after the run.
2: minor nit: Should SimulatorJobSubmissionPolicy/getPolicy(Configuration) use valueOf(policy.toUpperCase()) instead of looping through the types?
Updated in the patch based on the suggestion.
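For reference, a minimal sketch of the valueOf-based lookup suggested above. The enum name, the constants other than STRESS, and the fromName helper are assumptions for illustration and are not taken from the patch; note that the Java method is String.toUpperCase().

```java
import java.util.Locale;

// Hypothetical stand-in for the simulator's policy enum. Only STRESS is
// mentioned in this thread; the other constants are assumed for the sketch.
enum SubmissionPolicySketch {
  REPLAY, STRESS, SERIAL;

  // Parses a configured policy name case-insensitively, with no loop over
  // values(). Enum.valueOf throws IllegalArgumentException on unknown names.
  static SubmissionPolicySketch fromName(String name) {
    return valueOf(name.trim().toUpperCase(Locale.ROOT));
  }
}
```

One consequence of this approach is that an unrecognized policy name fails fast with an IllegalArgumentException, so the caller has to decide whether to catch it and fall back to a default.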
3: medium sized nit: in SimulatorJobClient.isOverloaded() there are two literals, 0.9 and 2.0F, that ought to be static private named values.
Added final variables to represent the magic constants, and added comments.
4: Here is my biggest point. The existing code cannot submit a job more often than once every five seconds when the jobs were spaced further apart than that and the policy is STRESS.
Please consider adding code to call the processLoadProbingEvent core code when we processJobCompleteEvent or a processJobSubmitEvent. That includes potentially adding a new LoadProbingEvent. This can lead to an accumulation because each LoadProbingEvent replaces itself, so we should track the ones that are in flight in a PriorityQueue and only add a new LoadProbingEvent whenever the new event has a time stamp strictly earlier than the earliest one already in flight. This will limit us to two events in flight with the current adjustLoadProbingInterval.
If you don't do that, then if a real dreadnaught of a job gets dropped into the system and the probing interval gets long, it could take us a while to notice that we're okay to submit jobs, in the case where the job has many tasks finishing at about the same time, and we could end up submitting tiny jobs as onesies every five seconds when the cluster is clear enough to accommodate lots of jobs. When the cluster can handle N jobs in less than 5N seconds for some N, we won't overload it with the existing code.
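The in-flight tracking proposed in point 4 could be sketched roughly as follows. The class and method names are hypothetical, and timestamps are plain longs rather than real LoadProbingEvent objects; the sketch only shows the "strictly earlier than the earliest in flight" rule.

```java
import java.util.PriorityQueue;

// Hypothetical helper: remembers the timestamps of load-probing events that
// have been scheduled but not yet processed.
class InFlightProbesSketch {
  private final PriorityQueue<Long> inFlight = new PriorityQueue<>();

  // Accepts (and records) a new probe only if nothing is in flight, or if
  // the proposed time is strictly earlier than the earliest one in flight.
  boolean offerProbe(long timestamp) {
    Long earliest = inFlight.peek();
    if (earliest != null && timestamp >= earliest) {
      return false; // an equally early or earlier probe is already pending
    }
    inFlight.add(timestamp);
    return true;
  }

  // Called when a probe is processed; drops it from the in-flight set.
  void probeFired(long timestamp) {
    inFlight.remove(Long.valueOf(timestamp));
  }

  int size() {
    return inFlight.size();
  }
}
```

A job-completion or job-submission handler would call offerProbe before enqueueing a new LoadProbingEvent, and the probe handler would call probeFired before scheduling its own replacement.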
I changed the minimum load probing interval to 1 second (from 5 seconds). Note that when a job is submitted, it could take a few seconds before the JT assigns the map tasks to TTs with free map slots, so reducing this interval further could lead to artificial load spikes.
I also added load checks after each job completion: if the cluster is underloaded, we submit another job (and reset the load checking interval to the minimum value). This does introduce a potential danger: if many jobs happen to complete at the same time, we could inject a lot of jobs into the system at once. But I think the risk is fairly low, so I am not too worried about it.
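To make the behavior described above concrete, here is a rough sketch of the bookkeeping. All names are hypothetical, and the doubling back-off and the maximum interval are assumptions, since this comment only states the 1-second minimum and the reset on an underloaded check.

```java
// Hypothetical sketch of the STRESS-policy probing logic; not the patch code.
class LoadProbingSketch {
  static final long MIN_INTERVAL_MS = 1_000L;   // 1 second, as stated above
  static final long MAX_INTERVAL_MS = 320_000L; // assumed upper bound

  private long intervalMs = MIN_INTERVAL_MS;
  private int jobsSubmitted = 0;

  // Periodic probe: back off while overloaded (doubling is an assumption),
  // otherwise submit a job and return to the minimum interval.
  void onLoadProbe(boolean overloaded) {
    if (overloaded) {
      intervalMs = Math.min(intervalMs * 2, MAX_INTERVAL_MS);
    } else {
      submitAndReset();
    }
  }

  // Extra check after a job completes: act only if the cluster is underloaded.
  void onJobComplete(boolean overloaded) {
    if (!overloaded) {
      submitAndReset();
    }
  }

  private void submitAndReset() {
    jobsSubmitted++; // stands in for submitting the next job
    intervalMs = MIN_INTERVAL_MS;
  }

  long getIntervalMs() {
    return intervalMs;
  }

  int getJobsSubmitted() {
    return jobsSubmitted;
  }
}
```

In this sketch each completion submits at most one job, so a burst of K simultaneous completions injects at most K jobs, which is the risk noted above.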