The following is a list of features we plan to develop that distinguish Gridmix from its predecessors:
Workload is generated from the job history trace analyzer Rumen (MAPREDUCE-751).
Model the details of IO workload:
Much larger working set.
Input and output data volumes for every task.
Input and output record counts for every task.
Model the details of memory resource usage.
Model the job submission rates.
The main purpose of Gridmix is to evaluate MapReduce and HDFS performance. It will not, and is not intended to, capture improvements to layers on top of MapReduce. Like its predecessors, Gridmix will be principally a job submission client (though collecting high-level metrics- as was attempted in Gridmix2- would be a natural extension) and will be developed in several stages. A usable and useful V1.0 of the submitting client must satisfy the requirements in each of the functional categories listed below.
The context for the following list is unpacked in subsequent sections. Briefly, the following job properties will not be closely reproduced:
CPU usage. We have no data for per-task CPU usage, so we cannot attempt even an approximation (e.g. simulated tasks will never be CPU bound, though this surely happens in practice).
Filesystem properties. We will make no attempt to match block sizes, namespace hierarchies, or any property of input, intermediate, or output data other than the bytes/records consumed and emitted from a given task.
I/O rates. Related to CPU usage, the rate at which a given task consumes/emits records is assumed to be 1) limited only by the speed of the reader/writer and 2) constant throughout the job.
Memory profile. No information on tasks' memory use is available, other than the heap size requested. Given data on the frequency and size of allocations, a more thorough profile could be compiled and reproduced.
Skew. We assume records read and output by a task follow observed averages. This is optimistic, as fewer buffers will require resizing, etc. Further, we assume that each map generates a proportional percentage of each reduce's input.
Job failure. We assume that framework and hardware faults are the only cause of task failure, i.e. user code is correct.
Job independence. The output or outcome of one job does not affect when or whether another job will run.
Data Generation. When run on a cluster without sufficient data- a common case for testing new versions of Hadoop- the submitting client should have a data generation phase to populate the cluster with a configurable volume of data.
Reasonable distribution. We do not yet have sufficient data to generate a simulated block map. In the future, we may discover that this requires some processing of the Rumen trace to write data similar to that of the actual jobs. For now, we require only that the input data are distributed more-or-less evenly across the cluster, to avoid artificial hot-spots in the Gridmix data.
Content. Since the map will employ a trivial binary reader, there are no constraints on the generated data. Future versions of Gridmix could produce compressible or compressed data- or even mock data to be consumed by a non-trivial reader- but the only requirement we impose in V1.0 is that the generated data are random (i.e. not trivially compressible).
Namespace. The generated data will make no attempt to emulate the block size, file hierarchy, or any other property of the data over which the user jobs originally ran. The generated data will be written into a flat namespace.
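The data generation phase above can be sketched in plain Java. The class and method names below are illustrative, not part of Gridmix; the only behaviors taken from the design are the even per-node split of the requested volume and the requirement that generated bytes be random (not trivially compressible). Distributing the remainder across the first tasks is an assumption added here so the totals sum exactly.

```java
import java.util.Random;

public class DataGen {
    /**
     * Even split of totalBytes across the generator tasks (one per TT node,
     * per the design above). The remainder is assigned to the lowest-indexed
     * tasks so the per-task sizes sum exactly to totalBytes.
     */
    public static long bytesForTask(long totalBytes, int tasks, int taskIndex) {
        long base = totalBytes / tasks;
        long extra = taskIndex < totalBytes % tasks ? 1 : 0;
        return base + extra;
    }

    /** A buffer of random (i.e. not trivially compressible) bytes. */
    public static byte[] randomChunk(int size, long seed) {
        byte[] buf = new byte[size];
        new Random(seed).nextBytes(buf);
        return buf;
    }
}
```

A real generator task would loop, writing chunks like these to HDFS until its quota is met; no namespace or block-size emulation is attempted, matching the flat-namespace requirement above.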
Generic MR job. Gridmix must have a generic job to effect the workloads in the trace it receives from Rumen. For the first version, our focus will be on reproducing I/O traffic and memory.
Task configuration. The number of maps and reduces must match the Rumen trace description.
Memory usage. Lacking explicit data on actual memory usage in task JVMs, we assume that each task will consume and use most of the heap allocated to it. This is a pessimistic estimate- since it is expected that most users accept the default setting and use considerably less- but it will be predictable.
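The "consume most of the allocated heap" behavior above could be sketched as follows. The chunk size, page-touch stride, and class name are assumptions for illustration, not Gridmix internals; the point is only that the task allocates up to a target fraction of its heap and actually touches the pages so they are resident.

```java
import java.util.ArrayList;
import java.util.List;

public class HeapFiller {
    /**
     * Allocates and touches roughly targetBytes of heap in 1 MB chunks,
     * returning the chunks so they remain reachable (and thus not collected).
     * A caller would derive targetBytes from Runtime.getRuntime().maxMemory()
     * times the fraction of heap the simulated task should occupy.
     */
    public static List<byte[]> fill(long targetBytes) {
        final int CHUNK = 1 << 20; // 1 MB; an arbitrary choice for the sketch
        List<byte[]> chunks = new ArrayList<>();
        long allocated = 0;
        while (allocated < targetBytes) {
            int size = (int) Math.min(CHUNK, targetBytes - allocated);
            byte[] b = new byte[size];
            for (int i = 0; i < size; i += 4096) b[i] = 1; // touch each page
            chunks.add(b);
            allocated += size;
        }
        return chunks;
    }
}
```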
Map fidelity. Each map will read the number of records and bytes specified in the Rumen trace. The map is assumed to emit its output at an even rate, i.e. a fixed number of output records per input record, chosen to effect the correct total number of output records. Without data on the record read rate (and CPU usage), we cannot simulate effort expended per record, but the map itself will do some nominal work. We do not control block size, so if the data read in the original user job had a different block size, the proportion of remote and local reads will likely be affected.
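The even-rate emission described above can be expressed with a small piece of cumulative-target arithmetic: after consuming input record i of I, the task should have emitted floor(i * O / I) of its O output records, so rounding never accumulates and the totals match the trace exactly. The class and method names here are illustrative, not Gridmix code; the same logic applies to the reduce side.

```java
public class EvenRateEmitter {
    /**
     * Records to emit after consuming input record i (1-based) of
     * inputRecords, so that the per-record emissions sum exactly to
     * outputRecords while staying as even as integer arithmetic allows.
     */
    public static long recordsToEmit(long i, long inputRecords, long outputRecords) {
        long before = (i - 1) * outputRecords / inputRecords; // cumulative target so far
        long after = i * outputRecords / inputRecords;        // target including this record
        return after - before;
    }
}
```

The differences telescope, so summing recordsToEmit over all input records always yields exactly outputRecords, whether the job expands or contracts its input.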
Shuffle fidelity. We do not have sufficient data to describe skewed task output, e.g. a job with more than one reduce, such that at least one reduce receives most of its input from a single map task. Instead, we will assume that each map generates a percentage of bytes and records equal to that received by a given reduce. For example, if reduce r receives i% of the input records and j% of the bytes in the shuffle, then map m will output i% of its output records such that j% of its bytes go to reduce r. This will, with some error, match the input record and byte counts to each reduce. Note that combiners in user jobs will affect the byte and record counts we consider, but no combiner will be run. Whether Rumen reports the shuffled bytes/records as the map output- or the bytes output from the map prior to the combiner- will not be distinguished by the Gridmix submission client.
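The proportional assignment above can be sketched as a splitter that divides one map's output across the reduces in proportion to each reduce's observed share from the trace. Names are illustrative; the cumulative-rounding scheme (so the pieces sum exactly to the map's total) is an assumption of this sketch, and the same routine would be applied separately to records and to bytes.

```java
public class ShuffleSplitter {
    /**
     * Splits a map's total output across reduces in proportion to
     * reduceShare (e.g. the per-reduce shuffle input totals observed in the
     * Rumen trace). Cumulative rounding guarantees the pieces sum exactly
     * to total, so every reduce's input count is hit with at most one
     * record of rounding error.
     */
    public static long[] split(long total, long[] reduceShare) {
        long shareSum = 0;
        for (long s : reduceShare) shareSum += s;
        long[] out = new long[reduceShare.length];
        long cumShare = 0, cumOut = 0;
        for (int r = 0; r < reduceShare.length; r++) {
            cumShare += reduceShare[r];
            long target = cumShare * total / shareSum; // cumulative output through reduce r
            out[r] = target - cumOut;
            cumOut = target;
        }
        return out;
    }
}
```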
Reduce fidelity. Each reduce will- by definition- receive the set of bytes and records assigned to it from each map. As in the map, it will output at a constant rate, relative to the input records, such that it will output the correct number of records and bytes as described in the Rumen trace.
Input parameters. The Gridmix client will accept a Rumen job trace (from a URI or stdin), a working directory, and (optionally) the size of the data to be generated. One data-generating task will be started per TT node; each will generate the same amount of data.
Submission. Given that Gridmix is simulating a set of submitting clients, it must be possible to submit jobs concurrently in case split generation would otherwise cause a deadline to be missed. (Un)Fortunately, scalability limitations at the JobTracker impose an upper bound on the submission rate the client can effect. Errors in submission must be reported at the client.
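Concurrent submission at trace deadlines might be sketched with a scheduled thread pool, so that slow split generation for one job cannot delay another's submission. Everything here is a stand-in: the submit Runnable represents the real client-side submission call, the pool size is arbitrary, and error reporting would flow back through the submitted tasks.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TraceSubmitter {
    /**
     * Schedules one submission per deadline (millisecond offsets from now)
     * on a shared pool, returning a latch that opens when every submission
     * attempt has run. Concurrency means a long-running submission (e.g.
     * split generation) does not push later jobs past their deadlines.
     */
    public static CountDownLatch submitAll(long[] deadlinesMs, Runnable submit) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);
        CountDownLatch done = new CountDownLatch(deadlinesMs.length);
        for (long d : deadlinesMs) {
            pool.schedule(() -> {
                try {
                    submit.run(); // real client would report submission errors here
                } finally {
                    done.countDown();
                }
            }, d, TimeUnit.MILLISECONDS);
        }
        pool.shutdown(); // existing tasks still run; no new ones accepted
        return done;
    }
}
```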
Cancellation and shutdown. It should be possible to cancel a run of Gridmix at any time without leaving jobs running on the cluster. Gridmix must make a best-effort to kill any running jobs it has started before exiting through a shutdown hook.
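The best-effort cleanup described above maps naturally onto a JVM shutdown hook. This is only a sketch of the idea: RunningJobHandle is a hypothetical stand-in for the real running-job handle, and the key property is that a failure killing one job must not prevent the others from being killed.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class ShutdownKiller {
    /** Hypothetical handle for a job Gridmix has started. */
    public interface RunningJobHandle {
        boolean isComplete();
        void kill();
    }

    private final List<RunningJobHandle> started = new CopyOnWriteArrayList<>();

    public ShutdownKiller() {
        // Runs on normal exit and on SIGINT/SIGTERM, giving the best-effort
        // guarantee that no Gridmix jobs are left running on the cluster.
        Runtime.getRuntime().addShutdownHook(new Thread(this::killRunning));
    }

    public void track(RunningJobHandle job) {
        started.add(job);
    }

    /** Kills every tracked job that has not completed; best-effort only. */
    public void killRunning() {
        for (RunningJobHandle job : started) {
            try {
                if (!job.isComplete()) job.kill();
            } catch (RuntimeException ignored) {
                // one failed kill must not stop cleanup of the rest
            }
        }
    }
}
```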
Monitoring. Though V1.0 will not attempt to monitor job dependencies, and will merely submit each job at its appointed deadline, Gridmix must monitor the jobs it has submitted and perform success/failure callbacks for every completed job. For V1.0, it is assumed that Rumen will provide all necessary intelligence for processing the run at the JobTracker.
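The monitoring requirement above, one success/failure callback per completed job, might look like a polling pass over the set of submitted jobs. All names here are hypothetical; job status would really come from the JobTracker, represented below by two predicates, and the client would run this pass in a loop until the pending set is empty.

```java
import java.util.Iterator;
import java.util.Set;
import java.util.function.Predicate;

public class JobMonitor {
    /** Callbacks fired exactly once per completed job. */
    public interface Listener {
        void onSuccess(String job);
        void onFailure(String job);
    }

    /**
     * One monitoring pass: removes each completed job from pending and
     * fires the matching callback. Removal before the callback guarantees
     * a job is reported at most once even if polling repeats.
     */
    public static void poll(Set<String> pending, Predicate<String> isComplete,
                            Predicate<String> isSuccessful, Listener listener) {
        for (Iterator<String> it = pending.iterator(); it.hasNext(); ) {
            String job = it.next();
            if (isComplete.test(job)) {
                it.remove();
                if (isSuccessful.test(job)) listener.onSuccess(job);
                else listener.onFailure(job);
            }
        }
    }
}
```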
There are a number of ideas and approaches we elect to defer until either the need is demonstrated or more time is available for implementation. Among them:
Job throttling. Rather than attempt any sort of backoff at the client in response to JobTracker load, we take the position that a trace is an inviolate submission profile. This allows us to make comparisons between two runs of the same trace. However, if this is to replace GridMix and GridMix2 as the de facto load generator or if one wants to throttle submission on evidence that it interferes with some unrelated measurement, such throttling may become attractive.
Compression. The compressibility of input, intermediate, and output data can be mined from task logs. It would be possible to generate data that roughly matched the observed "compressibility" of user jobs, but this would require a targeted effort.
Designed jobs. Given that MapReduce has a range of widely used libraries- whose development is a considerable portion of JIRA traffic- tracking improvements to this core functionality in proportion to their prevalence in user jobs would aid in identifying bottlenecks in the application layer. Support for specifying InputFormats, OutputFormats, and even commonly used Mappers and Reducers would augment the synthetic mix.
Failure. Gridmix assumes that failures are caused by hardware or flaws in the framework, so it makes no attempt to inject failures into a particular run. Since this does not match some of the preliminary insights into the causes of task failure, it may be argued that the Rumen trace should include- and Gridmix should honor- application failures. Tracing where in the input the failure occurs- and which component fails- would be a way to characterize and measure approaches to mitigating resource waste. In particular, one may improve the proportion of "useful" work- and thus throughput- with a better strategy for punishing/throttling jobs that will ultimately fail.
Job dependencies. Modeling dependencies between jobs would remove a lower bound on job throughput by starting dependent jobs immediately once their prerequisites have been met. In general, one can imagine a submission client that would accept a list of prerequisites of which a submission time is merely the most common.
Isolation. We assume that Gridmix "owns" the cluster it's running on, i.e. the only jobs running on the cluster are those submitted by an instance of the client. Supporting load generation alongside user tasks would be another use for throttling job submission (and a possible alternative to very fine simulations of specific user tasks).