The previous versions of this attachment missed one point.
The basic problem is that with the existing code base the progress is based on the records read from the input split, but there is buffering in the way pipes works. This makes the tasks appear to have made more progress than they deserve to have made, in jobs where the input splits are small.
To make speculation work under pipes with small input splits, two conditions have to be met:
1: The pipes code has to have an API to report progress, and has to use it. The old patch met this goal. You incant (&context)->serProgress(float) within HadoopPipes::Mapper.map(HadoopPipes::MapContext& context) . This does require that you have a way of measuring progress,which I consider likely because this is only needed when the input splits are small, which implies that the "input data" is really a signal to get the real data somewhere else [or to generate it].
2: The job has to be able to say that the progress that would otherwise be inferred from input split reads has to be ignored. This newest version of the patch does that; you can either call JobConf.setRecordReaderProgressDisabled(true), or set the attribute mapred.job.disable.record.reader.progress to true .
This patch addresses the second point. I did not mark it available because it needs a forward port. I attached it to this issue for comments, and for the record.