If a streaming job doesn't consume all of its input, the job can be marked successful even though its output is truncated.
Here's a simple setup that can exhibit the problem. Note that the job output will most likely be truncated compared to the same job run with a zero-length input file.
Examining the map task log shows this:
In PipeMapRed.mapRedFinished() we can see that it swallows IOExceptions and returns without waiting for the output threads and without throwing a runtime exception to fail the job. The net result is that the DFS streams could be shut down too early while the output threads are still busy, and we could lose job output.
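To make the race concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the actual PipeMapRed code: FinishSketch, Sink, startCopier, closeChildStdin, finishSwallowing and finishWaiting are all hypothetical names made up for illustration, and the "Broken pipe" IOException is simulated. The first variant swallows the exception and closes the sink while the copier thread is still writing; the second waits for the copier before closing.

```java
import java.io.IOException;

// Illustrative only: every name in this class is hypothetical, not a Hadoop API.
class FinishSketch {

    /** Stand-in for a DFS output stream: writes after close() are silently lost. */
    static final class Sink {
        private boolean closed;
        private int accepted;
        synchronized void write(String record) { if (!closed) accepted++; }
        synchronized void close() { closed = true; }
        synchronized int accepted() { return accepted; }
    }

    /** Stand-in for an output thread that copies the child's output to the sink. */
    static Thread startCopier(Sink sink, int records) {
        Thread t = new Thread(() -> {
            for (int i = 0; i < records; i++) {
                try { Thread.sleep(1); } catch (InterruptedException e) { return; }
                sink.write("record-" + i);
            }
        });
        t.start();
        return t;
    }

    /** Simulates flushing input to a child process that has stopped reading. */
    static void closeChildStdin() throws IOException {
        throw new IOException("Broken pipe");
    }

    /** Joins a thread, restoring the interrupt flag if the wait is interrupted. */
    static void joinQuietly(Thread t) {
        try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    /** Problem pattern: the exception is swallowed and the join is skipped. */
    static int finishSwallowing(int records) {
        Sink sink = new Sink();
        Thread copier = startCopier(sink, records);
        try {
            closeChildStdin();
            joinQuietly(copier);   // never reached when closeChildStdin() throws
        } catch (IOException e) {
            // swallowed: nothing waits for the copier, nothing fails the job
        }
        sink.close();              // sink goes away while the copier may still be writing
        joinQuietly(copier);       // only so the demo thread does not outlive this method
        return sink.accepted();
    }

    /** One possible fix: always wait for the copier before closing the sink. */
    static int finishWaiting(int records) {
        Sink sink = new Sink();
        Thread copier = startCopier(sink, records);
        try {
            closeChildStdin();
        } catch (IOException e) {
            // leftover input is tolerated in this sketch
        }
        joinQuietly(copier);
        sink.close();
        return sink.accepted();
    }

    public static void main(String[] args) {
        System.out.println("swallowing: " + finishSwallowing(50) + " of 50 records kept");
        System.out.println("waiting:    " + finishWaiting(50) + " of 50 records kept");
    }
}
```

The waiting variant always keeps every record, while the swallowing variant keeps however many the copier managed to write before the sink was closed, which depends on timing. Whether the real fix should wait and succeed or throw to fail the job is a separate policy question.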
Fixing this brings up the bigger question of what should happen when a streaming job doesn't consume all of its input. Should we grab all of the output from the job and still mark it successful, or should we fail the job? If the former, then we need to fix some other places in the code as well, since feeding a much larger input file (e.g., 600K) to the same sample streaming job results in the job failing with the exception below. It wouldn't be consistent to fail a job that leaves a lot of its input unconsumed but pass a job that leaves just a few leftovers.
Assuming the job returns a successful exit code, I think we should allow the job to complete successfully even though it doesn't consume all of its input. Part of the reasoning is that there's already this comment in PipeMapper.java that implies we want that behavior: