This looks like a consistency problem: S3 listings always lag the creation/deletion/update of objects.
The committer has listed the paths to merge in, then gone through each one to check its type: if an entry isn't a file, it lists the subdirectory, and, interestingly, gets an exception at that point. Perhaps the first listing found an object which was no longer there by the time the second listing ran; that is, the exception isn't a delay-on-create, it's a delay-on-delete. A simplified sketch of that walk follows.
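To make the failure point concrete, here is a hypothetical, heavily simplified version of that merge walk; this is not the actual committer code, just a structure inferred from the description above:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Simplified sketch of the merge walk described above; not the real committer code. */
public class MergeWalk {
  static void merge(FileSystem fs, Path src) throws IOException {
    for (FileStatus entry : fs.listStatus(src)) {   // first listing
      if (entry.isFile()) {
        // rename/copy the file into place (elided)
      } else {
        // the second listing happens inside the recursive call; if the entry
        // was stale (already deleted), listStatus throws FileNotFoundException there
        merge(fs, entry.getPath());
      }
    }
  }
}
```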
Create-listing delays could be handled in the committer by retrying on an FNFE (see the retry sketch below); it'd slightly increase the time before a failure, but as that's a failure path, it's not too serious. Delete delays could be addressed the opposite way: ignore the problem, on the basis that if the listing failed, there's no file to rename. That's more worrying, as it's a sign of a problem which could have implications further up the commit process: things are changing in the listing of files being renamed.
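As a rough illustration of the retry-on-FNFE idea (again a sketch, not committer code; the attempt count and sleep interval are made-up values):

```java
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical helper: bounded retry on FNFE to ride over create-listing delays. */
public class ListingWithRetry {

  static FileStatus[] listWithRetry(FileSystem fs, Path dir)
      throws IOException, InterruptedException {
    final int maxAttempts = 3;      // assumption: small, bounded attempt count
    final long sleepMillis = 1000;  // assumption: fixed sleep between attempts
    FileNotFoundException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return fs.listStatus(dir);
      } catch (FileNotFoundException e) {
        last = e;                   // entry from the parent listing isn't visible yet
        Thread.sleep(sleepMillis);
      }
    }
    throw last;                     // still missing: fail as before, just a bit later
  }
}
```

Handling a delete delay would be the inverse: catch the FNFE and skip the entry, since there's nothing left to rename.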
HADOOP-13345 is going to address list inconsistency; I'm doing a committer there which I could also try to make more robust even when not using a DynamoDB-backed bucket. The question is: what is a good retry policy here, especially given that once an inconsistency has surfaced, a large amount of the merge may already have taken place? Backing up and retrying may be dangerous in a different way.
One thing I would recommend trying: commit to HDFS, then copy. Do that and you can turn speculation on in your executors, get local virtual-HDD performance and networking, as well as a consistent view. Copy to s3a once all the work you want done is complete.
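A minimal sketch of that workflow (the paths and bucket name here are placeholders, not anything from the original setup):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

/**
 * Sketch: let the job commit its output to HDFS, then upload the
 * final result to S3 via s3a in one bulk copy afterwards.
 * Paths and bucket name are placeholders.
 */
public class CommitThenUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Path hdfsOutput = new Path("hdfs:///user/alice/job-output");  // committed job output
    Path s3Destination = new Path("s3a://my-bucket/job-output");  // final destination

    FileSystem hdfs = hdfsOutput.getFileSystem(conf);
    FileSystem s3a = s3Destination.getFileSystem(conf);

    // Only copy once the job has fully committed to HDFS; the commit itself
    // never touches S3, so listing inconsistency can't break it.
    FileUtil.copy(hdfs, hdfsOutput, s3a, s3Destination,
        false /* don't delete source */, conf);
  }
}
```

For large outputs, distcp is the usual tool for that final copy step.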