Your patch is not the right fix for this problem. If you look closely, the first fragment of replFiles is not added to the distributed cache, and hence we should not add all of replFiles. The first patch works only because replFiles[i] is null at the time this check is made, so it is skipped. Also, since you are not listing the files recursively, you do not hit the problem we hit in the second patch (the hidden _logs directory). The second patch's test case shows big numbers because of the hidden _logs directory in the replicated data path. I fixed the Utils.getPathLength method to ignore hidden files. This also affects the reducer estimation code path, which now ignores hidden directories while listing input; I feel that is the right way to go.
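
To illustrate the idea behind the getPathLength change, here is a minimal sketch of a recursive size computation that skips hidden entries. It is not the actual Utils.getPathLength code: it uses java.nio.file instead of the Hadoop FileSystem API so that it runs standalone, the class and method names (VisiblePathLength, isHidden, visibleLength) are hypothetical, and it assumes "hidden" means a name starting with '_' or '.', as with _logs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class VisiblePathLength {

    // Assumption: an entry is hidden if its name starts with '_' or '.'
    // (for example the _logs directory mentioned above).
    static boolean isHidden(Path p) {
        Path name = p.getFileName();
        if (name == null) {
            return false;
        }
        String s = name.toString();
        return s.startsWith("_") || s.startsWith(".");
    }

    // Returns the total size in bytes of all non-hidden files under path.
    // The hidden check applies only to entries found while listing, so a
    // path passed in explicitly is always counted.
    static long visibleLength(Path path) throws IOException {
        if (!Files.isDirectory(path)) {
            return Files.size(path);
        }
        long total = 0L;
        try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
            for (Path child : children) {
                if (!isHidden(child)) {
                    total += visibleLength(child);
                }
            }
        }
        return total;
    }
}
```

With a filter like this, a replicated input directory that contains a large _logs subdirectory reports only the size of its data files, which is the behaviour the second patch's test case was missing.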