I understand that you want a way to set the replication of the index files. But why do the source file replication factor and the destination index file replication factor have to be the same?
jobfs.setReplication(srcFiles, repl); Here repl is used to set the replication of srcFiles. But srcFiles is not the actual source data. It is just an intermediate list of FileStatuses, written as a SequenceFile, which is read to generate the MR job splits immediately after it is created. First, HDFS will not have time to replicate it. Second, there is no benefit in increasing its replication, since it is read by the same client, and only once, as part of split generation. Also, srcFiles is deleted once the job is done.
On the other hand, the actual data files, which the mappers create as part files, get the default replication. The proposed patch does not change this. These need to be changed as well.
So, IMO, the user-specified 'replication' should be used for the resultant archive (both content and indexes), not for the intermediate file.
Also, since the default replication of 10 is not really used, we can change the default to 3 and update the docs as well.
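
Roughly what I have in mind for the archive outputs, as a sketch only. ReplicationSetter is a hypothetical stand-in for FileSystem.setReplication(Path, short), applyToArchive is a made-up helper name, and I am assuming the usual archive layout of part-* files plus _index and _masterindex. The only point is that the user-specified value goes to the archive outputs, and the intermediate srcFiles list is left alone.

```java
import java.util.ArrayList;
import java.util.List;

public class ArchiveReplicationSketch {

    /** Hypothetical stand-in for FileSystem.setReplication(Path, short). */
    public interface ReplicationSetter {
        boolean setReplication(String path, short replication);
    }

    /**
     * Applies the user-specified replication to the archive outputs only:
     * the part files holding the content, and the two index files.
     * The intermediate file list (srcFiles) is deliberately not touched.
     * Returns the paths that were updated.
     */
    public static List<String> applyToArchive(ReplicationSetter fs,
                                              String archiveDir,
                                              int numParts,
                                              short repl) {
        List<String> outputs = new ArrayList<>();
        // Content: part files written by the mappers (assumed naming: part-<id>).
        for (int i = 0; i < numParts; i++) {
            outputs.add(archiveDir + "/part-" + i);
        }
        // Indexes (assumed names).
        outputs.add(archiveDir + "/_index");
        outputs.add(archiveDir + "/_masterindex");

        for (String path : outputs) {
            fs.setReplication(path, repl);
        }
        return outputs;
    }
}
```

In the real patch it may be cleaner to pass the replication when the part and index files are created, rather than setting it afterwards. Either way the target is the same set of files.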