Good point regarding the buffer size. However, the problem is that buffering cannot be completely disabled without, probably, major changes in the FileSystem part of the framework. This is because things like checksums depend on the buffer size. For instance, the buffer size must be at least io.bytes.per.checksum, otherwise the checksum algorithm won't work. So reducing the buffer size, although it makes good sense for the ramfs, doesn't guarantee that we won't run into OutOfMemory errors when we increase the number of maps from, let's say, 9000 to 18000 (with all outputs still fitting in the ramfs). One more thing worth noting here is that we will have ramfs, priority queue, and other merge data structures whose size is proportional to the number of files we merge at once.
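To make the constraint concrete, here is a minimal sketch of the lower bound on the buffer size and of how per-file buffers add up. The class and method names are hypothetical (this is not the actual merge code), and the estimate counts only one buffer per open file, ignoring the priority queue and other per-file structures.

```java
// Hypothetical helper for illustration only; not part of the framework.
final class MergeBufferSizing {
    private MergeBufferSizing() {}

    /**
     * Clamps a requested buffer size so that it never drops below
     * bytesPerChecksum (the value of io.bytes.per.checksum).
     */
    static int effectiveBufferSize(int requested, int bytesPerChecksum) {
        if (bytesPerChecksum <= 0) {
            throw new IllegalArgumentException("bytesPerChecksum must be positive");
        }
        return Math.max(requested, bytesPerChecksum);
    }

    /**
     * Rough lower bound on buffer memory when numFiles files are open at once,
     * assuming exactly one buffer of bufferSize bytes per file.
     */
    static long estimatedBufferBytes(long numFiles, int bufferSize) {
        return numFiles * bufferSize;
    }
}
```

Assuming, for illustration, a 512-byte io.bytes.per.checksum and a 4096-byte regular buffer, 18000 open files need about 73.7 MB of buffers at 4096 bytes each and about 9.2 MB at 512 bytes each. The footprint shrinks but still doubles when the file count goes from 9000 to 18000.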
So I chose a middle ground. In the merge code, when the merge opens a file in the ramfs, it uses the minimum buffer size, which is the value of io.bytes.per.checksum. I also added a config value, mapred.inmem.merge.threshold; its default value of 0 signifies that we DON'T want file-count-threshold-based merging. If we do want it (because of OutOfMemory errors), we can configure an appropriate value (like 5000 or so). This should give optimal performance for the typical use cases.
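The intended semantics of the threshold could be sketched roughly as below. The class and method names are hypothetical, and whether the real check uses >= or > against the file count is an assumption here; only the "0 means disabled" behavior comes from the description above.

```java
// Hypothetical policy object for illustration; not the actual patch.
final class InMemMergePolicy {
    // Value read from mapred.inmem.merge.threshold; 0 disables the check.
    private final int fileCountThreshold;

    InMemMergePolicy(int fileCountThreshold) {
        if (fileCountThreshold < 0) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        this.fileCountThreshold = fileCountThreshold;
    }

    /** True only when a positive threshold has been configured. */
    boolean thresholdEnabled() {
        return fileCountThreshold > 0;
    }

    /**
     * Decides whether the number of files currently held in the ramfs
     * should trigger a merge. Assumption: the merge fires once the count
     * reaches the threshold (>=).
     */
    boolean shouldMergeOnFileCount(int filesInRamfs) {
        return thresholdEnabled() && filesInRamfs >= fileCountThreshold;
    }
}
```

With the default of 0, shouldMergeOnFileCount never fires, so merging is driven only by whatever other triggers exist. With a value such as 5000, a merge would start once 5000 map outputs have accumulated in the ramfs, which caps the number of files (and hence per-file structures) involved in a single merge.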