There is already a seek(), so more than one mapper can read different parts of the same S3 file (using offsets) after the initial GET that reads in the file header. That file header is needed to:
1. determine the length of the blob
2. meet the standard expectation that "open() fails if the file isn't there"
Were it not for the second requirement, we could delay the open until the first read and so save one round trip (more relevant long-haul than in-EC2), but people don't expect that behavior.
What S3N does do is pretend that there is a block size for the data, so that the splitter can divide a file into blocks and hand each block off to a different mapper. You can configure this with "fs.s3n.block.size"; it defaults to 64 MB, but you are free to make it smaller or larger.
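As an illustration, here is a sketch of how that property could be set in core-site.xml; the 128 MB value is just an example choice, not a recommendation, and the value is given in bytes:

```xml
<!-- Illustrative core-site.xml fragment: raise the pretend block size
     from the 64 MB default to 128 MB so the splitter creates fewer,
     larger splits. Value is in bytes. -->
<property>
  <name>fs.s3n.block.size</name>
  <value>134217728</value> <!-- 128 * 1024 * 1024 -->
</property>
```

A larger value means fewer mappers per file; a smaller one means more mappers, each reading a shorter byte range of the blob.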
Even if you run 60 mappers against a 4 GB file, the bandwidth you get off an S3 blob won't be 60x that of a single mapper. S3 doesn't replicate data the way HDFS does, where aggregate read bandwidth scales as O(blocks * 3); for S3 it is effectively O(1). What does that mean? It means you won't get any speedup at the map phase, though a different number of mappers may make things better or worse at reduce time.
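A back-of-envelope sketch of that scaling; the 100 MB/s aggregate figure is an assumed, purely illustrative number, not a measured S3 rate:

```python
# Splitting a 4 GB file by the 64 MB default "block size" fixes the
# number of splits, but if S3 serves the blob at roughly constant
# aggregate bandwidth (O(1)), per-mapper throughput shrinks as
# mappers are added.

FILE_SIZE_MB = 4 * 1024        # 4 GB file
BLOCK_SIZE_MB = 64             # fs.s3n.block.size default

num_splits = FILE_SIZE_MB // BLOCK_SIZE_MB

AGGREGATE_MBPS = 100.0         # assumed fixed aggregate read bandwidth
per_mapper_mbps = AGGREGATE_MBPS / num_splits

print(num_splits)              # 64 splits -> 64 mappers
print(per_mapper_mbps)         # 1.5625 MB/s each, under this assumption
```

However many mappers you run, the total read time stays around FILE_SIZE / AGGREGATE_MBPS, which is why the map phase sees no speedup.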