To split a SequenceFile into splits of a user-requested size, there is no way to avoid reading and rewriting the records. I think we have to just use the blockSize.
Correct, we have to split via the blocks.
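To make the block-based splitting concrete, here is a minimal sketch of how split boundaries could be derived from the file length and block size alone, without touching any records. The class name `BlockSplitSketch` and the method `blockSplits` are hypothetical names for illustration and are not part of any existing API. The sketch also ignores record boundaries; a real SequenceFile reader would presumably still have to align itself to the next sync marker after each split offset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: computes one split per block.
public class BlockSplitSketch {

    /** Returns {offset, length} pairs, one per block of the file. */
    public static List<long[]> blockSplits(long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last block may be shorter than blockSize.
            long length = Math.min(blockSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed example values: a 300-byte file with 128-byte blocks.
        for (long[] s : blockSplits(300L, 128L)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

With this approach the number of splits (and therefore tasks) is fixed by the file length and the block size, which is why a user-requested split size cannot be honored without rewriting the file.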
Unlike MapReduce, we are unable to queue tasks when they exceed the cluster capacity (I have no idea how to handle this at the moment).
There is nothing to figure out here: we have to disallow running more tasks than the cluster has capacity for. In YARN this issue is even worse, because you don't know the capacity.
From what I have discovered so far, the first one can ideally be achieved by applying a tiling strategy. We can then provide wrapper classes that let the user access the data according to the requested range.
How is this tiling going to work without rewriting the sequence files?