My real worry with your approach is that the average number of elements per row of A is likely to be comparable to p+k. This means that Y = A \Omega will be about as large as A. Processing that sequentially is a non-starter and the computation of Q without block QR means that Y is processed sequentially. On the other hand, if we block decompose Y, we want blocks that fit into memory because that block size lives on in B and all subsequent steps. Thus, streaming QR is a non-issue in a blocked implementation. The blocked implementation gives a natural parallel implementation.
I think you misunderstanding it a little. the actual implementation is not that naive. let me clarify.
First, there is blocking. More over, it's a hierarchical blocking.
the way it works, you specify block height, which is more k+p but ideally less than a MR split would host (you can specify more but you may be producing some network traffic then to move non-collocated parts of the split). Blocks are considered completely in parallel. Hence, initial parallelizm degree is m/r where r is average block height. They can (and are) considered independently, among the splits. "thin streaming QR" runs inside the blocks, not on the whole Y.
Secondly, Y matrix, or even its blocks, are never formed. What is formed is shifting intermediate Q buffer of the size (k+p)xr and intermediate upper triangular R of size (k+p)x(k+p). Since they are triangular, there's a rudimental implementation of Matrix itnerface called UpperTriangular not to waste space on lower triangle but still allow random access.
Thirdly, the hierarchy. when we form Q blocks, we will have to update them with Givens operations resulting from merging R matrices. This is done in combiner and this comes very natural to it. If there's say z blocks in a mapper then Q1 goes thru updates resulting from z merges of R, Q2 goes thru udpates resulting from z-1 merges and so on. Nothing being concatenated (or unblocked) there except for the R sequence (but it is still sequence, that is sequentially accessed thing) which i already provided memory estimates for. Most importantly, it does not depend on the block height, so you can shrink R sequence length if you have higher Q blocks, but higher Q blocks also take more memory to process at a time. there's a sweet spot to be hit here with parameters defining block height and split size, so it maximizes the thruput. for k+p=500 i don't see any memory concerns there in a single combiner run.
And there's no reducer (i.e. any sizable shuffle and sort) here. At the end of this operation we have a bunch of Rs which corresponds to the number of splits, and a bunch of interbediate Q blocks still same size which correspond to number of Q-blocks.
Now we can repeat this process hierarchically with additional map-only passes over Q blocks until only one R block is left. with 1G memory, as i said, my estimate is we can merge up to 1000 Rs per combiner with one MR pass (without extra overhead for single Q block and other java things). (in reality in this implementation there are 2 levels in this hierarchy which seems to point to over 1 bln rows, or about 1 mln Q blocks of some relatively moderate height r>>k+p, but like i said with just one map-only pass one can increase scale of m to single trillions ). This hierarchical merging is exactly what i meant by 'making MR work harder' for us.
There is a poor illustration of this hierarchical process in the doc that makes it perhaps more clear than words.
Also let me point out that the fact that the processes involved in R merging are map-only, which means that if we play the splitting game right in MR, there would practically be no networking IO per MR theory. This is very important imo for such scale. The only IO that occurs is to 'slurp' r sequences from HDFS before next stage of hierarchical R-merge. For a sequence of 1000 R, k+p 500, the size of R, dense and uncompressed, is approximately 1 mb each, so for a sequence of thousand Rs, the size of such slurp IO, dense and uncompressed, would be 1G, which is less than what i am having today in a single step with Pig for a 200k of proto-packed log records today in production and that finishes in a minute.
Bottom line, let's benchmark it. So we don't have to guess. Especially if we can do A vector streaming. I am personally having trouble with logistics of this so far, as i mentioned before. I will get to benchmarking it sooner or later. Important thing for me at this point was to make sure it does correct computation (which it does) and make educated guess about the scale (which is billion by million without vector streaming support or billion to gazillion with vector streaming support, with potential to extend m scale thousand times with each additional map-only pass over Q data (which is (k+p)xm which is again unbounded for n).