Consider a query like:
select B.y, count(1) from
A join B on A.x=B.x
group by B.y;
This requires 2 MR jobs. The first MR job performs the join, and the second MR job performs the group by (note that the 2nd MR job has an
identity mapper). If the first MR job could instead write the output of the join to an HBase table keyed by B.y, the 2nd MR job could be a map-only job that
simply scans the HBase table: an HBase scan returns rows in sorted key order, so all rows with the same B.y are adjacent. This idea can be extended to joins as well.
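
The proposal can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the tables A and B, the dict standing in for the HBase table, and the function names are all hypothetical, and no Hive or HBase API is used. It also assumes the row key is B.y plus a unique suffix, since a key of B.y alone would make joined rows with the same y overwrite each other.

```python
from itertools import groupby

# Hypothetical inputs standing in for tables A(x) and B(x, y).
A = [{"x": 1}, {"x": 1}, {"x": 2}, {"x": 3}]
B = [{"x": 1, "y": "p"}, {"x": 2, "y": "q"}, {"x": 3, "y": "p"}, {"x": 4, "y": "q"}]


def join_into_keyed_store(a_rows, b_rows):
    """Job 1: join A and B on x and write each joined row under a key led by B.y."""
    b_by_x = {}
    for b in b_rows:
        b_by_x.setdefault(b["x"], []).append(b)

    store = {}  # stands in for the HBase table (assumption)
    seq = 0     # unique suffix so rows sharing a y value do not collide
    for a in a_rows:
        for b in b_by_x.get(a["x"], []):
            store[(b["y"], seq)] = {"x": a["x"], "y": b["y"]}
            seq += 1
    return store


def scan_and_count(store):
    """Job 2: scan in key order and count each run of equal y, with no shuffle."""
    ordered_keys = sorted(store)  # mimics the sorted row-key order of a scan
    return [(y, sum(1 for _ in run))
            for y, run in groupby(ordered_keys, key=lambda k: k[0])]


result = scan_and_count(join_into_keyed_store(A, B))
print(result)  # [('p', 3), ('q', 1)]
```

In a real map-only job, each mapper would scan only part of the key range, so a group whose rows straddle a split boundary would still need its partial counts combined. The sketch ignores that case.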