Hive already have support for map-join. Map-join treats the big table as job input, and in each mapper, it loads all data from a small table.
In case the big table is already bucketed on the join key, we don't have to load the whole small table in each of the mappers. This will greatly alleviate the memory pressure, and make map-join work with medium-sized tables.
There are 4 steps we can improve:
S0. This is what the user can already do now: create a new bucketed table and insert all data from the small table to it; Submit BUCKETNUM jobs, each doing a map-side join of "bigtable TABLEPARTITION(BUCKET i OUT OF NBUCKETS)" with "smallbucketedtable TABLEPARTITION(BUCKET i OUT OF NBUCKETS)".
S1. Change the code so that when map-join is loading the small table, we automatically drop the rows with the keys that are NOT in the same bucket as the big table. This should alleviate the problem on memory, but we might still have thousands of mappers reading the whole of the small table.
S2. Let's say the user already bucketed the small table on the join key into exactly the same number of buckets (or a factor of the buckets of the big table), then map-join can choose to load only the buckets that are useful.
S3. Add a new hint (e.g. /*+ MAPBUCKETJOIN(a) */), so that Hive automatically does S2, without the need of asking the user to create temporary bucketed table for the small table.