Currently, the split that a mapper processes is determined by a variety of parameters, including the dfs block size, min split size etc.
It might be useful to have an option when the users wants a mapper so scan 1 file. This will be specially useful for sort-merge join.
If the data is partitioned into various buckets, and each bucket us sorted, the sort merge join can join the different buckets together.
For example, consider the following scenario:
table T1: sorted and bucketed by column 'key' into 1000 buckets
table T2: sorted and bucketed by column 'key' into 1000 buckets
and the query:
select * from T1 join T2 on key
Instead of joining the table T1 with T2, the 1000 buckets can be joined with each other individually.
Since the data is sorted on the join key, sort-merge join can be used.
Say the buckets are named: b0001, b0002 .. b1000
Say table T1 is the big table, and the buckets from T2 are being read as part of the mapper which is spawned to process T1,
under the current approach, it will be very difficult to perform outer joins.
For example, if bucket b1 for T1 contains:
and the corresponding bucket for T2 contains:
If there are 2 mappers for bucket b1 for T1, processing 4 records each ((1,2,5,6) and (22.214.171.124) respectively.
It will be very difficult to perform a outer join. The mapper will need to peek into the previous record
and the next record respectively.
Moreover, it will be very difficult to ensure that the result also has 1000 buckets. Another map-reduce job
will be needed for the same.
This can be easily solved if we are guaranteed that the whole bucket (or the file corresponding to the bucket),
will be processed by a single mapper.