Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Won't Fix
Description
For the large datasets I work with, it is common to have lightweight keys and very heavy values (for example, integer keys mapped to large double arrays). The keys, however, are known in advance and do not change. It would be useful if Spark had a built-in partitioner that could take advantage of this; a FixedRangePartitioner[T](keys: Seq[T], partitions: Int) would be ideal. Furthermore, this partitioner type could be generalized to a PartitionerWithKnownKeys trait with a getAllKeys method, allowing the full list of keys to be obtained without scanning the entire RDD.
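A minimal sketch of what the proposed classes might look like. The local Partitioner trait here stands in for org.apache.spark.Partitioner so the example is self-contained; the class names, the range-assignment formula, and the PartitionerWithKnownKeys trait are all assumptions based on the description above, not an existing Spark API.

```scala
// Stand-in for org.apache.spark.Partitioner, so this sketch compiles
// without a Spark dependency. The real class would extend Spark's trait.
trait Partitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Hypothetical extension: a partitioner whose complete key set is known
// up front, so callers never need to query the RDD for its keys.
trait PartitionerWithKnownKeys[T] extends Partitioner {
  def getAllKeys: Seq[T]
}

// Hypothetical FixedRangePartitioner: splits a fixed, known key sequence
// into `partitions` contiguous ranges. No sampling pass over the data is
// needed, unlike Spark's RangePartitioner.
class FixedRangePartitioner[T](keys: Seq[T], partitions: Int)
    extends PartitionerWithKnownKeys[T] {
  require(partitions > 0 && keys.nonEmpty,
    "need at least one partition and one key")

  // Precompute each key's partition from its position in the known
  // ordering: position i of n keys lands in partition i * partitions / n,
  // so adjacent keys fall into the same or neighboring partitions.
  private val index: Map[Any, Int] =
    keys.zipWithIndex.map { case (k, i) =>
      (k: Any) -> (i * partitions / keys.length)
    }.toMap

  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = index(key)
  def getAllKeys: Seq[T] = keys
}
```

For example, `new FixedRangePartitioner(Seq(10, 20, 30, 40), 2)` places keys 10 and 20 in partition 0 and keys 30 and 40 in partition 1, entirely from the precomputed map.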