I recently played with the repartitionByRange method for DataFrame introduced in
SPARK-22614. For DataFrames larger than the one used in the unit test (which has only 10 elements), the method returns results that change from one run to the next.
To show the inconsistent behaviour, I start from the unit test code used for repartitionByRange (here), but I increase the size of the initial array to 1000, repartition into 3 partitions, and count the number of elements in each partition.
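A minimal sketch of such a test, assuming a local SparkSession. The object name `RepartitionByRangeCounts`, the helper `partitionSizes`, and the data layout (a shuffled range of integers paired with an index column) are assumptions for illustration, not the exact unit test code:

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

// Hypothetical reproduction harness; names and data layout are assumptions.
object RepartitionByRangeCounts {

  // Builds a DataFrame of n shuffled integers, repartitions it by range on
  // the "val" column, and returns the number of rows in each partition.
  def partitionSizes(spark: SparkSession, n: Int, numPartitions: Int): Array[Int] = {
    import spark.implicits._
    val data1d = Random.shuffle((0 until n).toList)
    val data2d = data1d.map(i => (i, data1d.size - i))
    data2d.toDF("val", "index")
      .repartitionByRange(numPartitions, $"val".asc)
      .rdd
      .mapPartitions(rows => Iterator(rows.size)) // one count per partition
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("repartitionByRange-counts")
      .getOrCreate()
    try {
      // Repeat the same repartitioning a few times and print the sizes,
      // so that any run-to-run difference is visible.
      (1 to 3).foreach { run =>
        println(s"run $run: " + partitionSizes(spark, 1000, 3).mkString(", "))
      }
    } finally {
      spark.stop()
    }
  }
}
```

Comparing the printed sizes across runs is enough to see whether the per-partition counts are stable for a given input size.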
I do not know whether this is expected (I will dig further into the code), but it looks like a bug.
Or am I just misinterpreting what repartitionByRange is for?