Currently, orderBy (sortBy) and groupBy in Hive on Spark uses unbounded memory. For orderBy, Hive accumulates key groups using ArrayList (described in
HIVE-15527). For groupBy, Hive currently uses Spark's groupByKey operator, which has a shortcoming of not being able to spill to disk within a key group. Thus, for large key group, memory usage is also unbounded.
It's likely that this will impact performance. We will profile and optimize afterwards. We could also make this change configurable.