Description
xuefuz, any thoughts on this? I think it would provide better out-of-the-box behavior for Hive-on-Spark users, especially for users who are migrating from Hive-on-MR to HoS. What has your experience with this config been?
I've done a bunch of performance profiling with this config turned on vs. off, and for TPC-DS queries it doesn't make a significant difference. The main difference I can see is that when a Spark stage has to spill to disk, repartitionAndSortWithinPartitions spills more data than groupByKey. My guess is that this happens because groupByKey stores everything in Spark's ExternalAppendOnlyMap, which keeps only a single copy of each key for potentially multiple values, whereas repartitionAndSortWithinPartitions uses Spark's ExternalSorter, which sorts all the (K, V) pairs and so doesn't de-duplicate keys, resulting in more data being spilled to disk.
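To make the guess above concrete, here is a toy model in plain Python. It is not Spark code and does not reproduce how ExternalAppendOnlyMap or ExternalSorter actually work internally. The function names (grouped_layout, sorted_layout, stored_key_count) and the sample data are made up for illustration. It only shows why a map-style layout holds each key once while a sort-based layout holds one key per record.

```python
from collections import defaultdict


def grouped_layout(pairs):
    """Toy stand-in for a map-style grouping: each distinct key is stored once."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def sorted_layout(pairs):
    """Toy stand-in for a sort-based layout: every (key, value) record is kept whole."""
    return sorted(pairs, key=lambda kv: kv[0])


def stored_key_count(layout):
    """Number of key copies held: one per dict entry, or one per record in a list."""
    return len(layout)


# Hypothetical sample data: 5 records, 2 distinct keys.
pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]
grouped = grouped_layout(pairs)
ordered = sorted_layout(pairs)

print("key copies, grouped layout:", stored_key_count(grouped))
print("key copies, sorted layout: ", stored_key_count(ordered))
```

With many values per key, the gap between the two counts grows, which would be consistent with the larger spill sizes seen for repartitionAndSortWithinPartitions, assuming the guess about key de-duplication is right.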
My understanding is that using repartitionAndSortWithinPartitions for Hive GROUP BYs is similar to what Hive-on-MR does. So disabling this config would provide a similar experience to HoMR. Furthermore, last I checked, groupByKey still can't spill within a row group.