Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-20141

Turn hive.spark.use.groupby.shuffle off by default



    • Task
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • Spark
    • None


      xuefuz any thoughts on this? I think it would provide better out of the box behavior for Hive-on-Spark users, especially for users who are migrating from Hive-on-MR to HoS. Wondering what your experience with this config has been?

      I've done a bunch of performance profiling with this config turned on vs. off, and for TPC-DS queries it doesn't make a significant difference. The main difference I can see is that when a Spark stage has to spill to disk, repartitionAndSortWithinPartitions spills more data to disk than groupByKey - my guess is that this happens because groupByKey stores everything in Spark's ExternalAppendOnlyMap (which only stores a single copy of the key for potentially multiple values) whereas repartitionAndSortWithinPartitions uses Spark's ExternalSorter which sorts all the K, V pairs (and thus doesn't de-duplicate keys, which results in more data being spilled to disk).

      My understanding is that using repartitionAndSortWithinPartitions for Hive GROUP BYs is similar to what Hive-on-MR does. So disabling this config would provide a similar experience to HoMR. Furthermore, last I checked, groupByKey still can't spill within a row group.




            stakiar Sahil Takiar
            stakiar Sahil Takiar
            0 Vote for this issue
            3 Start watching this issue