  1. Spark
  2. SPARK-44759

Do not combine multiple Generate operators in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large



    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
    • Fix Version/s: None
    • Component/s: Deploy, Optimizer, Spark Core
    • Labels: None
    • Flags: Important


      This issue has existed since whole-stage code generation (WSCG) was implemented for the Generate node.

      Because WSCG computes rows in batches, combining WSCG with the explode operation consumes a large share of the executor's memory. The problem is worse when a single WSCG node contains multiple explode nodes, which is the case when flattening a nested array.

      The Generate node used to flatten an array generally produces significantly more output rows than it receives as input.

      The number of output rows is drastically higher still when flattening a nested array.
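
      To make the row multiplication concrete, here is a minimal pure-Python sketch. It is not Spark code: the `explode` helper, the row shape, and the sample data are illustrative assumptions that only mimic two chained explode steps on a nested array.

```python
from typing import Any, Iterable, Iterator, List, Tuple


def explode(rows: Iterable[Tuple[Any, List[Any]]]) -> Iterator[Tuple[Any, Any]]:
    """Emit one (key, element) row per array element, like a single explode."""
    for key, array in rows:
        for element in array:
            yield key, element


# One input row holding a nested array: 3 outer elements, each with 4 inner elements.
input_rows = [("row-0", [[f"v{i}{j}" for j in range(4)] for i in range(3)])]

after_first_explode = list(explode(input_rows))            # one row per outer element
after_second_explode = list(explode(after_first_explode))  # one row per inner element

print(len(input_rows), len(after_first_explode), len(after_second_explode))
# prints: 1 3 12
```

      With an outer array of length n and inner arrays of length m, each input row becomes n * m output rows, so the growth is multiplicative rather than additive.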

      When more than one Generate node is combined in the same WholeStageCodeGen node, there is a high risk of running out of memory, for the following reasons.

      1. As the screenshots added in the comments show, the rows created in the nested loop are saved in a writer buffer. In this case, because the rows were large, the job failed with an out-of-memory error.

      2. The generated WholeStageCodeGen code is a nested loop that, for each input row, explodes the parent array and then explodes the inner array. The rows are accumulated in the writer buffer without accounting for the row size.
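
      The buffering behaviour described in the two points above can be modelled with a simplified pure-Python sketch. This is not the Java code that Spark generates. The function names, the 1 KiB element size, and the array dimensions are assumptions chosen only to contrast a nested loop that keeps every output row in one buffer with one that hands each row downstream immediately.

```python
from typing import List


def peak_bytes_buffered(nested: List[List[bytes]]) -> int:
    """Nested loop that appends every output row to a single buffer.

    Returns the peak number of payload bytes held for one input row.
    """
    buffer: List[bytes] = []
    held = 0
    peak = 0
    for inner in nested:            # first explode: outer array
        for element in inner:       # second explode: inner array
            buffer.append(element)  # row is kept regardless of its size
            held += len(element)
            peak = max(peak, held)
    return peak


def peak_bytes_streamed(nested: List[List[bytes]]) -> int:
    """Same nested loop, but only one output row is live at a time."""
    peak = 0
    for inner in nested:
        for element in inner:
            peak = max(peak, len(element))
    return peak


# Hypothetical input row: 10 outer elements x 20 inner elements of 1 KiB each.
element = b"x" * 1024
nested_array = [[element] * 20 for _ in range(10)]

print(peak_bytes_buffered(nested_array))  # 204800
print(peak_bytes_streamed(nested_array))  # 1024
```

      In this model the buffered variant holds all n * m rows of a single input row at once, so its peak grows with both array lengths and with the element size, whereas the streamed variant is bounded by the largest single row.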

      Please see the attached Spark UI and Spark DAG screenshots.

      In my case, the WholeStageCodeGen node includes 2 explode nodes.

      Because the array elements are large, we end up with an out-of-memory error.


      I recommend that we do not merge multiple explode nodes into the same WholeStageCodeGen node, since doing so can lead to memory issues.

      In our case, the job failed with an OOM error because the WSCG code executed as a nested for loop.





        1. image-2023-08-10-09-27-24-124.png
          28 kB
          Franck Tago
        2. image-2023-08-10-09-29-24-804.png
          64 kB
          Franck Tago
        3. image-2023-08-10-09-32-46-163.png
          139 kB
          Franck Tago
        4. image-2023-08-10-09-33-47-788.png
          67 kB
          Franck Tago
        5. wholestagecodegen_wc1_debug_wholecodegen_passed
          83 kB
          Franck Tago



            Assignee: Unassigned
            Reporter: Franck Tago (tafranky@gmail.com)