SPARK-47764

Cleanup shuffle dependencies for Spark Connect SQL executions



Description

Shuffle dependencies are created by shuffle map stages and consist of files on disk plus the corresponding references in the Spark JVM heap. Currently Spark cleans up unused shuffle dependencies through JVM GC, with periodic GCs triggered once every 30 minutes (see ContextCleaner). However, we still found cases in which the shuffle data files grow too large, which makes shuffle data migration slow.
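
For context, the only knob on this GC-driven path today is the periodic GC interval; a minimal sketch using the existing spark.cleaner.periodicGC.interval config:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Shortening the periodic GC interval makes unreachable ShuffleDependency
// objects (and their shuffle files) eligible for cleanup sooner.
val spark = SparkSession.builder()
  .appName("shuffle-cleanup-demo")
  .config("spark.cleaner.periodicGC.interval", "5min") // default is 30min
  .getOrCreate()

// The groupBy triggers a shuffle map stage, creating a ShuffleDependency
// on the driver and shuffle files on executor disks.
val counts = spark.range(0, 1000000).groupBy(col("id") % 10).count()
counts.collect()

// The shuffle files are only deleted after the driver-side ShuffleDependency
// becomes unreachable and a JVM GC runs: ContextCleaner holds a weak
// reference to it and performs the cleanup when that reference is enqueued.
{code}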

       

We do have opportunities to clean up shuffle dependencies eagerly, especially for SQL queries created by Spark Connect, since we have better control over the DataFrame instances there. Even if DataFrame instances are reused on the client side, the instances are still recreated on the server side.
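
A rough sketch of what an eager server-side cleanup hook could look like, assuming access to Spark internals (sc.cleaner and ContextCleaner.doCleanupShuffle are private[spark], so the helper must live under the org.apache.spark package; EagerShuffleCleanup and collectShuffleIds are hypothetical names, not existing APIs):

{code:scala}
package org.apache.spark.connect.example // assumption: needs private[spark] access

import org.apache.spark.{ShuffleDependency, SparkContext}
import org.apache.spark.rdd.RDD

object EagerShuffleCleanup {
  // Once a Connect execution finishes and its server-side DataFrame is
  // discarded, release the shuffles in its RDD lineage immediately instead
  // of waiting for a periodic JVM GC.
  def cleanupShuffles(sc: SparkContext, finalRdd: RDD[_]): Unit = {
    sc.cleaner.foreach { cleaner =>
      collectShuffleIds(finalRdd).foreach { id =>
        // Unregisters the shuffle with the MapOutputTracker and removes its
        // blocks/files through the shuffle driver components.
        cleaner.doCleanupShuffle(id, blocking = false)
      }
    }
  }

  // Walk the RDD lineage and gather the shuffle ids of all shuffle map stages.
  private def collectShuffleIds(rdd: RDD[_]): Seq[Int] = {
    val visited = scala.collection.mutable.Set[RDD[_]]()
    def visit(r: RDD[_]): Seq[Int] = {
      if (!visited.add(r)) Seq.empty
      else r.dependencies.flatMap {
        case s: ShuffleDependency[_, _, _] => s.shuffleId +: visit(s.rdd)
        case d => visit(d.rdd)
      }
    }
    visit(rdd).distinct
  }
}
{code}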

       

We might also provide options to 1. clean up shuffle dependencies eagerly after each query execution, or 2. only mark the shuffle dependencies as skippable and not migrate them at node decommission, as sketched below.
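
Sketched as configuration, the two options might surface like this (both config names are illustrative assumptions, not committed Spark configs):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Option 1 (hypothetical config name): eagerly delete shuffle files once
  // each query execution finishes.
  .config("spark.sql.shuffleDependency.fileCleanup.enabled", "true")
  // Option 2 (hypothetical config name): mark shuffle dependencies as
  // disposable so decommissioning skips migrating their files.
  .config("spark.sql.shuffleDependency.skipMigration.enabled", "true")
  .getOrCreate()
{code}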

People

Assignee: Bo Zhang (bozhang)
Reporter: Bo Zhang (bozhang)