SPARK-4796

Spark does not remove temp files

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None
    • Environment: I'm running Spark on Mesos, and the Mesos slaves are Docker containers. Spark 1.1.0, elasticsearch spark 2.1.0-Beta3, Mesos 0.20.0, Docker 1.2.0.

    Description

      I started a job whose data does not fit into memory and got "no space left on device". That was fair, because the Docker containers only have 10 GB of disk space and some of it is already taken by the OS.

      But then I found that when the job failed, it did not release any disk space and left the container with no free disk space at all.

      Then I decided to check whether Spark removes temp files at all, because many Mesos slaves had /tmp/spark-local-* directories. Apparently some garbage stays behind after a Spark task finishes. I attached strace to a running job:

      [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/12/temp_8a73fcc2-4baa-499a-8add-0161f918de8a") = 0
      [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/31/temp_47efd04b-d427-4139-8f48-3d5d421e9be4") = 0
      [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/15/temp_619a46dc-40de-43f1-a844-4db146a607c6") = 0
      [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/05/temp_d97d90a7-8bc1-4742-ba9b-41d74ea73c36" <unfinished ...>
      [pid 30212] <... unlink resumed> ) = 0
      [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/36/temp_a2deb806-714a-457a-90c8-5d9f3247a5d7") = 0
      [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/04/temp_afd558f1-2fd0-48d7-bc65-07b5f4455b22") = 0
      [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/32/temp_a7add910-8dc3-482c-baf5-09d5a187c62a" <unfinished ...>
      [pid 30212] <... unlink resumed> ) = 0
      [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/21/temp_485612f0-527f-47b0-bb8b-6016f3b9ec19") = 0
      [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/12/temp_bb2b4e06-a9dd-408e-8395-f6c5f4e2d52f") = 0
      [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/1e/temp_825293c6-9d3b-4451-9cb8-91e2abe5a19d" <unfinished ...>
      [pid 30212] <... unlink resumed> ) = 0
      [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/15/temp_43fbb94c-9163-4aa7-ab83-e7693b9f21fc") = 0
      [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/3d/temp_37f3629c-1b09-4907-b599-61b7df94b898" <unfinished ...>
      [pid 30212] <... unlink resumed> ) = 0
      [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/35/temp_d18f49f6-1fb1-4c01-a694-0ee0a72294c0") = 0
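To see at a glance which kinds of files a trace like this actually removes, the unlink() calls can be tallied by file-name prefix. The following is a minimal sketch, assuming the strace output has been saved as text; the function names `unlinked_paths` and `count_by_kind` are hypothetical helpers, not part of Spark or strace.

```python
import re
from collections import Counter

# Matches the path argument of an unlink() call in strace output, whether the
# call completed on the same line or was reported as "<unfinished ...>".
# The "<... unlink resumed>" continuation lines carry no path and are skipped.
UNLINK_RE = re.compile(r'unlink\("([^"]+)"')


def unlinked_paths(strace_text):
    """Return every path passed to unlink() in the given strace output.

    Return codes are not checked, so this lists attempted unlinks.
    """
    return UNLINK_RE.findall(strace_text)


def count_by_kind(paths):
    """Count paths by file-name prefix, e.g. 'temp' or 'shuffle'."""
    kinds = Counter()
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        kinds[name.split("_", 1)[0]] += 1
    return kinds


# Three lines copied from the trace above.
sample = '''
[pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/12/temp_8a73fcc2-4baa-499a-8add-0161f918de8a") = 0
[pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/05/temp_d97d90a7-8bc1-4742-ba9b-41d74ea73c36" <unfinished ...>
[pid 30212] <... unlink resumed> ) = 0
'''

if __name__ == "__main__":
    print(count_by_kind(unlinked_paths(sample)))
```

In the excerpt shown here every unlinked path is a temp_ file; no shuffle_ file appears in the trace.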

      After the job finished, some files were still there:

      /tmp/spark-local-20141209091330-48b5/
      /tmp/spark-local-20141209091330-48b5/11
      /tmp/spark-local-20141209091330-48b5/11/shuffle_0_1_4
      /tmp/spark-local-20141209091330-48b5/32
      /tmp/spark-local-20141209091330-48b5/04
      /tmp/spark-local-20141209091330-48b5/05
      /tmp/spark-local-20141209091330-48b5/0f
      /tmp/spark-local-20141209091330-48b5/0f/shuffle_0_1_2
      /tmp/spark-local-20141209091330-48b5/3d
      /tmp/spark-local-20141209091330-48b5/0e
      /tmp/spark-local-20141209091330-48b5/0e/shuffle_0_1_1
      /tmp/spark-local-20141209091330-48b5/15
      /tmp/spark-local-20141209091330-48b5/0d
      /tmp/spark-local-20141209091330-48b5/0d/shuffle_0_1_0
      /tmp/spark-local-20141209091330-48b5/36
      /tmp/spark-local-20141209091330-48b5/31
      /tmp/spark-local-20141209091330-48b5/12
      /tmp/spark-local-20141209091330-48b5/21
      /tmp/spark-local-20141209091330-48b5/10
      /tmp/spark-local-20141209091330-48b5/10/shuffle_0_1_3
      /tmp/spark-local-20141209091330-48b5/1e
      /tmp/spark-local-20141209091330-48b5/35

      Looking across my Mesos slaves, the leftovers are mostly "shuffle" files. The overall picture for a single node:

      root@web338:~# find /tmp/spark-local-20141* -type f | fgrep shuffle | wc -l
      781
      root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle | wc -l
      10
      root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle
      /tmp/spark-local-20141119144512-67c4/2d/temp_9056f380-3edb-48d6-a7df-d4896f1e1cc3
      /tmp/spark-local-20141119144512-67c4/3d/temp_e005659b-eddf-4a34-947f-4f63fcddf111
      /tmp/spark-local-20141119144512-67c4/16/temp_71eba702-36b4-4e1a-aebc-20d2080f1705
      /tmp/spark-local-20141119144512-67c4/0d/temp_8037b9db-2d8a-4786-a554-a8cad922bf5e
      /tmp/spark-local-20141119144512-67c4/24/temp_f0e4cc43-6cc9-42a7-882d-f8a031fa4dc3
      /tmp/spark-local-20141119144512-67c4/29/temp_a8bbe2cb-f590-4b71-8ef8-9c0324beddc7
      /tmp/spark-local-20141119144512-67c4/3a/temp_9fc08519-f23a-40ac-a3fd-e58df6871460
      /tmp/spark-local-20141119144512-67c4/1e/temp_d66668ab-2999-48af-a136-84cfd6f5f6cb
      /tmp/spark-local-20141205110922-f78e/0a/temp_7409add5-e6ff-46e5-ae3f-6a4c7b2ddf8f
      /tmp/spark-local-20141205111026-0b53/01/temp_72024c94-7512-4692-8bd1-ef2417143d8c
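The same counts can be collected per spark-local directory, which makes it easier to compare nodes. This is a rough sketch that mirrors the find | fgrep pipeline above; the function name `summarize_leftovers` and the default glob pattern `spark-local-*` are assumptions chosen for illustration.

```python
import glob
import os
from collections import Counter


def summarize_leftovers(tmp_root="/tmp", pattern="spark-local-*"):
    """Count regular files under each matching directory, split into
    'shuffle' and 'other' the same way the fgrep filters do."""
    summary = {}
    for local_dir in sorted(glob.glob(os.path.join(tmp_root, pattern))):
        counts = Counter()
        for dirpath, _dirnames, filenames in os.walk(local_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                # Like `find -type f`: regular files only, symlinks excluded.
                if os.path.islink(full) or not os.path.isfile(full):
                    continue
                # fgrep matches "shuffle" anywhere in the path, so do the same.
                kind = "shuffle" if "shuffle" in full else "other"
                counts[kind] += 1
        summary[local_dir] = dict(counts)
    return summary


if __name__ == "__main__":
    for local_dir, counts in summarize_leftovers().items():
        print(local_dir, counts)
```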

      Conclusions:

      1. Shuffle files should be removed, but they stay.
      2. Temp files should always be removed, but they stay.

      Maybe we should unlink temp and shuffle files immediately after creation, so that they are removed even if Spark fails.
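The unlink-after-create idea can be illustrated outside Spark. The sketch below is not Spark code: it assumes a POSIX system, where an unlinked file stays usable through its open handle and the kernel frees its blocks once the last handle is closed, even if the process is killed. The helper name `open_unlinked_scratch` is hypothetical.

```python
import os
import tempfile


def open_unlinked_scratch(directory=None):
    """Create a scratch file, unlink it right away, and return the open
    file object together with the path it used to have.

    Assumes POSIX semantics; on Windows, unlinking an open file fails.
    """
    fd, path = tempfile.mkstemp(prefix="temp_", dir=directory)
    try:
        os.unlink(path)  # remove the directory entry; the inode stays alive
    except OSError:
        os.close(fd)
        raise
    return os.fdopen(fd, "w+b"), path


if __name__ == "__main__":
    scratch, old_path = open_unlinked_scratch()
    scratch.write(b"spilled data")
    scratch.seek(0)
    print(scratch.read())            # data is still reachable via the handle
    print(os.path.exists(old_path))  # the name is already gone
    scratch.close()                  # disk space is released here
```

Python's own tempfile.TemporaryFile behaves this way on POSIX. The approach only suits files that a single process keeps open for their whole lifetime; files that must later be reopened by name, which is probably the case for shuffle output, would need a different cleanup mechanism.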

            People

              Assignee: Unassigned
              Reporter: Ivan Babrou (bobrik)
              Votes: 0
              Watchers: 5
