Flink / FLINK-5345

IOManager failed to properly clean up temp file directory

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.3
    • Fix Version/s: 1.2.0, 1.3.0
    • Component/s: None
    • Labels:

      Description

      While testing 1.1.3 RC3, I have the following message in my log:

      2016-12-15 14:46:05,450 INFO  org.apache.flink.streaming.runtime.tasks.StreamTask           - Timer service is shutting down.
      2016-12-15 14:46:05,452 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: control events generator (29/40) (73915a232ba09e642f9dff92f8c8773a) switched from CANCELING to CANCELED.
      2016-12-15 14:46:05,452 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for Source: control events generator (29/40) (73915a232ba09e642f9dff92f8c8773a).
      2016-12-15 14:46:05,454 INFO  org.apache.flink.yarn.YarnTaskManager                         - Un-registering task and sending final execution state CANCELED to JobManager for task Source: control events generator (73915a232ba09e642f9dff92f8c8773a)
      2016-12-15 14:46:40,609 INFO  org.apache.flink.yarn.YarnTaskManagerRunner                   - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
      2016-12-15 14:46:40,611 INFO  org.apache.flink.runtime.blob.BlobCache                       - Shutting down BlobCache
      2016-12-15 14:46:40,724 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@10.240.0.34:33635] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
      2016-12-15 14:46:40,808 ERROR org.apache.flink.runtime.io.disk.iomanager.IOManager          - IOManager failed to properly clean up temp file directory: /yarn/nm/usercache/robert/appcache/application_1481291289979_0024/flink-io-f0ff3f66-b9e2-4560-881f-2ab43bc448b5
      java.lang.IllegalArgumentException: /yarn/nm/usercache/robert/appcache/application_1481291289979_0024/flink-io-f0ff3f66-b9e2-4560-881f-2ab43bc448b5/62e14e1891fe1e334c921dfd19a32a84/StreamMap_11_24/dummy_state does not exist
              at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1637)
              at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
              at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270)
              at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
              at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
              at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270)
              at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
              at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
              at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270)
              at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
              at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
              at org.apache.flink.runtime.io.disk.iomanager.IOManager.shutdown(IOManager.java:109)
              at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync.shutdown(IOManagerAsync.java:185)
              at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync$1.run(IOManagerAsync.java:105)
      
      

      This was the last message logged from that machine. I suspect two threads are trying to clean up the directories during shutdown?

        Activity

        StephanEwen Stephan Ewen added a comment -

        Anton Solovev Sorry for the confusion in this issue.

        StephanEwen Stephan Ewen added a comment -

        Fixed in

        • 1.2.0 via d1b86aab09061627d8b8c8f99b4277cc60e3dc28
        • 1.3.0 via c4626cbae074ba288e54308c40f93258e14c9667
        tonycox Anton Solovev added a comment - edited

        Stephan Ewen I see, I will assign the issue to you.

        StephanEwen Stephan Ewen added a comment -

        The utility I wrote is a bit nicer. deleteQuietly aborts the deletion on the first error (file not found) and simply swallows the exception.
        The new utility continues to remove the other files, which I think is what we want.
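
A minimal sketch of the difference described in this comment, written against plain `java.nio.file` rather than Commons IO. The class `DeletionStrategies` and both method names are hypothetical and only model the two behaviors (stop at the first failure versus keep going); they are not the Commons IO implementation and not the utility merged into Flink.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of two error-handling strategies for bulk deletion. */
final class DeletionStrategies {

    /** Stops at the first failure and swallows it; later paths are left on disk. */
    static void abortOnFirstError(List<Path> paths) {
        try {
            for (Path p : paths) {
                Files.delete(p); // throws NoSuchFileException if p is already gone
            }
        } catch (IOException ignored) {
            // Swallowed: the caller never learns that the loop ended early.
        }
    }

    /** Keeps going after a failure and returns the paths that could not be deleted. */
    static List<Path> continueOnError(List<Path> paths) {
        List<Path> failed = new ArrayList<>();
        for (Path p : paths) {
            try {
                Files.delete(p);
            } catch (IOException e) {
                failed.add(p); // remember the failure, then move on to the next path
            }
        }
        return failed;
    }
}
```

With a list such as `[first, missing, last]`, where `missing` was already removed by another thread, the first variant leaves `last` on disk, while the second removes it and reports only `missing` as failed.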

        tonycox Anton Solovev added a comment -

        Robert Metzger Stephan Ewen How about using #deleteQuietly and logging a warning if a folder can't be deleted? Because these are temporary files, the directory will in theory be deleted sooner or later by one of the other threads.

        StephanEwen Stephan Ewen added a comment -

        I have a "concurrency-safe" deletion function in a different branch. I will merge that as a base for this.

        rmetzger Robert Metzger added a comment -

        Anton Solovev What's the progress on fixing this issue?

        StephanEwen Stephan Ewen added a comment -

        I think Robert was referring to 1.1.4 RC3.
        The issue also applies to the 1.2 release branch and the master.

        tonycox Anton Solovev added a comment -

        1.1.3 RC3? I only see 1.1.3 RC2 in the GitHub branches.

        StephanEwen Stephan Ewen added a comment -

        I think that is a problem of org.apache.commons.io.FileUtils: When someone concurrently works on the directory, the delete fails.

        We should have our own utility method for recursive directory deletion that retries listing and deleting the contained files, so that it is safe against concurrent deletes by other services.
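
A minimal sketch of what such a utility could look like, assuming plain `java.nio.file` instead of Commons IO. The class name `ConcurrentSafeDelete`, the method `deleteRecursively`, and the retry bound of 10 are hypothetical; this is not the code from the commits referenced in this issue. The idea is to treat "already gone" as success and to re-list the directory when it turns out not to be empty.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.NoSuchFileException;
import java.nio.file.NotDirectoryException;
import java.nio.file.Path;

/** Hypothetical helper: recursive delete that tolerates concurrent deleters. */
final class ConcurrentSafeDelete {

    private static final int MAX_ATTEMPTS = 10; // assumed bound, chosen for illustration

    static void deleteRecursively(Path path) throws IOException {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            // Symbolic links are deleted themselves rather than followed.
            if (Files.isDirectory(path, LinkOption.NOFOLLOW_LINKS)) {
                deleteChildren(path);
            }
            try {
                // A path that another thread already removed counts as success.
                Files.deleteIfExists(path);
                return;
            } catch (DirectoryNotEmptyException e) {
                // Entries appeared or were missed since the listing: list again.
            }
        }
        throw new IOException(
                "Could not delete " + path + " after " + MAX_ATTEMPTS + " attempts");
    }

    private static void deleteChildren(Path dir) throws IOException {
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                deleteRecursively(child);
            }
        } catch (NoSuchFileException | NotDirectoryException e) {
            // The directory vanished or was replaced concurrently: nothing to list.
        } catch (DirectoryIteratorException e) {
            // Listing broke midway; the caller's retry loop lists again if needed.
        }
    }
}
```

Calling `ConcurrentSafeDelete.deleteRecursively(dir)` from two shutdown paths at once should then be harmless in this sketch: whichever call finds a file missing simply moves on instead of failing with an exception like the one in the log above.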


          People

          • Assignee: Stephan Ewen (StephanEwen)
          • Reporter: Robert Metzger (rmetzger)
          • Votes: 0
          • Watchers: 5
