DiskBlockManager has a notion of "scratch" local folder(s), which can be configured via the spark.local.dir option and which default to the system's /tmp. The hierarchy has two levels, e.g. /blockmgr-XXX.../YY, where the YY part is derived from a hash of the file name, so that files are spread evenly across subdirectories.
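The two-level layout can be illustrated with a minimal sketch. This is not Spark's code: the function name scratch_path, the use of CRC32 as the hash, and the default of 64 subdirectories are all assumptions made for the example. It only shows how a hash of the file name can pick the YY subdirectory under a blockmgr-style root.

```python
import os
import zlib


def scratch_path(root: str, filename: str, sub_dirs: int = 64) -> str:
    """Map a file name to <root>/<YY>/<filename> (illustrative only)."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    bucket = zlib.crc32(filename.encode("utf-8")) % sub_dirs
    # Two hex digits stand in for the "YY" component of the path.
    return os.path.join(root, f"{bucket:02x}", filename)


# Hypothetical root and file names, used only for demonstration.
for name in ("block_a", "block_b", "block_c"):
    print(scratch_path("blockmgr-demo", name))
```

The same file name always maps to the same subdirectory, and different names tend to land in different ones, which is the "spread files evenly" property described above.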
The function DiskBlockManager.getFile expects the top-level directories (blockmgr-XXX...) to always exist (they are created once, when the Spark context is first created); otherwise it fails with a message like:
However, this may not always be the case.
In particular, when the scratch location is the default /tmp folder, the OS may remove files from it automatically, using different strategies depending on the system:
- at boot time
- on a regular basis (e.g. once per day via a system cron job)
- based on the file age
The symptom is that after the process (in our case, a service) using Spark has been running for a while (a few days), it may no longer be able to load files, because the top-level scratch directories are gone and DiskBlockManager.getFile crashes.
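The failure mode can be reproduced outside Spark with a small simulation. The sketch below is an illustration only: get_file_naive and get_file_defensive are hypothetical helpers, not Spark functions, and the "0a" subdirectory name is made up. The naive variant assumes the top-level directory still exists and creates only the second level. The defensive variant recreates the whole path on demand.

```python
import os
import shutil
import tempfile


def get_file_naive(root, sub_dir, name):
    """Assumes root already exists; creates only the second-level directory."""
    d = os.path.join(root, sub_dir)
    if not os.path.isdir(d):
        os.mkdir(d)  # raises FileNotFoundError if root has been removed
    return os.path.join(d, name)


def get_file_defensive(root, sub_dir, name):
    """Recreates missing parent directories, including root itself."""
    d = os.path.join(root, sub_dir)
    os.makedirs(d, exist_ok=True)
    return os.path.join(d, name)


# Create a top-level scratch directory, then delete it to stand in for
# an OS-level cleanup of the temporary folder.
root = tempfile.mkdtemp(prefix="blockmgr-")
shutil.rmtree(root)

try:
    get_file_naive(root, "0a", "block")
    naive_failed = False
except FileNotFoundError:
    naive_failed = True

path = get_file_defensive(root, "0a", "block")
recreated = os.path.isdir(os.path.dirname(path))
print("naive failed:", naive_failed, "| defensive recreated dir:", recreated)

shutil.rmtree(root)  # clean up the demo directory
```

Recreating the directory only keeps new writes working. Any files that were removed together with the directory are still lost, so this shows the failure mode and is not a complete remedy.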
Please note that this is different from people arbitrarily removing files manually.
Two facts combine here: /tmp is the default in the Spark config, and the system has the right to tamper with its contents and, given enough time, will very likely do so.