Reopening doesn't do anything by itself, and it doesn't cause anyone to consider this. If this just sits for another year, it will have been a tiny part of a larger problem. I would ask those who want to keep this open to advance the discussion, or else I think you'd agree it should eventually be closed. (Here I'm really speaking about the hundreds of issues like this one, not so much this one in particular.)
Part of the problem is that I don't think the details of this feature request were ever elaborated. I think that if you dig into what it would mean, you'd find that a) it's kind of tricky to define and then implement all the right semantics, and b) almost any use case along these lines is, in my experience, resolved as I suggest, with a simple per-JVM initialization. If the response lately here is, well, we're not quite sure how that works, then we need to get to the bottom of that, not just insist that an issue stay open.
To your points:
- The executor is going to load user code into one classloader, so we do have that an executor = JVM = classloader.
- You can fail things as fast as you like by invoking this init as early as you like in your app.
- It's clear where things execute, or else we must assume app developers understand this, because otherwise all bets are off. The driver program executes things in the driver unless they're part of a distributed operation like map(), which clearly executes on the executors.
These IMHO aren't reasons to design a new, different, bespoke mechanism. That has a cost too, if you're positing that it's hard to understand when things run where.
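To make "simple per-JVM initialization" concrete, here is a minimal sketch in plain Java. `ExecutorInit`, `initCount` and the `"ready"` value are hypothetical names for illustration, not anything in Spark; the only thing relied on is the standard holder idiom, where the JVM runs a static initializer lazily and at most once per classloader.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-JVM initializer; not part of Spark.
class ExecutorInit {
    // Counts how often the init body ran; only here to make the behavior observable.
    static final AtomicInteger initCount = new AtomicInteger();

    // Holder idiom: the JVM initializes this nested class on first access and
    // guarantees the static initializer runs once per classloader, thread-safely.
    private static final class Holder {
        static final String RESOURCE = init();
    }

    // Stand-in for the real setup work (open a connection pool, load a native library, ...).
    private static String init() {
        initCount.incrementAndGet();
        return "ready";
    }

    // Every task calls this first; only the first call in a JVM pays for init().
    static String get() {
        return Holder.RESOURCE;
    }
}
```

In a Spark app the call to `ExecutorInit.get()` would presumably sit at the top of the function passed to an operation like `mapPartitions`, so whichever executor picks up a task initializes itself on first use, and if the init throws, the task fails right there.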
The one catch I see is that, by design, we don't control which tasks run on which executors, so we can't guarantee init code runs on all executors this way. But is it meaningful to initialize an executor that never sees an app's tasks? It can't be. Lazy init is a good thing and compatible with the Spark model. If startup time is an issue (and I'm still not clear on the latency problem mentioned above), then it gets a little more complicated, but that's also a little more niche: just run a dummy mapPartitions at the outset on the same data that the first job would touch, even asynchronously alongside other driver activities if you like. No need to wait; it just gives the init a head start on the executors that will need it straight away.
That's just my opinion of course, but I think those are the questions that would need to be answered to argue that something should happen here.
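The head-start idea could look roughly like this, again in plain Java so it stands alone. `WarmUp.startAsync` is a hypothetical helper, and the `Runnable` is a placeholder for a real Spark job. One detail worth keeping in mind: `mapPartitions` by itself is lazy, so the dummy job needs an action (such as `count()`), or a `foreachPartition`, to actually reach the executors.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical driver-side helper; not part of Spark.
class WarmUp {
    // Starts the warm-up job on a background thread and returns immediately,
    // so the driver can carry on with its other setup work.
    // In a Spark app the Runnable would wrap a cheap job over the first job's
    // input, for example (assuming an existing JavaRDD named rdd and some
    // per-JVM init call, here a hypothetical MyInit.ensureInitialized()):
    //   () -> rdd.foreachPartition(it -> MyInit.ensureInitialized())
    static CompletableFuture<Void> startAsync(Runnable warmUpJob) {
        return CompletableFuture.runAsync(warmUpJob);
    }
}
```

The driver doesn't have to wait on the returned future; the first real job simply finds the executors that ran the warm-up already initialized. Calling `join()` on it only makes sense if you want warm-up failures to surface before the real work starts.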