The basic interface can just be a marker trait, as that allows a plugin to monitor general characteristics of the JVM (e.g. monitor memory or take thread dumps). Optionally, we could include methods for task start and end events. This would allow more control over monitoring: e.g., you could start polling thread dumps only if a task from a particular stage had been running for too long. But for anything task-related it is trickier to decide on the right API. Should the task end event also get the failure reason? Should those events get called in the same thread as the task runner, or in another thread?
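To make the two options concrete, here is a rough sketch of what a marker interface plus an optional task-aware extension might look like. All names here (ExecutorPlugin, TaskAwareExecutorPlugin, StageWatcher, the method signatures) are hypothetical and only for illustration; they are not an existing Spark API, and the failure-reason parameter is just one possible answer to the open question above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical names throughout; this is a sketch, not an existing Spark API.

/** Marker interface: enough for JVM-wide monitoring (memory, thread dumps). */
interface ExecutorPlugin {}

/** Optional extension for plugins that also want task start/end events. */
interface TaskAwareExecutorPlugin extends ExecutorPlugin {
    void onTaskStart(int stageId, long taskId);

    // Assumption: failureReason is null when the task succeeded.
    void onTaskEnd(int stageId, long taskId, String failureReason);
}

/** Example plugin that only reacts to tasks from one stage. */
class StageWatcher implements TaskAwareExecutorPlugin {
    private final int watchedStage;
    // Synchronized because callbacks could arrive from several task threads.
    final List<String> events = Collections.synchronizedList(new ArrayList<>());

    StageWatcher(int watchedStage) {
        this.watchedStage = watchedStage;
    }

    @Override
    public void onTaskStart(int stageId, long taskId) {
        if (stageId == watchedStage) {
            events.add("start:" + taskId);
        }
    }

    @Override
    public void onTaskEnd(int stageId, long taskId, String failureReason) {
        if (stageId == watchedStage) {
            String outcome = (failureReason == null) ? "ok" : "failed";
            events.add("end:" + taskId + ":" + outcome);
        }
    }
}
```

A plugin that only cares about the JVM would implement just the marker interface, while something like StageWatcher could be the trigger for starting thread-dump polling on a slow stage.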
The ask is to add exactly that. I've put up a draft PR in our fork of Spark and I'm happy to push it upstream. I'm also happy to receive comments on what the right interface to expose is; I'm not opinionated on that front and tried to expose the simplest interface for now.
The main reason for this ask is to propagate tracing information from the driver to the executors (SPARK-21962 has some context). On HADOOP-15566 I see we're discussing how to add tracing to the Apache ecosystem, but my problem is slightly different: I want to use this interface to propagate tracing information to my framework of choice. If the Hadoop issue gets solved we'll have a framework to communicate tracing information inside the Apache ecosystem, but it's highly unlikely that all Spark users will use the same common framework. Therefore we should still provide plugin interfaces where the tracing information can be propagated appropriately.
To give more color: in our case the tracing information is stored in a thread local, so it needs to be set in the same thread that is executing the task. [*]
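As a small illustration of why the thread matters, the sketch below uses a plain java.lang.ThreadLocal. TraceContext and ThreadLocalDemo are hypothetical stand-ins for our tracing framework, not its real API. A value set on one thread is not visible from another, so a task-start callback invoked off the task thread could not populate the context the task code reads.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for a tracing framework that keeps state in a thread local.
class TraceContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static void set(String traceId) { CURRENT.set(traceId); }
    static String get() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }
}

class ThreadLocalDemo {
    /** Sets the trace id on the calling thread, then reads it from a different thread. */
    static String readFromOtherThread(String traceId) {
        TraceContext.set(traceId);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> read = TraceContext::get;
            // The pool thread has its own (empty) thread-local slot.
            return pool.submit(read).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
            TraceContext.clear();
        }
    }

    /** Sets and reads the trace id on the same thread. */
    static String readFromSameThread(String traceId) {
        TraceContext.set(traceId);
        try {
            return TraceContext.get();
        } finally {
            TraceContext.clear();
        }
    }
}
```

Here readFromOtherThread returns null while readFromSameThread returns the id that was set, which is the behavior that pushes the callbacks onto the task runner thread.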
While our framework is specific, I imagine such an interface could be useful in general. Happy to hear your thoughts about it.
[*] Something I did not mention is how to propagate the tracing information from the driver to the executors. For that I intend to use (1) the driver's localProperties, which (2) will eventually be propagated to the executors' TaskContext, which (3) I'll be able to access from the methods above.
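The three steps can be modeled roughly as follows. This is a self-contained simulation using java.util.Properties, not Spark code: in a real job I would expect step 1 to be SparkContext.setLocalProperty on the driver and step 3 to be TaskContext.getLocalProperty on the executor, while the property key, the callback names and the shipToExecutor helper are all made up for illustration.

```java
import java.util.Properties;

// Simulation only: plain Properties stand in for Spark's local properties.
class PropagationSketch {
    // Hypothetical property key.
    static final String TRACE_KEY = "tracing.traceId";
    // Stand-in for the tracing framework's thread-local state.
    static final ThreadLocal<String> TRACE = new ThreadLocal<>();

    // Step 1 (driver): record the trace id as a local property.
    static Properties driverLocalProperties(String traceId) {
        Properties props = new Properties();
        props.setProperty(TRACE_KEY, traceId);
        return props;
    }

    // Step 2: stand-in for the properties travelling with the task; here just a copy.
    static Properties shipToExecutor(Properties driverProps) {
        Properties copy = new Properties();
        copy.putAll(driverProps);
        return copy;
    }

    // Step 3 (executor, on the task thread): hypothetical plugin callback
    // that moves the property into the thread local.
    static void onTaskStart(Properties taskProps) {
        String traceId = taskProps.getProperty(TRACE_KEY);
        if (traceId != null) {
            TRACE.set(traceId);
        }
    }

    // Hypothetical end callback: clear the slot so pooled threads do not leak ids.
    static void onTaskEnd() {
        TRACE.remove();
    }

    /** Runs the three steps on the calling thread and returns what the task body would see. */
    static String runTask(String traceId) {
        Properties taskProps = shipToExecutor(driverLocalProperties(traceId));
        onTaskStart(taskProps);
        try {
            return TRACE.get();
        } finally {
            onTaskEnd();
        }
    }
}
```

Clearing the thread local in the end callback matters because executor threads are reused across tasks, so a stale id could otherwise show up in an unrelated task.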