We have a Spark Standalone ETL workload that depends heavily on an Apache Ignite key-value store for lookup/reference data. There are hundreds (400+) of lookup datasets, some with up to 300K records. We formerly used broadcast variables but found they were not fast enough.
So we decided to implement a caching mechanism that retrieves reference data from a JDBC source and puts it in memory through Apache Ignite as a replicated cache. Each Spark worker node also runs an Ignite node (JVM), and the Spark executors retrieve the data from Ignite through the "shared memory port". This is very fast but causes instability in the Ignite cluster: when a Spark executor JVM terminates, its Ignite data grid client is shut down abnormally. The Ignite cluster then waits for the client node (the Spark executor) to reconnect, leaving the cluster non-responsive for a while.
We need the ability to close the Ignite client node gracefully just before the executor process ends. A feature that makes it possible to pass an event handler for "executor.onStart" and "executor.exitExecutor()" would be very useful.
It could be a spark-submit argument or an entry in spark-defaults.conf that looks something like:
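One possible shape for such a setting (the property name and class are hypothetical, not an existing Spark option):

```
# in spark-defaults.conf (hypothetical property name)
spark.executor.lifecycleHandler.class  com.example.IgniteShutdownHandler
```

or equivalently via spark-submit with `--conf spark.executor.lifecycleHandler.class=com.example.IgniteShutdownHandler`.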
The class would have to implement an interface provided by Spark. It could then be loaded dynamically in the CoarseGrainedExecutorBackend and invoked from the onStart() and exitExecutor() methods, respectively.
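A minimal sketch of what such an interface and an Ignite-closing implementation could look like. The interface name and method signatures are assumptions for illustration, not an existing Spark API; the Ignite calls are shown as comments since they require an Ignite runtime.

```java
// Hypothetical interface that Spark could expose to executor lifecycle hooks.
interface ExecutorLifecycleHandler {
    void onStart(); // invoked when CoarseGrainedExecutorBackend starts
    void onExit();  // invoked just before exitExecutor() terminates the JVM
}

// Example implementation that would close the Ignite client node gracefully.
class IgniteShutdownHandler implements ExecutorLifecycleHandler {
    boolean started = false;
    boolean closed  = false;

    @Override
    public void onStart() {
        started = true;
        // A real implementation would join the grid as a client node here, e.g.:
        // Ignition.start(clientConfiguration);
    }

    @Override
    public void onExit() {
        closed = true;
        // A real implementation would leave the grid cleanly here, e.g.:
        // Ignition.ignite().close();
    }
}
```

With this in place, the Ignite cluster would see a normal client departure instead of an abnormal disconnect.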
This is also useful for opening and closing JDBC connections per executor instead of per partition.
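To illustrate the JDBC use case, here is a sketch of a per-executor resource holder (all names are illustrative, not part of any Spark or JDBC API). With the proposed hooks, get() would first be called from the start hook and close() from the exit hook, so a single java.sql.Connection is opened once per executor JVM rather than once per partition:

```java
import java.util.function.Supplier;

// Hypothetical holder for a resource shared by all tasks in one executor JVM.
class PerExecutorResource<T extends AutoCloseable> {
    private final Supplier<T> factory;
    private T resource;

    PerExecutorResource(Supplier<T> factory) {
        this.factory = factory;
    }

    // Lazily create the shared resource, e.g. () -> DriverManager.getConnection(url).
    synchronized T get() {
        if (resource == null) {
            resource = factory.get();
        }
        return resource;
    }

    // Release the resource once, just before the executor JVM exits.
    synchronized void close() throws Exception {
        if (resource != null) {
            resource.close();
            resource = null;
        }
    }
}
```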