Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
When a HiveOnSpark job is kicked off, most of the work is done by the RemoteDriver, which is a separate process. There are, however, a couple of smaller places where the HS2 process depends on Spark jars, for example when receiving stats from the driver, or when putting together a Spark conf object, which is used mostly during communication with the RemoteDriver.
We could limit the data types used for this communication so that we don't use (and serialize) types from the Spark codebase, and then refactor our code so that Spark jars are only needed in the RemoteDriver process.
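A minimal sketch of the direction, assuming hypothetical names (JobConfPayload, JobStatsPayload and RemoteDriverSide are illustrations, not Hive's actual RPC types): HS2 and the RemoteDriver would exchange only plain, serializable Java types, and the RemoteDriver would be the single place that materializes Spark objects such as SparkConf.

{code:java}
// Hedged sketch: keep the HS2 <-> RemoteDriver wire types free of Spark classes.
// Class names here are hypothetical, not the existing Hive RPC API.

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

/** Spark-free conf payload sent from HS2 to the RemoteDriver. */
class JobConfPayload implements Serializable {
  final Map<String, String> settings = new HashMap<>();  // plain key/value conf entries
}

/** Spark-free stats payload sent from the RemoteDriver back to HS2. */
class JobStatsPayload implements Serializable {
  final Map<String, Long> counters = new HashMap<>();    // e.g. "executorRunTime" -> 1234L
}

/** Runs only inside the RemoteDriver; the single place that touches Spark types. */
class RemoteDriverSide {
  org.apache.spark.SparkConf toSparkConf(JobConfPayload payload) {
    org.apache.spark.SparkConf conf = new org.apache.spark.SparkConf();
    for (Map.Entry<String, String> e : payload.settings.entrySet()) {
      conf.set(e.getKey(), e.getValue());  // SparkConf is built here, never in HS2
    }
    return conf;
  }
}
{code}

With this shape, HS2 only serializes Maps of Strings and Longs over the RPC channel, so its classpath needs no Spark jars at all.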
I think this would be cleaner from a dependency point of view, and also less error-prone for users who have to assemble the classpath for their HS2 processes.
(E.g. due to a change between Spark 2.2 and 2.4 we also had to include spark-unsafe*.jar, even though it was an internal change to Spark.)