Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
Livy's mechanism for loading third-party jars into the interpreter is incorrect, especially when a third-party jar contains a class that conflicts with one already on the classpath.
By third-party jars, I mean the jars you supply when creating a session:

```json
{"name": "session-name", "kind": "spark", "jars": ["hdfs://path/to/jar/1.jar"]}
```
Now when there is a conflict (the scenario where the jar is a fat jar that bundles, e.g., some older Hadoop libs or older Jackson libs), we run into a problem, and the problem manifests in a weird way.
Let's say your jar has a class named `a.b.c.SomeClass`.
This is what I have observed:
- The create-session call goes through.
- You are not able to import anything from your jar: running code like `import a.b.c.SomeClass` fails with `error: object b is not a member of package a`.
- But you are able to load classes from the jar by running code like `Thread.currentThread.getContextClassLoader.loadClass("a.b.c.SomeClass")` (see the sketch after this list).
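To make the symptom concrete, here is a minimal sketch of two statements run inside such a session (the jar path and the class name `a.b.c.SomeClass` are the placeholders from above):

```scala
// Run inside a Livy session created with "jars": ["hdfs://path/to/jar/1.jar"].

// Fails at compile time inside the REPL:
//   error: object b is not a member of package a
import a.b.c.SomeClass

// Succeeds: the jar is visible to the context classloader,
// just not to the REPL compiler's classpath.
val cls = Thread.currentThread.getContextClassLoader
  .loadClass("a.b.c.SomeClass")
```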
Essentially the classloaders are out of sync: you can load a class by reflection, but the REPL has no idea the class is on the classpath.
I have seen more reports of this problem on JIRA, Google Groups, Stack Overflow, etc. To mention a few:
- https://stackoverflow.com/questions/65654752/getting-import-error-while-executing-statements-via-livy-sessions-with-emr
- https://community.cloudera.com/t5/Support-Questions/How-to-import-External-Libraries-for-Livy-Interpreter-using/td-p/171812
- https://community.cloudera.com/t5/Support-Questions/Livy-Spark-Rest-Jar-submission-interactive-session/td-p/302924
- https://groups.google.com/a/cloudera.org/g/hue-user/c/wR6d7gR_Avs
- https://community.cloudera.com/t5/Community-Articles/Added-external-package-to-livy-causes-quot-console-25-quot/ta-p/245802
- https://issues.apache.org/jira/browse/LIVY-857
There is no definitive answer in any of them. People have suggested these things:
1. Adding the jar to the Livy installation, inside `repl-jars`.
2. Adding the jar to Livy's `rsc-jars`.
3. Adding the jar to the Hadoop installation on all nodes and using Spark.
4. Using `packages` (`group:artifact:version`) instead of `jars`.
We tried all of these; the first two did not work for us, the third did. But the third mechanism is not ideal, because it treats a third-party jar as a library jar (on par with the Hadoop/Spark jars), which is not always feasible on production systems.
The fourth mechanism is not always feasible either, as Livy only lets you specify the packages, not their repository locations.
Now, digging deeper, we figured out the cause and a potential solution.
Livy uses the Scala interpreter under the hood. The relevant classes are [ILoop](https://github.com/scala/scala/blob/a05d71a1ea33b265015794f71d12020d3f7ddd1f/src/repl/scala/tools/nsc/interpreter/ILoop.scala#L646-L701) and [IMain](https://github.com/scala/scala/blob/a05d71a1ea33b265015794f71d12020d3f7ddd1f/src/repl/scala/tools/nsc/interpreter/IMain.scala#L251).
If you look at the first link, you will see two methods in `ILoop`, both of which are wrappers around `intp.addUrlsToClassPath`. The first wrapper, `addClasspath`, is deprecated; the second, `require`, is the recommended one. The `require` method does extra checks on the jar before actually calling `intp.addUrlsToClassPath`. The checks are purely about class conflicts: if any class in the required jar conflicts with an already loaded class, the jar is not loaded. The Scala REPL's classpath is a bit fragile and does not allow the same class to be defined in multiple jars, so the REPL works around the issue by exposing the `require` interface on the command line. By using `require`, the user learns what the conflict is and can take corrective action.
If we bypass `require` (which is what happens in Livy's REPL code), we end up in the state described above, where you can load classes through reflection but cannot import them. So essentially *you tried to add a jar, and it silently did not get loaded into the Livy session*. The sketch below shows the kind of check `require` performs.
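For illustration, here is a minimal sketch of the kind of conflict check `require` performs before touching the classpath. This is not the actual Scala REPL source; the helper name `conflictingClasses` is ours, and the real implementation compares against the REPL's own classpath rather than a plain classloader:

```scala
import java.util.jar.JarFile
import scala.collection.JavaConverters._

// Lists the classes in jarPath that the given loader can already resolve,
// i.e. the ones that would conflict with already loaded classes.
def conflictingClasses(jarPath: String, loader: ClassLoader): List[String] = {
  val jar = new JarFile(jarPath)
  try {
    jar.entries.asScala
      .map(_.getName)
      .filter(_.endsWith(".class"))
      .map(_.stripSuffix(".class").replace('/', '.'))
      .filter { name =>
        try { loader.loadClass(name); true }
        catch { case _: ClassNotFoundException | _: NoClassDefFoundError => false }
      }
      .toList
  } finally jar.close()
}
```

A jar for which this returns a non-empty list is exactly the kind of fat jar that triggers the behaviour described above.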
Now, to anyone looking for a workaround: clean up your jar and make sure it has as few conflicts as possible with the Hadoop/Scala/Spark libraries. If your library depends on these, mark them as `provided` and do not bundle them in your jar.
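In sbt, that looks something like the following (a hypothetical build snippet; the artifact names and versions are illustrative, match them to what your cluster ships):

```scala
// Compile against Spark/Hadoop, but keep them out of the fat jar
// so they cannot conflict with the classes already in the REPL.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"    % "2.4.8" % Provided,
  "org.apache.hadoop" % "hadoop-client" % "2.7.3" % Provided
)
```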
The fix in the Livy codebase should be to improve its error reporting. If Livy started using the `require` API, the user would see an error in the session logs and/or statement output instead of a silent failure.
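Sketching the shape such a fix could take (hypothetical method names, not Livy's actual code; it reuses the `conflictingClasses` helper sketched above):

```scala
// Hypothetical: run a require-style conflict check and fail loudly,
// instead of bypassing it and leaving the jar silently unloaded.
def addUserJar(jarPath: String): Unit = {
  val loader = Thread.currentThread.getContextClassLoader
  val conflicts = conflictingClasses(jarPath, loader)
  if (conflicts.nonEmpty)
    sys.error(s"Not adding $jarPath, conflicting classes: " +
      conflicts.take(5).mkString(", "))
  // otherwise hand the jar to the interpreter, e.g.
  // intp.addUrlsToClassPath(new java.io.File(jarPath).toURI.toURL)
}
```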