Hadoop 3.x+ offers shaded client jars, hadoop-client-api and hadoop-client-runtime, which shade third-party dependencies such as Guava, protobuf, Jetty, etc. This JIRA switches Spark to use these jars instead of hadoop-common, hadoop-client, etc. Benefits include:
- It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava conflicts, Spark depends on Hadoop not leaking these dependencies.
- It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both client-side and server-side Hadoop APIs from modules such as hadoop-common, hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use only public/client APIs from the Hadoop side.
- It provides better isolation from Hadoop dependencies. In the future, Spark can evolve without worrying about the (previously numerous) transitive dependencies pulled in from the Hadoop side.
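For applications that want to build against the shaded clients directly, the artifacts can be declared like any other Maven dependency. A minimal sbt sketch (the 3.3.0 version number is illustrative, not prescribed by this JIRA):

```scala
// Shaded Hadoop client artifacts: hadoop-client-api carries the public
// client API; hadoop-client-runtime carries the relocated third-party
// classes (Guava, protobuf, etc.) and is only needed at runtime.
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client-api"     % "3.3.0",
  "org.apache.hadoop" % "hadoop-client-runtime" % "3.3.0" % Runtime
)
```

Because the third-party classes inside hadoop-client-runtime are relocated under a shaded package prefix, they cannot conflict with the application's own Guava or protobuf versions.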
This JIRA introduces some behavior changes for users running Spark compiled with Hadoop 3.x:
- Users now need to make sure the class path contains the `hadoop-client-api` and `hadoop-client-runtime` jars when they deploy Spark with the `hadoop-provided` option. In addition, it is highly recommended that they put these two jars before other Hadoop jars in the class path. Otherwise, conflicts (e.g., with Guava) could arise if classes are loaded from the other, non-shaded Hadoop jars.
- Since the new shaded Hadoop clients no longer include third-party dependencies, users who used to rely on these transitive dependencies now need to explicitly put the jars in their class path.
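As an illustrative sketch of the first point, using Spark's `SPARK_DIST_CLASSPATH` mechanism for `hadoop-provided` deployments (the install paths and the 3.3.0 version are hypothetical; adjust to the actual layout):

```shell
# Hypothetical Hadoop install location -- substitute your own.
HADOOP_HOME=/opt/hadoop

# Shaded client jars, listed FIRST so their relocated Guava/protobuf
# classes win the classloader lookup over any non-shaded Hadoop jars.
SHADED_JARS="${HADOOP_HOME}/share/hadoop/client/hadoop-client-api-3.3.0.jar"
SHADED_JARS="${SHADED_JARS}:${HADOOP_HOME}/share/hadoop/client/hadoop-client-runtime-3.3.0.jar"

# 'hadoop classpath' prints the full Hadoop classpath; appending it after
# the shaded jars preserves the recommended ordering.
export SPARK_DIST_CLASSPATH="${SHADED_JARS}:$(hadoop classpath)"
```

Any third-party libraries an application previously picked up transitively from Hadoop (e.g., Guava) would likewise need to be appended to the class path explicitly.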
Ideally, the above should go into the release notes.