Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
3.0.1
Description
Hadoop 3.x+ offers shaded client jars, hadoop-client-api and hadoop-client-runtime, which shade third-party dependencies such as Guava, protobuf, and Jetty. This Jira switches Spark to use these jars instead of hadoop-common, hadoop-client, etc. Benefits include:
- It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer versions of Hadoop have migrated to Guava 27.0+, and to resolve the resulting Guava conflicts, Spark depends on Hadoop not leaking its dependencies.
- It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both client-side and server-side Hadoop APIs from modules such as hadoop-common, hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use only the public/client API from the Hadoop side.
- It provides better isolation from Hadoop dependencies. In the future, Spark can evolve without worrying about dependencies pulled in from the Hadoop side (which used to be a lot).
This JIRA introduces some behavior changes for people who use Spark compiled with Hadoop 3.x:
- Users now need to make sure the class path contains the `hadoop-client-api` and `hadoop-client-runtime` jars when they deploy Spark with the `hadoop-provided` option. In addition, it is highly recommended that they put these two jars before other Hadoop jars in the class path. Otherwise, conflicts (for example, over Guava) can occur if classes are loaded from the other, non-shaded Hadoop jars.
- Since the new shaded Hadoop clients no longer include third-party dependencies, users who used to depend on these now need to explicitly put the corresponding jars in their class path.
Ideally the above should go to release notes.
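To illustrate the class path ordering recommended above, here is a minimal Python sketch that moves the two shaded client jars to the front of a list of class path entries. It is not part of Spark or Hadoop: the helper names (`is_shaded_client`, `order_classpath`, `build_classpath`) and the jar paths are hypothetical, and it assumes each entry is an explicit jar path rather than a wildcard such as `/some/dir/*`.

```python
import os

# Jar-name prefixes of the shaded Hadoop clients named in the description.
SHADED_PREFIXES = ("hadoop-client-api", "hadoop-client-runtime")


def is_shaded_client(path):
    """Return True if the file name looks like a shaded Hadoop client jar."""
    return os.path.basename(path).startswith(SHADED_PREFIXES)


def order_classpath(entries):
    """Put shaded client jars first; keep the relative order of everything else."""
    shaded = [e for e in entries if is_shaded_client(e)]
    others = [e for e in entries if not is_shaded_client(e)]
    return shaded + others


def build_classpath(entries, sep=os.pathsep):
    """Join the reordered entries into a single class path string."""
    return sep.join(order_classpath(entries))


if __name__ == "__main__":
    # Hypothetical jar locations, for illustration only.
    entries = [
        "/opt/hadoop/share/hadoop-common-3.2.2.jar",
        "/opt/hadoop/share/hadoop-client-runtime-3.2.2.jar",
        "/opt/hadoop/share/guava-27.0-jre.jar",
        "/opt/hadoop/share/hadoop-client-api-3.2.2.jar",
    ]
    print(build_classpath(entries, sep=":"))
```

The resulting string could then be supplied wherever the deployment sets the class path for a `hadoop-provided` build (for example, through `SPARK_DIST_CLASSPATH`), so that classes resolve from the shaded jars before any non-shaded Hadoop jars.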
Issue Links
- causes
  - SPARK-33618 hadoop-aws doesn't work (Resolved)
  - SPARK-36835 Spark 3.2.0 POMs are no longer "dependency reduced" (Resolved)
  - SPARK-36873 Add provided Guava dependency for network-yarn module (Resolved)
- depends upon
  - HADOOP-11804 Shaded Hadoop client artifacts and minicluster (Resolved)
- is related to
  - SPARK-34487 K8s integration test should use the runtime Hadoop Version (Resolved)
  - SPARK-35323 Remove unused libraries from LICENSE-binary (Resolved)
  - ORC-1430 Use Hadoop 3.3.5 shaded clients (Closed)
- is required by
  - SPARK-29250 Upgrade to Hadoop 3.3.1 (Resolved)
- relates to
  - HBASE-28213 Evaluate using hbase-shaded-client-byo-hadoop for Spark connector (Resolved)
  - SPARK-35959 Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions (Resolved)
- links to