Hadoop 3.x+ offers shaded client jars, hadoop-client-api and hadoop-client-runtime, which shade third-party dependencies such as Guava, protobuf, Jetty, etc. This JIRA switches Spark to use these jars instead of hadoop-common, hadoop-client, etc. Benefits include:
- It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava conflicts, Spark depends on Hadoop not leaking these dependencies.
- It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both client-side and server-side Hadoop APIs from modules such as hadoop-common, hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use only public/client APIs from the Hadoop side.
- It provides better isolation from Hadoop dependencies. In the future, Spark can evolve without worrying about the (previously numerous) transitive dependencies pulled in from the Hadoop side.
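For applications that want to build against the shaded clients directly, the artifacts can be declared like any other Maven dependency. A minimal sbt sketch (the 3.3.0 version number is illustrative, not prescribed by this JIRA):

```scala
// Shaded Hadoop client artifacts: hadoop-client-api carries the public
// client API; hadoop-client-runtime carries the relocated third-party
// classes (Guava, protobuf, etc.) and is only needed at runtime.
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client-api"     % "3.3.0",
  "org.apache.hadoop" % "hadoop-client-runtime" % "3.3.0" % Runtime
)
```

Because the third-party classes inside hadoop-client-runtime are relocated under a shaded package prefix, they cannot conflict with the application's own Guava or protobuf versions.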
This JIRA introduces some behavior changes for users running Spark compiled with Hadoop 3.x:
- Users now need to make sure the class path contains the `hadoop-client-api` and `hadoop-client-runtime` jars when they deploy Spark with the `hadoop-provided` option. In addition, it is highly recommended that they put these two jars before other Hadoop jars in the class path. Otherwise, conflicts (e.g., with Guava) could arise if classes are loaded from the other, non-shaded Hadoop jars.
- Since the new shaded Hadoop clients no longer include third-party dependencies, users who used to rely on these transitive dependencies now need to explicitly put the jars in their class path.
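As an illustrative sketch of the first point, using Spark's `SPARK_DIST_CLASSPATH` mechanism for `hadoop-provided` deployments (the install paths and the 3.3.0 version are hypothetical; adjust to the actual layout):

```shell
# Hypothetical Hadoop install location -- substitute your own.
HADOOP_HOME=/opt/hadoop

# Shaded client jars, listed FIRST so their relocated Guava/protobuf
# classes win the classloader lookup over any non-shaded Hadoop jars.
SHADED_JARS="${HADOOP_HOME}/share/hadoop/client/hadoop-client-api-3.3.0.jar"
SHADED_JARS="${SHADED_JARS}:${HADOOP_HOME}/share/hadoop/client/hadoop-client-runtime-3.3.0.jar"

# 'hadoop classpath' prints the full Hadoop classpath; appending it after
# the shaded jars preserves the recommended ordering.
export SPARK_DIST_CLASSPATH="${SHADED_JARS}:$(hadoop classpath)"
```

Any third-party libraries an application previously picked up transitively from Hadoop (e.g., Guava) would likewise need to be appended to the class path explicitly.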
Ideally, the above should go into the release notes.