[SPARK-42537] Remove obsolete/superfluous imports in spark-hadoop-cloud module - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.4.0
Fix Version/s: None
Component/s: Build
Labels: None

Description

The explicit imports into the hadoop-cloud module are obsolete.

The hadoop-cloud-storage pom is a cut-down export of the bindings to the various cloud stores in their hadoop-* modules:
it's been shipping since hadoop 2.10
it's grown to include cos and aliyun support
it's fairly well tested
it actually drops support for a store (hadoop-openstack) when that support is withdrawn. Hadoop 3.3.5 has done this, leaving a stub jar there just to avoid breaking downstream builds like spark's current setup.

hadoop-cloud-storage should be all that's needed.

I know that the spark hadoop-2 profile still references the (long unsupported) 2.7.x releases, but if you are using those releases then you really aren't going to talk to cloud infra:

no abfs connector
s3n connector which won't authenticate with any of the aws regions launched in the past 5-8 years
gcs connector won't work (it's java11+; hadoop 3.2.x is the minimum for java11 clients)
none of the new Chinese cloud services
s3a connector is very outdated
s3a connector uses an unshaded aws client, which is unlikely to work with versions of jackson or httpclient written in the last 5 years, has trouble on java8, etc.

Proposed

hadoop-2 profile to be cut down to the minimal hadoop-aws and hadoop-azure dependencies in the code today. Cutting to the empty set would be better, but a bit more radical.
hadoop-3 profile to pull in hadoop-cloud-storage (excluding the aws sdk, as today), and nothing else.

This will simplify everyone's life as there are fewer dependencies to reconcile.
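
As a rough illustration of the hadoop-3 half of the proposal, the profile could shrink to something like the fragment below. This is only a sketch, not the actual patch: it assumes the profile lives in the hadoop-cloud module's pom, that the version comes from an existing `hadoop.version` property, and that the aws sdk is excluded through the `aws-java-sdk-bundle` artifact.

```xml
<!-- Sketch only: possible hadoop-3 profile for the hadoop-cloud module. -->
<profile>
  <id>hadoop-3</id>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-cloud-storage</artifactId>
      <!-- Assumption: hadoop.version is already defined by the parent pom -->
      <version>${hadoop.version}</version>
      <exclusions>
        <!-- Assumption: the aws sdk arrives as the shaded bundle artifact -->
        <exclusion>
          <groupId>com.amazonaws</groupId>
          <artifactId>aws-java-sdk-bundle</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
</profile>
```

With a single dependency like this, the set of cloud connectors would follow whatever the Hadoop release exports, rather than a list maintained by hand in spark.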

See also SPARK-39969, which proposes making the hadoop-aws version of the aws-sdk-bundle the normative one, as it is now newer than the spark-kinesis import and more broadly tested.

Issue Links

is related to

SPARK-43448 Remove dummy hadoop-openstack (Resolved)

relates to

SPARK-39969 Spark AWS SDK and kinesis dependencies lagging hadoop-aws and s3a (Open)

People

Assignee: Unassigned

Reporter: Steve Loughran

Votes: 1

Watchers: 3

Dates

Created: 23/Feb/23 10:17

Updated: 11/May/23 09:00