Details
- Type: Improvement
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version: 3.4.0
- Fix Version: None
- Component: None
Description
The explicit imports into hadoop-cloud are obsolete:
- the hadoop-cloud-storage POM is a cut-down export of the bindings to the various cloud stores in their hadoop-* modules
- it has been shipping since Hadoop 2.10
- it has grown to include COS and Aliyun support
- it is fairly well tested
- it actually drops connectors when support is withdrawn: Hadoop 3.3.5 removed hadoop-openstack, leaving a stub JAR behind just to avoid breaking downstream builds such as Spark's current setup.

hadoop-cloud-storage should be all that's needed.
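For illustration, consuming the aggregated module from a downstream build would look something like this (the `hadoop.version` property name is an assumption; use whatever version property the build already defines):

```xml
<!-- Sketch: a single dependency on the aggregated cloud-storage module,
     replacing per-store imports of hadoop-aws, hadoop-azure, etc.
     ${hadoop.version} is illustrative, not a property Spark necessarily uses. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-cloud-storage</artifactId>
  <version>${hadoop.version}</version>
</dependency>
```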
I know that the Spark hadoop-2 profile still references the (long unsupported) 2.7.x releases, but if you are using those releases you really aren't going to talk to cloud infrastructure:
- no ABFS connector
- the s3n connector won't authenticate with any AWS region launched in the past 5-8 years
- the GCS connector won't work (it's Java 11+; Hadoop 3.2.x is the minimum for Java 11 clients)
- none of the new Chinese cloud services
- the s3a connector is very outdated
- the s3a connector uses an unshaded AWS client, which is unlikely to work with versions of Jackson or httpclient written in the last 5 years, has trouble on Java 8, etc.
Proposed
- the hadoop-2 profile keeps only the minimal hadoop-aws and hadoop-azure dependencies in the code today; cutting to the empty set would be better, but a bit more radical
- the hadoop-3 profile pulls in hadoop-cloud-storage (excluding the AWS SDK, as today), and nothing else

This will simplify everyone's life, as there are fewer dependencies to reconcile.
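A minimal sketch of what the hadoop-3 profile could look like, assuming the existing pattern of excluding the AWS SDK bundle (the profile id and coordinates follow common conventions and are illustrative, not copied from Spark's actual pom):

```xml
<!-- Sketch of the proposed hadoop-3 profile: one dependency on
     hadoop-cloud-storage, with the AWS SDK bundle excluded as today.
     Version property and exclusion coordinates are assumptions. -->
<profile>
  <id>hadoop-3</id>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-cloud-storage</artifactId>
      <version>${hadoop.version}</version>
      <exclusions>
        <exclusion>
          <groupId>com.amazonaws</groupId>
          <artifactId>aws-java-sdk-bundle</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
</profile>
```

Everything the various hadoop-* store modules pull in transitively then arrives through one coordinate, which is what makes reconciliation simpler.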
See also SPARK-39969, which proposes making the hadoop-aws version of the aws-sdk-bundle the normative one, as it is now newer than the spark-kinesis import and more broadly tested.
Attachments
Issue Links
- is related to: SPARK-43448 Remove dummy hadoop-openstack (Resolved)
- relates to: SPARK-39969 Spark AWS SDK and kinesis dependencies lagging hadoop-aws and s3a (Open)