Hadoop 3, and in particular Hadoop 3.1, adds:
- Java 8 as the minimum (and currently sole) supported Java version
- A new "hadoop-cloud-storage" module, intended to be a minimal dependency POM for all the cloud connectors in the version of Hadoop being built against
- The ability to declare a committer for any FileOutputFormat which supersedes the classic FileOutputCommitter, both for a job and for a specific FS URI
- A shaded client JAR, though not yet one complete enough for Spark
- Lots of other features and fixes.
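As an illustration of the per-URI committer binding in the list above, here is a minimal sketch that passes Hadoop's per-scheme committer factory option through Spark's `spark.hadoop.*` prefix. The key and factory class are the ones Hadoop 3.1 documents for the S3A connector; `my_job.py` is a hypothetical application, and the script only prints the command rather than running it. Spark-side integration work may still be needed before every output format honours this setting.

```shell
#!/bin/sh
# Hadoop 3.1 option binding a committer factory to one filesystem scheme (s3a).
FACTORY_KEY="mapreduce.outputcommitter.factory.scheme.s3a"
FACTORY_CLASS="org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"

# Spark copies any spark.hadoop.* property into the Hadoop Configuration.
COMMITTER_CONF="spark.hadoop.${FACTORY_KEY}=${FACTORY_CLASS}"

# my_job.py is a hypothetical application; this is a dry run that prints the command.
SUBMIT_CMD="spark-submit --conf ${COMMITTER_CONF} my_job.py"
echo "${SUBMIT_CMD}"
```

Dropping the `echo` and invoking the command directly would submit the job with the binding in place.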
The basic work of building Spark with Hadoop 3 is just a matter of doing the build with -Dhadoop.version=3.x.y; however, that:
- Doesn't build with SBT (dependency resolution of the ZooKeeper JAR fails)
- Misses the new cloud features
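The plain version-override build described above can be sketched as follows. The concrete version `3.1.0` is an assumption standing in for 3.x.y, and `-DskipTests clean package` is just one common goal set; the script prints the Maven command as a dry run instead of starting a build.

```shell
#!/bin/sh
# Assumed Hadoop release; substitute the 3.x.y version actually being targeted.
HADOOP_VERSION="3.1.0"

# Spark's bundled Maven wrapper, with the Hadoop version overridden on the command line.
BUILD_CMD="./build/mvn -Dhadoop.version=${HADOOP_VERSION} -DskipTests clean package"

# Dry run: print the command. Run it from the Spark source root to do the build.
echo "${BUILD_CMD}"
```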
The ZooKeeper (ZK) dependency can be fixed everywhere by explicitly declaring the ZK artifact instead of relying on Curator to pull it in; this obviously needs a profile to declare the right ZK version.
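One way to check which ZooKeeper artifact a build resolves, before and after declaring it explicitly, is the Maven dependency plugin's tree goal filtered to ZooKeeper. This is only a sketch: the Hadoop version is again an assumed placeholder, and the command is printed rather than executed.

```shell
#!/bin/sh
# Assumed Hadoop release, standing in for 3.x.y.
HADOOP_VERSION="3.1.0"

# Restrict the dependency tree to the ZooKeeper artifact (groupId:artifactId filter).
ZK_FILTER="org.apache.zookeeper:zookeeper"
TREE_CMD="./build/mvn -Dhadoop.version=${HADOOP_VERSION} dependency:tree -Dincludes=${ZK_FILTER}"

# Dry run: print the command. The real output shows the resolved ZK version and its path.
echo "${TREE_CMD}"
```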
To use the cloud features in Spark, the hadoop-3 profile should declare that the spark-hadoop-cloud module depends on (and only on) the hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud storage, and should add a source package which is only built and tested when building against Hadoop 3.1+.
Issue Links
- is depended upon by
SPARK-18673 Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
- Resolved