SPARK-9206: ClassCastException using HiveContext with GoogleHadoopFileSystem as fs.defaultFS


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.1
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None

    Description

      Originally reported on StackOverflow: http://stackoverflow.com/questions/31478955/googlehadoopfilesystem-cannot-be-cast-to-hadoop-filesystem

      Google's "bdutil" command-line tool (https://github.com/GoogleCloudPlatform/bdutil) is one of the main supported ways of deploying Hadoop and Spark clusters on Google Cloud Platform. Its default settings configure fs.defaultFS to use the Google Cloud Storage connector for Hadoop, and it installs the connector jarfile on top of tarball-based Hadoop and Spark distributions.
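
      For context, the resulting Hadoop configuration can be inspected from spark-shell. A minimal sketch, assuming a bdutil-default deployment (the bucket value is illustrative, not taken from this report):

      // Run inside spark-shell on the deployed cluster.
      // fs.defaultFS points at the GCS connector's scheme:
      sc.hadoopConfiguration.get("fs.defaultFS")
      // e.g. "gs://<configured-bucket>/"

      // fs.gs.impl maps the "gs" scheme to the connector's FileSystem class:
      sc.hadoopConfiguration.get("fs.gs.impl")
      // "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"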

      Starting in Spark 1.4.1, on a default bdutil-based Spark deployment, running "spark-shell" and then trying to read a file with sqlContext like:

      sqlContext.parquetFile("gs://my-bucket/my-file.parquet")
      

      results in the following:

      15/07/20 20:59:14 DEBUG IsolatedClientLoader: shared class: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
      java.lang.RuntimeException: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
      	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
      	at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:116)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
      	at org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:172)
      	at org.apache.spark.sql.hive.client.IsolatedClientLoader.<init>(IsolatedClientLoader.scala:168)
      	at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:213)
      	at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:176)
      	at org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:371)
      	at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:371)
      	at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:370)
      	at org.apache.spark.sql.hive.HiveContext$$anon$1.<init>(HiveContext.scala:383)
      	at org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:383)
      	at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:382)
      	at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
      	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
      	at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
      	at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:438)
      	at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:264)
      	at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:1099)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
      	at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
      	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
      	at $iwC$$iwC$$iwC.<init>(<console>:32)
      	at $iwC$$iwC.<init>(<console>:34)
      	at $iwC.<init>(<console>:36)
      	at <init>(<console>:38)
      	at .<init>(<console>:42)
      	at .<clinit>(<console>)
      	at .<init>(<console>:7)
      	at .<clinit>(<console>)
      	at $print(<console>)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
      	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
      	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
      	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
      	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
      	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
      	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
      	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
      	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
      	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
      	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
      	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
      	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
      	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
      	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
      	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
      	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
      	at org.apache.spark.repl.Main$.main(Main.scala:31)
      	at org.apache.spark.repl.Main.main(Main.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      Caused by: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
      	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2595)
      	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
      	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
      	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
      	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
      	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
      	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:342)
      	... 67 more
      

      This appears to be a combination of https://github.com/apache/spark/commit/9ac8393663d759860c67799e000ec072ced76493 and its related "isolated classloader" changes: the IsolatedClientLoader.isSharedClass method includes "com.google." alongside java.lang.*, java.net.*, etc., as shared classes, presumably to share Guava and possibly the protobuf and gson libraries.
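
      In sketch form, the sharing predicate looks roughly like the following (paraphrased from Spark 1.4.1's IsolatedClientLoader; the exact prefix list is approximate, not the verbatim source):

      // Approximate paraphrase of IsolatedClientLoader.isSharedClass (Spark 1.4.1).
      def isSharedClass(name: String): Boolean =
        name.startsWith("org.apache.spark.") ||
        name.startsWith("scala.") ||
        name.startsWith("com.google") ||  // meant for Guava and friends, but it
                                          // also matches com.google.cloud.hadoop.*
        name.startsWith("java.lang.") ||
        name.startsWith("java.net")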

      Unfortunately, the "com.google." prefix also matches Hadoop extension libraries such as com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem (https://github.com/GoogleCloudPlatform/bigdata-interop) and com.google.cloud.bigtable.* (https://github.com/GoogleCloudPlatform/cloud-bigtable-client).
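
      The cast then fails because the JVM keys class identity on (name, defining classloader): the same class file loaded through two unrelated loaders yields two distinct, cast-incompatible runtime classes. A minimal standalone sketch of that effect (the jar path and class name are hypothetical, for illustration only):

      import java.net.{URL, URLClassLoader}

      object LoaderIdentityDemo {
        def main(args: Array[String]): Unit = {
          // Hypothetical jar containing a plain class demo.Plugin (extends Object).
          val jar = new URL("file:/tmp/plugin.jar")

          // Two independent loaders over the same jar, sharing no parent
          // (a null parent means bootstrap classes only).
          val loaderA = new URLClassLoader(Array(jar), null)
          val loaderB = new URLClassLoader(Array(jar), null)

          val clsA = loaderA.loadClass("demo.Plugin")
          val clsB = loaderB.loadClass("demo.Plugin")

          // Same fully-qualified name, different defining loaders:
          println(clsA == clsB)                // false
          println(clsB.isAssignableFrom(clsA)) // false: a cast between the two
                                               // would throw ClassCastException
        }
      }

      Here, presumably, the shared GoogleHadoopFileSystem (defined by the root loader) carries the root loader's org.apache.hadoop.fs.FileSystem as its superclass, while the Hive session code checks against the isolated loader's copy, matching the "Caused by" frames above.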

      This can be reproduced by downloading bdutil from https://github.com/GoogleCloudPlatform/bdutil, modifying bdutil/extensions/spark/spark_env.sh to point SPARK_HADOOP2_TARBALL_URI at a Spark 1.4.1 tarball (http URIs should work as well), and then deploying a cluster with:

      ./bdutil -p <your project> -b <your GCS bucket> -z us-central1-f -e hadoop2 -e spark deploy
      ./bdutil -p <your project> -b <your GCS bucket> -z us-central1-f -e hadoop2 -e spark shell
      

      The last command opens an SSH session; then type:

      spark-shell
      > sqlContext.parquetFile("gs://your-bucket/some/path/to/parquet/file.parquet")
      

      The ClassCastException should then be thrown immediately.

      The simple fix of excluding com.google.cloud.* from the "shared classes" appears to work just fine in an end-to-end bdutil-based deployment.
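
      In sketch form, that exclusion amounts to carving the connector packages back out of the prefix match (paraphrased, not the exact patch text):

      // Hypothetical sketch of the exclusion; not the verbatim patch.
      def isSharedClass(name: String): Boolean =
        (name.startsWith("com.google") && !name.startsWith("com.google.cloud")) ||
        name.startsWith("org.apache.spark.") ||
        name.startsWith("scala.") ||
        name.startsWith("java.lang.") ||
        name.startsWith("java.net")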

            People

              Assignee: dennishuo (Dennis Huo)
              Reporter: dennishuo (Dennis Huo)
