SPARK-9206: ClassCastException using HiveContext with GoogleHadoopFileSystem as fs.defaultFS


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.1
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None

    Description

      Originally reported on StackOverflow: http://stackoverflow.com/questions/31478955/googlehadoopfilesystem-cannot-be-cast-to-hadoop-filesystem

      Google's "bdutil" command-line tool (https://github.com/GoogleCloudPlatform/bdutil) is one of the main supported ways of deploying Hadoop and Spark clusters on Google Cloud Platform. Its default settings configure fs.defaultFS to use the Google Cloud Storage connector for Hadoop, and it installs the connector jarfile on top of tarball-based Hadoop and Spark distributions.
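
      For context, the resulting Hadoop configuration can be inspected from spark-shell. A minimal sketch, assuming a bdutil-default deployment (the bucket value is illustrative, not taken from this report):

      // Run inside spark-shell on the deployed cluster.
      // fs.defaultFS points at the GCS connector's scheme:
      sc.hadoopConfiguration.get("fs.defaultFS")
      // e.g. "gs://<configured-bucket>/"

      // fs.gs.impl maps the "gs" scheme to the connector's FileSystem class:
      sc.hadoopConfiguration.get("fs.gs.impl")
      // "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"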

      Starting in Spark 1.4.1, on a default bdutil-based Spark deployment, running "spark-shell" and then trying to read a file with sqlContext like:

      sqlContext.parquetFile("gs://my-bucket/my-file.parquet")
      

      results in the following:

      15/07/20 20:59:14 DEBUG IsolatedClientLoader: shared class: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
      java.lang.RuntimeException: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
      	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
      	at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:116)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
      	at org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:172)
      	at org.apache.spark.sql.hive.client.IsolatedClientLoader.<init>(IsolatedClientLoader.scala:168)
      	at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:213)
      	at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:176)
      	at org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:371)
      	at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:371)
      	at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:370)
      	at org.apache.spark.sql.hive.HiveContext$$anon$1.<init>(HiveContext.scala:383)
      	at org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:383)
      	at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:382)
      	at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
      	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
      	at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
      	at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:438)
      	at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:264)
      	at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:1099)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
      	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
      	at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
      	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
      	at $iwC$$iwC$$iwC.<init>(<console>:32)
      	at $iwC$$iwC.<init>(<console>:34)
      	at $iwC.<init>(<console>:36)
      	at <init>(<console>:38)
      	at .<init>(<console>:42)
      	at .<clinit>(<console>)
      	at .<init>(<console>:7)
      	at .<clinit>(<console>)
      	at $print(<console>)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
      	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
      	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
      	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
      	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
      	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
      	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
      	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
      	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
      	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
      	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
      	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
      	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
      	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
      	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
      	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
      	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
      	at org.apache.spark.repl.Main$.main(Main.scala:31)
      	at org.apache.spark.repl.Main.main(Main.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      Caused by: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
      	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2595)
      	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
      	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
      	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
      	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
      	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
      	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:342)
      	... 67 more
      

      This appears to be a combination of https://github.com/apache/spark/commit/9ac8393663d759860c67799e000ec072ced76493 and its related "isolated classloader" changes: the IsolatedClientLoader.isSharedClass method includes "com.google." alongside java.lang.*, java.net.*, etc., as shared classes, presumably to share Guava and possibly the protobuf and gson libraries.
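
      In sketch form, the sharing predicate looks roughly like the following (paraphrased from Spark 1.4.1's IsolatedClientLoader; the exact prefix list is approximate, not the verbatim source):

      // Approximate paraphrase of IsolatedClientLoader.isSharedClass (Spark 1.4.1).
      def isSharedClass(name: String): Boolean =
        name.startsWith("org.apache.spark.") ||
        name.startsWith("scala.") ||
        name.startsWith("com.google") ||  // meant for Guava and friends, but it
                                          // also matches com.google.cloud.hadoop.*
        name.startsWith("java.lang.") ||
        name.startsWith("java.net")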

      Unfortunately, the "com.google." prefix also matches Hadoop extension libraries such as com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem (https://github.com/GoogleCloudPlatform/bigdata-interop) and com.google.cloud.bigtable.* (https://github.com/GoogleCloudPlatform/cloud-bigtable-client).
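
      The cast then fails because the JVM keys class identity on (name, defining classloader): the same class file loaded through two unrelated loaders yields two distinct, cast-incompatible runtime classes. A minimal standalone sketch of that effect (the jar path and class name are hypothetical, for illustration only):

      import java.net.{URL, URLClassLoader}

      object LoaderIdentityDemo {
        def main(args: Array[String]): Unit = {
          // Hypothetical jar containing a plain class demo.Plugin (extends Object).
          val jar = new URL("file:/tmp/plugin.jar")

          // Two independent loaders over the same jar, sharing no parent
          // (a null parent means bootstrap classes only).
          val loaderA = new URLClassLoader(Array(jar), null)
          val loaderB = new URLClassLoader(Array(jar), null)

          val clsA = loaderA.loadClass("demo.Plugin")
          val clsB = loaderB.loadClass("demo.Plugin")

          // Same fully-qualified name, different defining loaders:
          println(clsA == clsB)                // false
          println(clsB.isAssignableFrom(clsA)) // false: a cast between the two
                                               // would throw ClassCastException
        }
      }

      Here, presumably, the shared GoogleHadoopFileSystem (defined by the root loader) carries the root loader's org.apache.hadoop.fs.FileSystem as its superclass, while the Hive session code checks against the isolated loader's copy, matching the "Caused by" frames above.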

      This can be reproduced by downloading bdutil from https://github.com/GoogleCloudPlatform/bdutil, modifying bdutil/extensions/spark/spark_env.sh to point SPARK_HADOOP2_TARBALL_URI at a Spark 1.4.1 tarball (http URIs should work as well), and then deploying a cluster with:

      ./bdutil -p <your project> -b <your GCS bucket> -z us-central1-f -e hadoop2 -e spark deploy
      ./bdutil -p <your project> -b <your GCS bucket> -z us-central1-f -e hadoop2 -e spark shell
      

      The last command opens an SSH session; then type:

      spark-shell
      > sqlContext.parquetFile("gs://your-bucket/some/path/to/parquet/file.parquet")
      

      The ClassCastException should then be thrown immediately.

      The simple fix of excluding com.google.cloud.* from the "shared classes" appears to work just fine in an end-to-end bdutil-based deployment.
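
      In sketch form, that exclusion amounts to carving the connector packages back out of the prefix match (paraphrased, not the exact patch text):

      // Hypothetical sketch of the exclusion; not the verbatim patch.
      def isSharedClass(name: String): Boolean =
        (name.startsWith("com.google") && !name.startsWith("com.google.cloud")) ||
        name.startsWith("org.apache.spark.") ||
        name.startsWith("scala.") ||
        name.startsWith("java.lang.") ||
        name.startsWith("java.net")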

            People

              Assignee: dennishuo (Dennis Huo)
              Reporter: dennishuo (Dennis Huo)
