SPARK-43744: Spark Connect scala UDF serialization pulling in unrelated classes not available on server


    Description

      https://github.com/apache/spark/pull/41487 moved the "interrupt all - background queries, foreground interrupt" and "interrupt all - foreground queries, background interrupt" tests from ClientE2ETestSuite into a new, isolated suite, SparkSessionE2ESuite, to avoid an inexplicable UDF serialization issue.

       

      When these tests are moved back to ClientE2ETestSuite and run with

      build/mvn clean install -DskipTests -Phive
      build/mvn test -pl connector/connect/client/jvm -Dtest=none -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite

      the tests fail with

      23/05/22 15:44:11 ERROR SparkConnectService: Error during: execute. UserId: . SessionId: 0f4013ca-3af9-443b-a0e5-e339a827e0cf.
      java.lang.NoClassDefFoundError: org/apache/spark/sql/connect/client/SparkResult
      at java.lang.Class.getDeclaredMethods0(Native Method)
      at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
      at java.lang.Class.getDeclaredMethod(Class.java:2128)
      at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1643)
      at java.io.ObjectStreamClass.access$1700(ObjectStreamClass.java:79)
      at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:520)
      at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:494)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:494)
      at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:391)
      at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:681)
      at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
      at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
      at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1815)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1640)
      at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
      at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)
      at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
      at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
      at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
      at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
      at org.apache.spark.util.Utils$.deserialize(Utils.scala:148)
      at org.apache.spark.sql.connect.planner.SparkConnectPlanner.org$apache$spark$sql$connect$planner$SparkConnectPlanner$$unpackUdf(SparkConnectPlanner.scala:1353)
      at org.apache.spark.sql.connect.planner.SparkConnectPlanner$TypedScalaUdf$.apply(SparkConnectPlanner.scala:761)
      at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformTypedMapPartitions(SparkConnectPlanner.scala:531)
      at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformMapPartitions(SparkConnectPlanner.scala:495)
      at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:143)
      at org.apache.spark.sql.connect.service.SparkConnectStreamHandler.handlePlan(SparkConnectStreamHandler.scala:100)
      at org.apache.spark.sql.connect.service.SparkConnectStreamHandler.$anonfun$handle$2(SparkConnectStreamHandler.scala:87)
      at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
      at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:825)
      at org.apache.spark.sql.connect.service.SparkConnectStreamHandler.$anonfun$handle$1(SparkConnectStreamHandler.scala:53)
      at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
      at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:209)
      at org.apache.spark.sql.connect.artifact.SparkConnectArtifactManager$.withArtifactClassLoader(SparkConnectArtifactManager.scala:178)
      at org.apache.spark.sql.connect.service.SparkConnectStreamHandler.handle(SparkConnectStreamHandler.scala:48)
      at org.apache.spark.sql.connect.service.SparkConnectService.executePlan(SparkConnectService.scala:166)
      at org.apache.spark.connect.proto.SparkConnectServiceGrpc$MethodHandlers.invoke(SparkConnectServiceGrpc.java:611)
      at org.sparkproject.connect.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
      at org.sparkproject.connect.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:352)
      at org.sparkproject.connect.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:866)
      at org.sparkproject.connect.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
      at org.sparkproject.connect.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:750)
      Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.connect.client.SparkResult
      at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
      ... 56 more
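      The NoClassDefFoundError only surfaces when the server's ObjectInputStream resolves the offending class descriptor. One way to see every class a serialized payload will demand, before shipping it anywhere, is to replay the stream and record each descriptor as it is resolved. A minimal sketch (StreamClassLister and its methods are hypothetical helpers, not Spark or Connect APIs):

      ```java
      import java.io.ByteArrayInputStream;
      import java.io.ByteArrayOutputStream;
      import java.io.IOException;
      import java.io.ObjectInputStream;
      import java.io.ObjectOutputStream;
      import java.io.ObjectStreamClass;
      import java.util.ArrayList;
      import java.util.List;

      /** Hypothetical diagnostic helper: lists every class name a
       *  Java-serialized payload asks the deserializer to resolve. */
      public class StreamClassLister {

          /** Serialize an object to bytes (checked exceptions wrapped for brevity). */
          public static byte[] serialize(Object o) {
              ByteArrayOutputStream bos = new ByteArrayOutputStream();
              try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                  oos.writeObject(o);
              } catch (IOException e) {
                  throw new RuntimeException(e);
              }
              return bos.toByteArray();
          }

          /** Replay the payload, recording each class descriptor before resolving it. */
          public static List<String> classesIn(byte[] payload) {
              List<String> names = new ArrayList<>();
              try (ObjectInputStream in =
                       new ObjectInputStream(new ByteArrayInputStream(payload)) {
                           @Override
                           protected Class<?> resolveClass(ObjectStreamClass desc)
                                   throws IOException, ClassNotFoundException {
                               names.add(desc.getName()); // record before loading
                               return super.resolveClass(desc);
                           }
                       }) {
                  in.readObject();
              } catch (IOException | ClassNotFoundException e) {
                  throw new RuntimeException(e);
              }
              return names;
          }

          public static void main(String[] args) {
              // An empty HashMap only drags in java.util.HashMap itself.
              System.out.println(classesIn(serialize(new java.util.HashMap<String, String>())));
          }
      }
      ```

      Run against a payload like the UDF above, the recorded list would reveal the unexpected org.apache.spark.sql.connect.client.SparkResult entry up front, instead of a NoClassDefFoundError halfway through deserialization.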

      For some reason, serializing the

      n => { Thread.sleep(30000); n }

      closure into a UDF ends up referencing the client-side SparkResult class during deserialization on the Spark Connect server, where that class is not available; why is unclear.
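      A plausible mechanism (a sketch, not a confirmed diagnosis of this bug): when a closure body touches any state of its enclosing object, lambda serialization captures the whole enclosing instance, and that instance's fields, including entirely unrelated types, travel with the payload. A self-contained Java illustration with hypothetical names (Unrelated stands in for SparkResult):

      ```java
      import java.io.ByteArrayOutputStream;
      import java.io.IOException;
      import java.io.ObjectOutputStream;
      import java.io.Serializable;
      import java.nio.charset.StandardCharsets;
      import java.util.function.Function;

      /** Sketch of how a closure can drag an unrelated class into its payload.
       *  All names are hypothetical; Unrelated stands in for SparkResult. */
      public class ClosureCapture implements Serializable {

          /** A type the closures below never use for anything meaningful. */
          public static class Unrelated implements Serializable {}

          /** A serializable function type, like a UDF's closure. */
          public interface SerFn extends Function<Integer, Integer>, Serializable {}

          private final Unrelated unrelated = new Unrelated();

          /** Touching any instance state makes the lambda capture `this`, so the
           *  whole enclosing object -- Unrelated field included -- is serialized. */
          public SerFn capturing() {
              return n -> { Unrelated touched = this.unrelated; return n; };
          }

          /** A closure over no instance state serializes standalone. */
          public static SerFn standalone() {
              return n -> n;
          }

          /** Serialize and return the raw stream as text, so embedded class
           *  descriptor names become visible to a simple substring check. */
          public static String streamText(Object fn) {
              ByteArrayOutputStream bos = new ByteArrayOutputStream();
              try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                  oos.writeObject(fn);
              } catch (IOException e) {
                  throw new RuntimeException(e);
              }
              return new String(bos.toByteArray(), StandardCharsets.ISO_8859_1);
          }

          public static void main(String[] args) {
              // The capturing closure's stream names Unrelated; the standalone
              // closure's stream does not.
              System.out.println(streamText(new ClosureCapture().capturing()).contains("Unrelated"));
              System.out.println(streamText(standalone()).contains("Unrelated"));
          }
      }
      ```

      If the Scala closure in the test suite compiles down to something that similarly captures the enclosing suite instance, whose other members do reference SparkResult, that would explain the server-side NoClassDefFoundError.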

       

      See the discussion threads at https://github.com/apache/spark/pull/41005/files/35e500d4cb72f8d3bee21a7f86ee16cbbc8a936c#r1200551487 and https://github.com/apache/spark/pull/41487#discussion_r1221390234 for debugging notes.

      People

            Assignee: Zhen Li
            Reporter: Juliusz Sompolski