Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.4.2
Fix Version/s: None
Description
After updating to Spark 2.4.2, when using the
spark.read.format(...).options(...).load()
chain of methods, we get the following Avro-related error regardless of what parameter is passed to format():
    .options(**load_options)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o69.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
    at java.util.ServiceLoader.fail(ServiceLoader.java:232)
    at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250)
    at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248)
    at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
    at scala.collection.TraversableLike.filter(TraversableLike.scala:262)
    at scala.collection.TraversableLike.filter$(TraversableLike.scala:262)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/FileFormat$class
    at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at java.lang.Class.newInstance(Class.java:442)
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
    ... 29 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.FileFormat$class
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 36 more
The code we run looks like this:
from pyspark.sql import SparkSession

spark_session = (
    SparkSession.builder
    .appName(APPLICATION_NAME)
    .master(MASTER_URL)
    .config('spark.cassandra.connection.host', SERVER_IP_ADDRESS)
    .config('spark.cassandra.auth.username', CASSANDRA_USERNAME)
    .config('spark.cassandra.auth.password', CASSANDRA_PASSWORD)
    .config('spark.sql.shuffle.partitions', 16)
    .config('parquet.enable.summary-metadata', 'true')
    .getOrCreate())

load_options = {
    'keyspace': CASSANDRA_KEYSPACE,
    'table': TABLE_NAME,
    'spark.cassandra.input.fetch.size_in_rows': '150'
}

df = (spark_session.read.format('org.apache.spark.sql.cassandra')
      .options(**load_options)
      .load())
We get the exact same error when trying to read a local .avro file instead of reading from Cassandra, for example with a call like the one sketched below.
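For reference, the local read is a one-liner along these lines (the file path is a placeholder, and the short 'avro' format name is just one example; any value passed to format() triggers the error):

df = (spark_session.read.format('avro')
      .load('/tmp/sample.avro'))  # path is a placeholder, not from our production job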
Up to now we have included the spark-avro .jar using the spark-submit --jars option, roughly as sketched below. The version we used, spark-avro 2.4.0, worked fine with Spark 2.4.1.
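For illustration, the submit command was shaped like this (the jar path, the Scala 2.11 artifact suffix, and the script name are assumptions for the sketch, not taken verbatim from our setup):

# jar path and script name below are placeholders
spark-submit \
  --jars /path/to/spark-avro_2.11-2.4.0.jar \
  our_job.py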
In an attempt to fix the problem we tried updating the .jar to a newer version, and we also tried the --packages option with different version combinations, but none of these worked: the same error shows up every time. One such invocation is sketched below.
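As an example of the shape of the --packages attempts (the coordinates shown are one plausible combination for Spark 2.4.x, not an exhaustive or exact record of what we tried):

# coordinates below are one illustrative combination; script name is a placeholder
spark-submit \
  --packages org.apache.spark:spark-avro_2.12:2.4.2 \
  our_job.py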
When rolling back to Spark 2.4.1 with the exact same setup and code, the error doesn't show up and everything works fine.
Any ideas on what could be causing this?