SPARK-39813

Unable to connect to Presto in Pyspark: java.lang.ClassNotFoundException: com.facebook.presto.jdbc.PrestoDriver

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Environment: A Docker container built from the jupyter/all-spark-notebook image.

    Description

      My team has a bash script and a Python script that use PySpark to extract data from a Hive database. The scripts work when run on a server, but we need to containerize the job because we will not have access to that server in the future.

      Thus, I am trying to get the job to work from a container.

      When I run the scripts locally or in a Docker container, I run into driver issues. Unfortunately, nobody on my team set up the environment on the server where everything works, so we are having a hard time figuring out what is different in our local/containerized environments and cannot replicate a successful script run.

      From a container, I run a bash script that does the following:

      ```{bash}
      $SPARK_HOME/bin/spark-submit etl_job.py
      ```
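
      (For context: the standard way to put a JDBC driver jar on the driver and executor classpath at submit time is spark-submit's --jars flag. A minimal sketch, assuming the presto-jdbc jar has already been downloaded into the container; the path and version are placeholders, not from our setup:)

      ```{bash}
      # Hypothetical jar location; adjust to wherever presto-jdbc actually lives.
      $SPARK_HOME/bin/spark-submit \
        --jars /opt/jars/presto-jdbc-<version>.jar \
        etl_job.py
      ```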

       

      The contents of `etl_job.py` are as follows:

      ```{python}
      print('\n\nStarting python job\n\n')
      from pyspark.context import SparkContext
      from pyspark.sql.session import SparkSession
      from credentials import PRESTO_USER, PRESTO_PASSWORD, PRESTO_URL, PRESTO_SSL, SSL_PASSWORD, TEST_SCHEMA, TEST_TABLE
      import pandas as pd

      print('\n\nStarting spark session \n\n')
      sc = SparkContext.getOrCreate()
      spark = SparkSession(sc)

      print('\n\nConnecting to Presto\n\n')
      # Configure a JDBC reader for Presto; nothing runs until .load() is called.
      Prestoprod = (
          spark.read.format("jdbc")
          .option("url", PRESTO_URL)
          .option("user", PRESTO_USER)
          .option("password", PRESTO_PASSWORD)
          .option("driver", "com.facebook.presto.jdbc.PrestoDriver")
          .option("SSL", "true")
          .option("SSLKeyStorePath", PRESTO_SSL)
          .option("SSLKeyStorePassword", SSL_PASSWORD)
      )

      print('\n\nTrying to query Presto.\n\n')
      query = "select * from hive.{}.{}".format(TEST_SCHEMA, TEST_TABLE)

      # The ClassNotFoundException below is raised from this .load() call,
      # when Spark tries to register the configured driver class.
      results = (
          Prestoprod
          .option("query", query)
          .load()
      )
      ```
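
      (Aside: since etl_job.py creates its own session, another common way to expose a driver jar is the spark.jars configuration. A minimal sketch, not our actual script; the jar path is a placeholder:)

      ```{python}
      from pyspark.sql import SparkSession

      # spark.jars only takes effect if no SparkContext is running yet,
      # so it must be set when the session is first created.
      # The jar path below is hypothetical.
      spark = (
          SparkSession.builder
          .appName("etl_job")
          .config("spark.jars", "/opt/jars/presto-jdbc-<version>.jar")
          .getOrCreate()
      )
      ```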

       

      However, when I run the job, I get the following error:

      ```
      py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
      : java.lang.ClassNotFoundException: com.facebook.presto.jdbc.PrestoDriver
      ```

       

      Here is the full log:

      ```

      Starting python job

       

      Starting spark session 

      Setting default log level to "WARN".
      To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
      22/07/19 04:45:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

      Connecting to Presto

       

      Trying to query Presto.

      Traceback (most recent call last):
        File "/home/jovyan/etl_job.py", line 30, in <module>
          .load()
        File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 184, in load
          return self._df(self._jreader.load())
        File "/usr/local/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in {}call{}
        File "/usr/local/spark/python/pyspark/sql/utils.py", line 190, in deco
          return f(*a, **kw)
        File "/usr/local/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
      : java.lang.ClassNotFoundException: com.facebook.presto.jdbc.PrestoDriver
              at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
              at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
              at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
              at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
              at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:101)
              at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:101)
              at scala.Option.foreach(Option.scala:437)
              at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:101)
              at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:39)
              at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:34)
              at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
              at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
              at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
              at scala.Option.getOrElse(Option.scala:201)
              at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
              at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
              at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
              at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.base/java.lang.reflect.Method.invoke(Method.java:568)
              at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
              at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
              at py4j.Gateway.invoke(Gateway.java:282)
              at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
              at py4j.commands.CallCommand.execute(CallCommand.java:79)
              at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
              at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
              at java.base/java.lang.Thread.run(Thread.java:833)

      ```
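
      The trace shows DriverRegistry.register failing inside ClassLoader.loadClass, i.e. no jar on the driver JVM's classpath provides com.facebook.presto.jdbc.PrestoDriver. A quick way to check this from PySpark (a diagnostic sketch using py4j, assuming an active session named spark; the classloader consulted here can differ slightly from the one DriverRegistry uses):

      ```{python}
      from py4j.protocol import Py4JJavaError

      # Ask the driver JVM directly whether it can see the Presto driver class.
      try:
          spark._jvm.java.lang.Class.forName("com.facebook.presto.jdbc.PrestoDriver")
          print("PrestoDriver is visible to the JVM")
      except Py4JJavaError:
          print("PrestoDriver is NOT visible; the presto-jdbc jar is missing from the classpath")
      ```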

       

      Other relevant info: I am trying to run this in a container built from the jupyter/all-spark-notebook image from Docker Hub.
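
      (If the jar has to be baked into that image, a minimal sketch, assuming network access at build time, is to drop it into Spark's jars directory, which the log above shows at /usr/local/spark; the Maven coordinates are real but the version is a placeholder:)

      ```{bash}
      # Hypothetical image-build step: fetch the Presto JDBC driver into
      # $SPARK_HOME/jars so it is always on the JVM classpath.
      wget -P /usr/local/spark/jars \
        https://repo1.maven.org/maven2/com/facebook/presto/presto-jdbc/<version>/presto-jdbc-<version>.jar
      ```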

      People

        Assignee: Unassigned
        Reporter: David Lassiter (dlaz)