LIVY-504: Livy pyspark sqlContext behavior does not match pyspark shell


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.5.0
    • Fix Version/s: None
    • Component/s: Core
    • Labels: None
    • Environment: AWS EMR 5.16.0

      Description

      On 0.5.0 I'm seeing inconsistent behavior through Livy around the SparkContext and SQLContext compared to the pyspark shell.

      For example, running this through the pyspark shell works:

      [root@ip-10-0-0-32 ~]# pyspark
      Python 2.7.14 (default, May 2 2018, 18:31:34)
      [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      18/08/28 18:50:37 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
            /_/
      
      Using Python version 2.7.14 (default, May 2 2018 18:31:34)
      SparkSession available as 'spark'.
      >>> from pyspark.sql import SQLContext
      >>> my_sql_context = SQLContext.getOrCreate(sc)
      >>> df = my_sql_context.read.parquet('s3://my-bucket/mydata.parquet')
      >>> print(df.count())
      67556724
      

      But through Livy, the same code throws an exception:

      from pyspark.sql import SQLContext
      my_sql_context = SQLContext.getOrCreate(sc)
      df = my_sql_context.read.parquet('s3://my-bucket/mydata.parquet')
      
      'JavaMember' object has no attribute 'read'
      Traceback (most recent call last):
        File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 433, in read
          return DataFrameReader(self)
        File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 70, in __init__
          self._jreader = spark._ssql_ctx.read()
      AttributeError: 'JavaMember' object has no attribute 'read'

      Trying to use the default-initialized sqlContext also throws the same error:

      df = sqlContext.read.parquet('s3://my-bucket/mydata.parquet')
      
      'JavaMember' object has no attribute 'read'
      Traceback (most recent call last):
        File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 433, in read
          return DataFrameReader(self)
        File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 70, in __init__
          self._jreader = spark._ssql_ctx.read()
      AttributeError: 'JavaMember' object has no attribute 'read'

      In both the pyspark shell and Livy, the objects look the same.

      pyspark shell:

      >>> print(sc)
      <SparkContext master=yarn appName=PySparkShell>
      >>> print(sqlContext)
      <pyspark.sql.context.SQLContext object at 0x7fd15dfc3450>
      >>> print(my_sql_context)
      <pyspark.sql.context.SQLContext object at 0x7fd15dfc3450>

      livy:

      print(sc)
      <SparkContext master=yarn appName=livy-session-1>
      
      print(sqlContext)
      <pyspark.sql.context.SQLContext object at 0x7f478c06b850>
      
      print(my_sql_context)
      <pyspark.sql.context.SQLContext object at 0x7f478c06b850>
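
      Although the Python-side reprs match, the traceback points at sqlContext._ssql_ctx, so the difference is presumably in the Java-side handle. A quick diagnostic sketch (just a sketch; _ssql_ctx is a pyspark internal, and my reading of the traceback is that under Livy it comes back as a py4j JavaMember instead of a JavaObject, which I haven't separately confirmed):

      # Compare the Java-side handle backing the SQLContext in each environment.
      # In the pyspark shell I'd expect a py4j JavaObject here; the traceback above
      # suggests that under Livy it is a JavaMember (an unbound method) instead.
      print(type(sqlContext._ssql_ctx))
      print(type(my_sql_context._ssql_ctx))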

      I'm running this through sparkmagic, but I've also confirmed the same behavior when calling the REST API directly.

      curl --silent -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions | python -m json.tool
      {
          "appId": null,
          "appInfo": {
              "driverLogUrl": null,
              "sparkUiUrl": null
          },
          "id": 3,
          "kind": "pyspark",
          "log": [
              "stdout: ",
              "\nstderr: ",
              "\nYARN Diagnostics: "
          ],
          "owner": null,
          "proxyUser": null,
          "state": "starting"
      }
      
      curl --silent localhost:8998/sessions/3/statements -X POST -H 'Content-Type: application/json' -d '{"code":"df = sqlContext.read.parquet(\"s3://my-bucket/mydata.parquet\")"}' | python -m json.tool
      {
          "code": "df = sqlContext.read.parquet(\"s3://my-bucket/mydata.parquet\")",
          "id": 1,
          "output": null,
          "progress": 0.0,
          "state": "running"
      }
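
      Polling that statement afterwards (session 3, statement 1, matching the responses above) returns the same AttributeError in the statement output:

      curl --silent localhost:8998/sessions/3/statements/1 | python -m json.tool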
      

      On 0.4.0, both the pyspark shell and the Livy versions of this code worked.
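
      For what it's worth, a possible workaround sketch would be to read through the SparkSession instead of the SQLContext, since that path does not touch _ssql_ctx. This assumes the Livy 0.5.0 pyspark session also exposes the spark variable; I haven't verified that it avoids the problem:

      # Workaround sketch: use the SparkSession's DataFrameReader directly rather than
      # SQLContext.read, which is the attribute access that fails above.
      df = spark.read.parquet('s3://my-bucket/mydata.parquet')
      print(df.count())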

            People

            • Assignee: Unassigned
            • Reporter: adambronte (Adam Bronte)