- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 0.5.0
- Fix Version/s: None
- Component/s: Core
- Labels: None
- Environment: AWS EMR 5.16.0
On 0.5.0 I'm seeing inconsistent behavior through Livy regarding the SparkContext and sqlContext compared to the pyspark shell.
For example, running this through the pyspark shell works:
[root@ip-10-0-0-32 ~]# pyspark
Python 2.7.14 (default, May 2 2018, 18:31:34)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
18/08/28 18:50:37 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.14 (default, May 2 2018 18:31:34)
SparkSession available as 'spark'.
>>> from pyspark.sql import SQLContext
>>> my_sql_context = SQLContext.getOrCreate(sc)
>>> df = my_sql_context.read.parquet('s3://my-bucket/mydata.parquet')
>>> print(df.count())
67556724
But through Livy, the same code throws an exception:
from pyspark.sql import SQLContext
my_sql_context = SQLContext.getOrCreate(sc)
df = my_sql_context.read.parquet('s3://my-bucket/mydata.parquet')

'JavaMember' object has no attribute 'read'
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 433, in read
    return DataFrameReader(self)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 70, in __init__
    self._jreader = spark._ssql_ctx.read()
AttributeError: 'JavaMember' object has no attribute 'read'
Trying to use the default-initialized sqlContext also throws the same error:
df = sqlContext.read.parquet('s3://my-bucket/mydata.parquet')

'JavaMember' object has no attribute 'read'
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 433, in read
    return DataFrameReader(self)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 70, in __init__
    self._jreader = spark._ssql_ctx.read()
AttributeError: 'JavaMember' object has no attribute 'read'
In both the pyspark shell and the Livy session, the objects look the same.
pyspark shell:
>>> print(sc)
<SparkContext master=yarn appName=PySparkShell>
>>> print(sqlContext)
<pyspark.sql.context.SQLContext object at 0x7fd15dfc3450>
>>> print(my_sql_context)
<pyspark.sql.context.SQLContext object at 0x7fd15dfc3450>
livy:
print(sc)
<SparkContext master=yarn appName=livy-session-1>
print(sqlContext)
<pyspark.sql.context.SQLContext object at 0x7f478c06b850>
print(my_sql_context)
<pyspark.sql.context.SQLContext object at 0x7f478c06b850>
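The reprs above don't reveal where the two sessions actually differ. A quick diagnostic like the sketch below (not output I've captured; the attribute name comes straight from the traceback) should show it: the traceback suggests that in the Livy session sqlContext._ssql_ctx resolves to a py4j JavaMember (a bound Java method) instead of a JavaObject wrapping the JVM SQLContext.

# Diagnostic sketch: compare the underlying JVM handle in the pyspark shell vs. a Livy pyspark session.
print(type(sc))                        # pyspark.context.SparkContext in both
print(type(sqlContext))                # pyspark.sql.context.SQLContext in both
print(type(sqlContext._ssql_ctx))      # pyspark shell: py4j JavaObject; the traceback
                                       # suggests Livy ends up with a JavaMember here
print(type(my_sql_context._ssql_ctx))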
I'm running this through sparkmagic, but I've also confirmed the same behavior when calling the Livy REST API directly:
curl --silent -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions | python -m json.tool
{
    "appId": null,
    "appInfo": {
        "driverLogUrl": null,
        "sparkUiUrl": null
    },
    "id": 3,
    "kind": "pyspark",
    "log": [
        "stdout: ",
        "\nstderr: ",
        "\nYARN Diagnostics: "
    ],
    "owner": null,
    "proxyUser": null,
    "state": "starting"
}
curl --silent localhost:8998/sessions/3/statements -X POST -H 'Content-Type: application/json' -d '{"code":"df = sqlContext.read.parquet(\"s3://my-bucket/mydata.parquet\")"}' | python -m json.tool
{
    "code": "df = sqlContext.read.parquet(\"s3://my-bucket/mydata.parquet\")",
    "id": 1,
    "output": null,
    "progress": 0.0,
    "state": "running"
}
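The statement is still in the running state at that point. To retrieve the result over the API, a poll of the statement endpoint along the lines of the sketch below (using the requests package and the session/statement ids from the calls above; this is not one of the original commands) should eventually show the same AttributeError in the statement's output field.

# Sketch: poll GET /sessions/{sessionId}/statements/{statementId} until the statement finishes.
import json
import time
import requests  # assumes the requests package is available on the client

statement_url = "http://localhost:8998/sessions/3/statements/1"
while True:
    statement = requests.get(statement_url).json()
    if statement["state"] not in ("waiting", "running"):
        break
    time.sleep(1)

print(json.dumps(statement.get("output"), indent=2))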
When running on 0.4.0, both the pyspark shell and Livy versions of this code worked.