Livy / LIVY-457

PySpark `sqlContext.sparkSession` incorrect on Spark 2.x


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.5.1, 0.6.0
    • Component/s: None
    • Labels: None
    • Environment: RHEL6, Spark 2.1.2.1

    Description

      It looks like the SQLContext we create in PySpark sessions isn't constructed correctly. Compare how the behavior has changed between Livy 0.4.0 and what is currently on master (0.6.0).

      Livy 0.4.0

      $ curl --silent -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/1/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sqlContext.sparkSession"}' | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/1/statements/0 | python -m json.tool
      {
          "id": 0,
          "state": "available",
          "output": {
              "status": "ok",
              "execution_count": 0,
              "data": {
                  "text/plain": "<pyspark.sql.session.SparkSession object at 0x15a26d0>"
              }
          },
          "progress": 1.0
      }
      

      Livy 0.6.0

      $ curl --silent -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sqlContext.sparkSession"}' | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/0/statements/0 | python -m json.tool
      {
          "id": 0,
          "code": "sqlContext.sparkSession",
          "state": "available",
          "output": {
              "status": "ok",
              "execution_count": 0,
              "data": {
                  "text/plain": "JavaObject id=o4"
              }
          },
          "progress": 1.0
      }
      
      $ curl --silent localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sqlContext.sparkSession.toString()"}' | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/0/statements/1 | python -m json.tool
      {
          "id": 1,
          "code": "sqlContext.sparkSession.toString()",
          "state": "available",
          "output": {
              "status": "ok",
              "execution_count": 1,
              "data": {
                  "text/plain": "'org.apache.spark.sql.hive.HiveContext@200334d0'"
              }
          },
          "progress": 1.0
      }
      

      Notice how the value of sqlContext.sparkSession went from a pyspark.sql.session.SparkSession to an org.apache.spark.sql.hive.HiveContext?

      I suspect this is because of the change @ https://github.com/apache/incubator-livy/commit/c1aafeb6cb87f2bd7f4cb7cf538822b59fb34a9c#diff-c58e3946d3530f54014129c268988e01R563, which passes jsqlc as the second positional parameter to SQLContext, whereas the diff @ https://github.com/apache/spark/commit/89addd40abdacd65cc03ac8aa5f9cf3dd4a4c19b#diff-74ba016ef40c1cb268e14aee817d71bdR50 suggests it should be the third positional parameter.

      I'd wager the fix is simply to explicitly pass that parameter as a keyword argument instead.

      sqlc = SQLContext(sc, jsqlContext=jsqlc)
      
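      To illustrate the mismatch without a running cluster, here is a minimal sketch using a stub class that mirrors the Spark 2.x SQLContext signature (sparkContext, sparkSession=None, jsqlContext=None); the stub and its variable names are illustrative, not Livy's actual code.

      ```python
      # Stub mirroring the (assumed) PySpark 2.x SQLContext.__init__ signature:
      #   def __init__(self, sparkContext, sparkSession=None, jsqlContext=None)
      # A *positional* second argument therefore lands in `sparkSession`,
      # not `jsqlContext`.
      class StubSQLContext:
          def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
              self.sparkSession = sparkSession
              self.jsqlContext = jsqlContext

      sc = object()     # stand-in for the SparkContext
      jsqlc = object()  # stand-in for the Java-side SQLContext (a py4j JavaObject)

      # Buggy call: jsqlc binds to sparkSession, so sqlContext.sparkSession
      # exposes a raw Java object instead of a pyspark SparkSession.
      buggy = StubSQLContext(sc, jsqlc)
      assert buggy.sparkSession is jsqlc and buggy.jsqlContext is None

      # Proposed fix: pass it by keyword so it lands in the right slot.
      fixed = StubSQLContext(sc, jsqlContext=jsqlc)
      assert fixed.sparkSession is None and fixed.jsqlContext is jsqlc
      ```

      Passing by keyword also keeps the call correct even if intermediate optional parameters are added or reordered in a future Spark release.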

          People

            Assignee: Saisai Shao (jerryshao)
            Reporter: Dan Fike (danfike)
