Livy / LIVY-457

PySpark `sqlContext.sparkSession` incorrect on Spark 2.x


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.5.1, 0.6.0
    • Component/s: None
    • Labels: None
    • Environment: RHEL6, Spark 2.1.2.1

    Description

      It looks like the SQLContext we create in PySpark sessions isn't constructed correctly. Compare how the behavior has changed between Livy 0.4.0 and what is currently on master (0.6.0).

      Livy 0.4.0

      $ curl --silent -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/1/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sqlContext.sparkSession"}' | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/1/statements/0 | python -m json.tool
      {
          "id": 0,
          "state": "available",
          "output": {
              "status": "ok",
              "execution_count": 0,
              "data": {
                  "text/plain": "<pyspark.sql.session.SparkSession object at 0x15a26d0>"
              }
          },
          "progress": 1.0
      }
      

      Livy 0.6.0

      $ curl --silent -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sqlContext.sparkSession"}' | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/0/statements/0 | python -m json.tool
      {
          "id": 0,
          "code": "sqlContext.sparkSession",
          "state": "available",
          "output": {
              "status": "ok",
              "execution_count": 0,
              "data": {
                  "text/plain": "JavaObject id=o4"
              }
          },
          "progress": 1.0
      }
      
      $ curl --silent localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sqlContext.sparkSession.toString()"}' | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/0/statements/1 | python -m json.tool
      {
          "id": 1,
          "code": "sqlContext.sparkSession.toString()",
          "state": "available",
          "output": {
              "status": "ok",
              "execution_count": 1,
              "data": {
                  "text/plain": "'org.apache.spark.sql.hive.HiveContext@200334d0'"
              }
          },
          "progress": 1.0
      }
      

      Notice how the value of sqlContext.sparkSession went from a pyspark.sql.session.SparkSession to an org.apache.spark.sql.hive.HiveContext?

      I suspect this is because of the change @ https://github.com/apache/incubator-livy/commit/c1aafeb6cb87f2bd7f4cb7cf538822b59fb34a9c#diff-c58e3946d3530f54014129c268988e01R563, which passes jsqlc as the second positional parameter to SQLContext, whereas the diff @ https://github.com/apache/spark/commit/89addd40abdacd65cc03ac8aa5f9cf3dd4a4c19b#diff-74ba016ef40c1cb268e14aee817d71bdR50 suggests it should be the third positional parameter.

      I'd wager the fix is simply to explicitly pass that parameter as a keyword argument instead.

      sqlc = SQLContext(sc, jsqlContext=jsqlc)
      
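      To illustrate the mismatch without a running cluster, here is a minimal sketch using a stub class that mirrors the Spark 2.x SQLContext signature (sparkContext, sparkSession=None, jsqlContext=None); the stub and its variable names are illustrative, not Livy's actual code.

      ```python
      # Stub mirroring the (assumed) PySpark 2.x SQLContext.__init__ signature:
      #   def __init__(self, sparkContext, sparkSession=None, jsqlContext=None)
      # A *positional* second argument therefore lands in `sparkSession`,
      # not `jsqlContext`.
      class StubSQLContext:
          def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
              self.sparkSession = sparkSession
              self.jsqlContext = jsqlContext

      sc = object()     # stand-in for the SparkContext
      jsqlc = object()  # stand-in for the Java-side SQLContext (a py4j JavaObject)

      # Buggy call: jsqlc binds to sparkSession, so sqlContext.sparkSession
      # exposes a raw Java object instead of a pyspark SparkSession.
      buggy = StubSQLContext(sc, jsqlc)
      assert buggy.sparkSession is jsqlc and buggy.jsqlContext is None

      # Proposed fix: pass it by keyword so it lands in the right slot.
      fixed = StubSQLContext(sc, jsqlContext=jsqlc)
      assert fixed.sparkSession is None and fixed.jsqlContext is jsqlc
      ```

      Passing by keyword also keeps the call correct even if intermediate optional parameters are added or reordered in a future Spark release.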

          People

            Assignee: Saisai Shao (jerryshao)
            Reporter: Dan Fike (danfike)
