Livy / LIVY-457

PySpark `sqlContext.sparkSession` incorrect on Spark 2.x


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.5.1, 0.6.0
    • Component/s: None
    • Labels:
      None
    • Environment:
      RHEL6, Spark 2.1.2.1

      Description

      It looks like the SQLContext we create in PySpark sessions isn't constructed correctly. Compare how the behavior has changed between Livy 0.4.0 and what is currently on master (0.6.0).

      Livy 0.4.0

      $ curl --silent -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/1/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sqlContext.sparkSession"}' | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/1/statements/0 | python -m json.tool
      {
          "id": 0,
          "state": "available",
          "output": {
              "status": "ok",
              "execution_count": 0,
              "data": {
                  "text/plain": "<pyspark.sql.session.SparkSession object at 0x15a26d0>"
              }
          },
          "progress": 1.0
      }
      

      Livy 0.6.0

      $ curl --silent -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sqlContext.sparkSession"}' | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/0/statements/0 | python -m json.tool
      {
          "id": 0,
          "code": "sqlContext.sparkSession",
          "state": "available",
          "output": {
              "status": "ok",
              "execution_count": 0,
              "data": {
                  "text/plain": "JavaObject id=o4"
              }
          },
          "progress": 1.0
      }
      
      $ curl --silent localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sqlContext.sparkSession.toString()"}' | python -m json.tool
      
      $ curl --silent localhost:8998/sessions/0/statements/1 | python -m json.tool
      {
          "id": 1,
          "code": "sqlContext.sparkSession.toString()",
          "state": "available",
          "output": {
              "status": "ok",
              "execution_count": 1,
              "data": {
                  "text/plain": "'org.apache.spark.sql.hive.HiveContext@200334d0'"
              }
          },
          "progress": 1.0
      }
      

      Notice how the value of sqlContext.sparkSession went from a pyspark.sql.session.SparkSession to an org.apache.spark.sql.hive.HiveContext?

      I suspect this is caused by the change at https://github.com/apache/incubator-livy/commit/c1aafeb6cb87f2bd7f4cb7cf538822b59fb34a9c#diff-c58e3946d3530f54014129c268988e01R563, which passes jsqlc as the second positional parameter to SQLContext. The diff at https://github.com/apache/spark/commit/89addd40abdacd65cc03ac8aa5f9cf3dd4a4c19b#diff-74ba016ef40c1cb268e14aee817d71bdR50 suggests that jsqlContext is now the third positional parameter, so the positional call binds jsqlc to the wrong parameter.

      I'd wager the fix is simply to pass that parameter explicitly as a keyword argument instead:

      sqlc = SQLContext(sc, jsqlContext=jsqlc)
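
      As a sketch of why the positional call misbinds (these stub classes are illustrative, not Livy's or Spark's actual code; they only mimic the parameter order of SQLContext.__init__ before and after the Spark 2.x change), note how the same call site silently drops jsqlc into the new second slot:

      ```python
      class OldSQLContext:
          # Spark 1.x-era ordering: jsqlContext is the second parameter
          def __init__(self, sparkContext, jsqlContext=None):
              self.jsqlContext = jsqlContext

      class NewSQLContext:
          # Spark 2.x ordering: sparkSession was inserted as the second
          # parameter, pushing jsqlContext into third position
          def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
              self.sparkSession = sparkSession
              self.jsqlContext = jsqlContext

      sc, jsqlc = object(), object()   # stand-ins for the real contexts

      old = OldSQLContext(sc, jsqlc)         # jsqlc lands where intended
      broken = NewSQLContext(sc, jsqlc)      # jsqlc silently becomes sparkSession
      fixed = NewSQLContext(sc, jsqlContext=jsqlc)  # keyword arg survives reordering

      assert old.jsqlContext is jsqlc
      assert broken.sparkSession is jsqlc and broken.jsqlContext is None
      assert fixed.jsqlContext is jsqlc and fixed.sparkSession is None
      ```

      This also explains the symptoms above: the Java gateway object ends up exposed as sparkSession, so the REPL prints "JavaObject id=o4" instead of a pyspark.sql.session.SparkSession.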
      

      People

      • Assignee: Saisai Shao (jerryshao)
      • Reporter: Dan Fike (danfike)
      • Votes: 0
      • Watchers: 2
