  1. Livy
  2. LIVY-322

JsonParseException on failure to parse text output from subprocess call to hadoop fs -rm

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.3
    • Fix Version/s: 0.9.0
    • Component/s: API, Interpreter
    • Labels: None

    Description

      In a pyspark session, running subprocess.call() to do a "hadoop fs -rm" on a Hadoop 2.7 cluster causes a JsonParseException in Livy. The trigger is the text response from "hadoop fs -rm" (a message that the file was moved to the .Trash folder in HDFS). After the exception, every subsequent statement in the session fails to return correct output.

      I suspect something in the response from hadoop fs is tripping up Livy in the conversion to JSON, perhaps a reserved or special character that Livy is not filtering out, as the response is otherwise innocuous.
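
      The exception text in the example below reports the failure at the token 'Moved' (line 1, column 6), which is the first word of the message. That hints that the whole line reached a JSON parser as-is, rather than one special character being mishandled. A minimal sketch of that reading follows; it assumes Python's json module as a stand-in for Jackson, and is_json is a hypothetical helper, not part of Livy.

```python
import json

# Text quoted in the JsonParseException from the example session.
line = "Moved: 'foo.tmp' to trash at: .Trash/Current"

def is_json(text):
    """Return True if text parses as a JSON document (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return False

print(is_json(line))                # False: plain text, not JSON
print(is_json('{"status": "ok"}'))  # True: a JSON object
print(is_json("Moved"))             # False: a bare word alone is rejected
```

      Under this reading there is nothing to filter out of the message; any plain-text line would fail the same way.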

      Livy needs to handle the response without throwing an exception, and if an exception is thrown anyway, the session should recover and continue running statements correctly. After the JsonParseException, even a print(1) statement fails to return its output, forcing the user to create a new session.
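
      In the example below, the statement that uses subprocess.check_output() (which captures the child's output) succeeds, and the subprocess.call() to -touchz, which returns only its exit code, also succeeds. Only the subprocess.call() whose child prints the "Moved" message fails. One possible workaround, untested against Livy and based on the assumption that the child's inherited stdout is what reaches the parser, is to capture the output explicitly. run_captured is a hypothetical helper name, and the child command here is a stand-in for hadoop fs -rm.

```python
import subprocess
import sys

def run_captured(cmd):
    """Run cmd with stdout and stderr piped back to the caller.

    Returns (return_code, output_text). Nothing the child prints is
    written to the interpreter's own stdout.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode, out.decode("utf-8", "replace")

# Stand-in child process that prints a line similar to the hadoop message.
child = [sys.executable, "-c", "print(\"Moved: 'foo.tmp' to trash\")"]
rc, text = run_captured(child)
print(rc, repr(text))
```

      In the session this would be called as run_captured(["hadoop", "fs", "-rm", "foo.tmp"]). It only sidesteps the problem; it does not address the session failing to recover.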

      Example follows below.

      ### CREATE A NEW PYSPARK SESSION
      -bash-4.1$ curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions
      {"id":2,"appId":null,"owner":null,"proxyUser":null,"state":"starting","kind":"pyspark","appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":[]}
      
      ### CHECK THE STATE OF SESSION 2 UNTIL IT GOES FROM "STARTING" STATE TO "IDLE" STATE
      -bash-4.1$ curl localhost:8998/sessions/2
      {"id":2,"appId":null,"owner":null,"proxyUser":null,"state":"starting","kind":"pyspark","appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":[]}
      -bash-4.1$ curl localhost:8998/sessions/2
      {"id":2,"appId":null,"owner":null,"proxyUser":null,"state":"idle","kind":"pyspark","appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":[]}
      
      ### RUN THE PYSPARK CODE IN SESSION 2, "import subprocess"
      -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H 'Content-Type: application/json' -d '{"code":"import subprocess"}'
      {"id":0,"state":...}
      
      ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
      -bash-4.1$ curl localhost:8998/sessions/2/statements/0
      {"id":0,"state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":""}}}
      ### THE OUTPUT IS {"text/plain":""} WHICH IS EXPECTED AND CORRECT
      
      ### RUN THE PYSPARK CODE IN SESSION 2, "subprocess.call(["hadoop", "fs", "-touchz", "foo.tmp"])"
      -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H 'Content-Type: application/json' -d '{"code":"subprocess.call([\"hadoop\", \"fs\", \"-touchz\", \"foo.tmp\"])"}'
      {"id":1,"state":"running","output":null}
      
      ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
      -bash-4.1$ curl localhost:8998/sessions/2/statements/1
      {"id":1,"state":"available","output":{"status":"ok","execution_count":1,"data":{"text/plain":"0"}}}
      ### THE OUTPUT IS {"text/plain":"0"} WHICH IS EXPECTED OUTPUT THAT THE TOUCHZ COMPLETED WITH RETURN CODE 0.
      
      ### RUN THE PYSPARK CODE IN SESSION 2, "print(subprocess.check_output(["hadoop", "fs", "-ls", "foo.tmp"]))"
      -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H 'Content-Type: application/json' -d '{"code":"print(subprocess.check_output([\"hadoop\", \"fs\", \"-ls\", \"foo.tmp\"]))"}'
      {"id":2,"state":"waiting","output":null}
      
      ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
      -bash-4.1$ curl localhost:8998/sessions/2/statements/2
      {"id":2,"state":"available","output":{"status":"ok","execution_count":2,"data":{"text/plain":"-rw-------   3 username group          0 2017-02-23 19:26 foo.tmp"}}}
      ### THE OUTPUT IS {"text/plain":"-rw-------   3 username group          0 2017-02-23 19:26 foo.tmp"} WHICH IS EXPECTED OUTPUT OF DIRECTORY LISTING
      
      ### RUN THE PYSPARK CODE IN SESSION 2, "subprocess.call(["hadoop", "fs", "-rm", "foo.tmp"])"
      -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H 'Content-Type: application/json' -d '{"code":"subprocess.call([\"hadoop\", \"fs\", \"-rm\", \"foo.tmp\"])"}'
      {"id":3,"state":"waiting","output":null}
      
      ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
      -bash-4.1$ curl localhost:8998/sessions/2/statements/3
      {"id":3,"state":"available","output":{"status":"error","execution_count":3,"ename":"com.fasterxml.jackson.core.JsonParseException","evalue":"Unrecognized token 'Moved': was expecting ('true', 'false' or 'null')\n at [Source: Moved: 'foo.tmp' to trash at: .Trash/Current; line: 1, column: 6]","traceback":[]}}
      ### JSON EXCEPTION APPEARS HERE WHICH IS INCORRECT PARSING OF THE OUTPUT
      
      ### RUN THE PYSPARK CODE IN SESSION 2, "print(1)"
      -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H 'Content-Type: application/json' -d '{"code":"print(1)"}'
      {"id":4,"state":"available","output":null}
      
      ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
      -bash-4.1$ curl localhost:8998/sessions/2/statements/4
      {"id":4,"state":"available","output":{"status":"ok","execution_count":4,"data":{"text/plain":""}}}
      ### THE OUTPUT IS {"text/plain":""} WHICH IS EMPTY STRING, INDICATING OPERATION COMPLETED WITH NO OUTPUT, WHICH IS INCORRECT, IT SHOULD RETURN 1
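
      The request bodies above escape the inner double quotes by hand. As a convenience when reproducing this, the same bodies can be generated instead. This is a small sketch; statement_body is a hypothetical helper, and the only thing taken from the transcript is the {"code": ...} shape.

```python
import json

def statement_body(code):
    """Serialize a statement request body for the given source code."""
    return json.dumps({"code": code})

body = statement_body('subprocess.call(["hadoop", "fs", "-rm", "foo.tmp"])')
print(body)
# {"code": "subprocess.call([\"hadoop\", \"fs\", \"-rm\", \"foo.tmp\"])"}
```

      The printed string can be passed to curl with -d, wrapped in single quotes as in the commands above.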
      

          People

            Assignee: Unassigned
            Reporter: Rick Bernotas (rickbernotas)
            Votes: 1
            Watchers: 5
