Description
Using Spark 0.9.0, Python 2.6.6, and IPython 1.1.0.
The problem: if I want to run a Python script as a standalone app, the docs say I should execute the command "pyspark myscript.py". This works as long as IPYTHON=0, but it does not work if IPYTHON=1.
This problem arose for me because I tried to save myself some typing by setting IPYTHON=1 in my shell profile script, which then meant I was unable to execute standalone pyspark scripts.
My analysis:
In the pyspark script, command line arguments are simply ignored when IPython is used:
if [[ "$IPYTHON" = "1" ]] ; then exec ipython $IPYTHON_OPTS else exec "$PYSPARK_PYTHON" "$@" fi
I thought I could get around this by changing the script to pass "$@" to ipython as well. However, this doesn't work: doing so results in an error saying multiple SparkContexts can't be run at once.
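For illustration, the change I tried was roughly the following (a sketch of the edited branch, not an exact diff):

if [[ "$IPYTHON" = "1" ]]; then
  # forward the script and its arguments to ipython as well
  exec ipython $IPYTHON_OPTS "$@"
else
  exec "$PYSPARK_PYTHON" "$@"
fi

It still fails, because shell.py (run via PYTHONSTARTUP, see below) has already created a SparkContext by the time the user script tries to create its own.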
This is because of a quirk (feature? bug?) in IPython's handling of the PYTHONSTARTUP environment variable. The pyspark script sets this variable to point to the python/shell.py script, which initializes the SparkContext. Regular Python runs the PYTHONSTARTUP script ONLY when it is invoked in interactive mode; when run with a script, it ignores the variable. IPython, however, runs the PYTHONSTARTUP script every time, regardless, which means it always executes Spark's shell.py to initialize the SparkContext, even when it was invoked with a script.
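The difference is easy to see with a trivial startup file (hypothetical file names, just for illustration):

echo 'print("startup file ran")' > startup.py
echo 'print("script ran")' > myscript.py

PYTHONSTARTUP=startup.py python myscript.py    # prints only "script ran"; the variable is ignored
PYTHONSTARTUP=startup.py python                # interactive: prints "startup file ran" first
PYTHONSTARTUP=startup.py ipython myscript.py   # per the behavior described above, runs startup.py and then the script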
Proposed solution:
Short term: add this information to the Spark docs regarding IPython, something like: "Note: IPython can only be used interactively. Use regular Python to execute pyspark script files."
Long term: change the pyspark script to detect whether arguments were passed in; if so, just call regular Python instead of IPython, or don't set the PYTHONSTARTUP variable. Or maybe fix shell.py to detect when it's being invoked non-interactively and not initialize sc. See the sketch below for the first option.
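A minimal sketch of that first option (assuming the surrounding launcher already defines IPYTHON, IPYTHON_OPTS, and PYSPARK_PYTHON, as the current script does):

if [[ "$IPYTHON" = "1" && $# -eq 0 ]]; then
  # interactive use: keep the current behavior; shell.py (via PYTHONSTARTUP) sets up sc
  exec ipython $IPYTHON_OPTS
else
  # a script was passed: bypass ipython (and PYTHONSTARTUP) and run it with plain python
  exec "$PYSPARK_PYTHON" "$@"
fi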