Details
- Type: Bug
- Status: Triage Needed
- Priority: P3
- Resolution: Unresolved
Description
I have been trying to run the Python word-count example on an AWS EMR cluster, but it does not work.
Things I have tried:
- Running with
python3 py_codes/word_count_beam.py --output word_count_output --runner=SparkRunner
This implicitly runs with --spark-master-url local[4], which defeats the purpose of running on a cluster.
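As a minimal sketch of that fallback (illustration only, not actual Beam code; the flag name and local[4] default mirror the behaviour observed above):

```python
import argparse

# Illustration only: when --spark-master-url is not supplied, the
# SparkRunner falls back to an in-process local master, so the job
# never reaches the YARN cluster.
parser = argparse.ArgumentParser()
parser.add_argument("--spark-master-url", dest="spark_master_url",
                    default="local[4]")
args, _ = parser.parse_known_args([])  # no flag passed, as in the command above
print(args.spark_master_url)  # local[4]
```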
- Tried
python3 py_codes/word_count_beam.py --output word_count_output --runner=SparkRunner --spark-master-url=yarn
It still uses the local master.
- Could not use the method described in https://beam.apache.org/documentation/runners/spark/ under "Running on a pre-deployed Spark cluster", because with YARN the master is not exposed at a URL like localhost:7077.
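For reference, the documented pre-deployed-cluster flow looks roughly like this (job-server image, port 8099, and environment type are taken from the Spark runner page and may differ by Beam version); it presupposes a spark://host:7077 endpoint, which YARN does not provide:

```shell
# Start the Beam Spark job server against a standalone master
# (not possible with YARN, which exposes no spark://host:7077 URL).
docker run --net=host apache/beam_spark_job_server:latest \
  --spark-master-url=spark://localhost:7077

# Then submit the pipeline through the portable runner.
python3 py_codes/word_count_beam.py \
  --output word_count_output \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --environment_type=LOOPBACK
```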
- Tried
python3 py_codes/word_count_beam.py --output word_count_output --runner=SparkRunner --output_executable_path=jars/beam_word_count.jar
as described in https://issues.apache.org/jira/browse/BEAM-8970
This does create a jar file, but when I submit the jar with spark-submit I get a Docker permission-denied exception, possibly related to https://issues.apache.org/jira/browse/BEAM-6020.
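For completeness, the jar was submitted roughly as follows (deploy mode is an assumption); the permission error presumably arises when the Spark executors try to launch the Python SDK harness as a Docker container under a user without access to the Docker daemon:

```shell
# Submit the Beam-built fat jar to YARN. The executors then attempt to
# start the Python SDK harness in a Docker container, which fails with
# "permission denied" if the Hadoop user cannot talk to the Docker daemon.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  jars/beam_word_count.jar
```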
So, is there no way to run a Python Beam pipeline on a Spark cluster under YARN?
This would also mean there is no way to run TFX code (which uses Beam) on a YARN cluster.