Details
- Type: Improvement
- Status: Open
- Priority: P3
- Resolution: Unresolved
- Affects Version: 2.26.0
Description
When using Spark PortableRunner, the job server takes care of translating the Beam pipeline into a Spark job and submitting it to a Spark cluster for execution.
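This submission flow can be sketched as follows (illustrative commands only; the Gradle task name, ports, paths, and flags are assumptions and may differ across Beam versions and deployments):

```shell
# Start the Spark job server, pointing it at an existing Spark cluster
# (assumed Gradle task and property names from the Beam repository)
./gradlew :runners:spark:job-server:runShadow \
  -PsparkMasterUrl=spark://spark-master:7077

# Submit a Python pipeline; the job server translates it into a Spark
# job and submits it to the cluster on the user's behalf
python -m apache_beam.examples.wordcount \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --environment_type=DOCKER \
  --input=/path/to/input.txt \
  --output=/path/to/output
```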
However, even simple jobs (e.g. WordCount) run with low parallelism on an actual Spark cluster, because the stages produced by the job server translation are split into very few tasks (this is described in detail here: https://stackoverflow.com/questions/64878908/low-parallelism-when-running-apache-beam-wordcount-pipeline-on-spark-with-python).
Investigation has shown that the job server explicitly sets the number of partitions for the translated Spark stages based on calls to defaultParallelism, which is not a robust way to infer the number of available executors or to partition Spark jobs (again, see the accepted answer to the above SO question for the detailed explanation: https://stackoverflow.com/questions/64878908/low-parallelism-when-running-apache-beam-wordcount-pipeline-on-spark-with-python/65616752#65616752).
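The timing problem can be illustrated with a small self-contained model (this is not the actual job server code, just a sketch of the failure mode): defaultParallelism is sampled once, at translation time, typically before all of the cluster's executors have registered with the driver.

```python
# Illustrative model of the defaultParallelism pitfall (not Beam source).

def default_parallelism(registered_executor_cores: int) -> int:
    """Mimics Spark's defaultParallelism: roughly the total cores of the
    executors that have registered SO FAR, with a small floor value."""
    return max(registered_executor_cores, 2)


def translate_stage(registered_executor_cores: int) -> int:
    # The job server freezes the partition count here, at translation
    # time, using whatever defaultParallelism reports at that moment.
    return default_parallelism(registered_executor_cores)


# Translation happens early, when no executors have registered yet,
# so the stage is split into only 2 partitions...
partitions = translate_stage(registered_executor_cores=0)
assert partitions == 2

# ...even though 10 executors with 4 cores each register moments later
# and could have processed 40 tasks in parallel.
cluster_cores_after_startup = 10 * 4
assert partitions < cluster_cores_after_startup
```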
As of now, this issue prevents the job server from scaling in a production environment: working around the defaultParallelism behavior requires manually modifying the job server source and recompiling it. Possible (non-exclusive) solutions:
- change the job server logic so that it infers the number of available executors, and hence the number of partitions/tasks for the translated stages, in a more robust way;
- allow the user to configure, via pipeline options, the default parallelism the job server should use when translating jobs (this is what the Flink portable runner already does).
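For reference, the second option mirrors what Flink portable runner users can do today; the Spark counterpart shown in the comment is hypothetical and named purely for illustration:

```shell
# Flink portable runner: parallelism is user-configurable per pipeline
# via the existing --parallelism pipeline option
python -m apache_beam.examples.wordcount \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --parallelism=16 \
  --input=/path/to/input.txt \
  --output=/path/to/output

# A Spark job server equivalent (hypothetical, does not exist today)
# could accept something like:
#   --spark_default_parallelism=16
```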