Spark / SPARK-33380

Incorrect output from example script pi.py


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.4.6
    • Fix Version/s: None
    • Component/s: Examples
    • Labels: None

    Description

       
      I have Apache Spark v2.4.6 installed on my mini cluster of 1 driver and 2 worker nodes. To test the installation, I ran the $SPARK_HOME/examples/src/main/python/pi.py script included with Spark-2.4.6. Three runs produced the following output:
       
      m4-nn:~:spark-submit  --master spark://10.0.0.20:7077  /usr/local/spark/examples/src/main/python/pi.py
      Pi is roughly 3.149880
      m4-nn:~:spark-submit  --master spark://10.0.0.20:7077  /usr/local/spark/examples/src/main/python/pi.py
      Pi is roughly 3.137760
      m4-nn:~:spark-submit  --master spark://10.0.0.20:7077  /usr/local/spark/examples/src/main/python/pi.py
      Pi is roughly 3.155640
       
      I noted that the computed value of Pi varies with each run.
      Next, I ran the same script three more times with a higher number of partitions (16) and observed the following output:

      m4-nn:~:spark-submit  --master spark://10.0.0.20:7077  /usr/local/spark/examples/src/main/python/pi.py 16
      Pi is roughly 3.141100
      m4-nn:~:spark-submit  --master spark://10.0.0.20:7077  /usr/local/spark/examples/src/main/python/pi.py 16
      Pi is roughly 3.137720
      m4-nn:~:spark-submit  --master spark://10.0.0.20:7077  /usr/local/spark/examples/src/main/python/pi.py 16
      Pi is roughly 3.145660
       
      Again, I noted that the computed value of Pi varies with each run. 
       
      IMO, there are 2 issues with this example script:
      1. The output (value of pi) is non-deterministic because the script uses random.random(). 
      2. Specifying the number of partitions (accepted as a command-line argument) has no observable positive impact on the accuracy or precision. 
       
      It may be argued that the intent of these example scripts is simply to demonstrate how to use Spark and to offer a quick way to verify an installation. However, we can achieve that objective without compromising the accuracy or determinism of the computed value. Unless the user examines the script and understands that the use of random.random() (to generate random points within the top right quadrant of the circle) is the reason for the non-determinism, it is confusing at first that the value varies per run and is also inaccurate. Someone may (incorrectly) infer that this is a limitation of the framework!
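      For illustration, here is a minimal local sketch of the same Monte Carlo pattern pi.py uses (this is not the actual pi.py script). The optional seed parameter, which pi.py does not expose, is what would make the estimate repeatable across runs:

      ```python
      import random

      def estimate_pi(n, seed=None):
          # Monte Carlo: sample n points in the unit square and count the
          # fraction that fall inside the inscribed quarter circle; that
          # fraction approximates pi/4.
          rng = random.Random(seed)
          inside = sum(
              1 for _ in range(n)
              if rng.random() ** 2 + rng.random() ** 2 <= 1.0
          )
          return 4.0 * inside / n

      # Unseeded: a different value on each run, as observed above.
      print("Pi is roughly %f" % estimate_pi(100_000))
      # Seeded: the same value on every run.
      print("Pi is roughly %f" % estimate_pi(100_000, seed=42))
      ```

      Note that increasing the number of partitions only spreads the same number of samples across more tasks; it does not reduce the statistical error, which is why the 16-partition runs above are no more accurate.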
       
      To mitigate this, I wrote an alternate version that computes pi using a partial sum of terms from an infinite series. This script is deterministic and can produce more accurate output if the user configures it to use more terms. To me, that behavior feels intuitive and logical. I will be happy to share it if that is appropriate.
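      The reporter's alternate script is not attached to this issue, but one deterministic formulation along these lines is a partial sum of the Leibniz series, pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...; a hypothetical sketch:

      ```python
      def leibniz_pi(n_terms):
          # Partial sum of the Leibniz series: pi/4 = 1 - 1/3 + 1/5 - ...
          # The truncation error after n terms is bounded by roughly
          # 1/(2n), so using more terms deterministically yields a more
          # accurate result -- unlike the Monte Carlo approach.
          return 4.0 * sum((-1) ** k / (2 * k + 1) for k in range(n_terms))

      print("Pi is roughly %f" % leibniz_pi(1_000_000))
      ```

      Because each term is independent, the sum also parallelizes naturally in Spark in the same style as pi.py, e.g. sc.parallelize(range(n_terms), partitions).map(term).sum(), where the partition count affects only parallelism, never the result.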
       
      Best regards,
      Milind
       

      People

        Assignee: Unassigned
        Reporter: Milind V Damle (milindvdamle)
        Votes: 0
        Watchers: 3
