Spark / SPARK-33380

Incorrect output from example script pi.py


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.4.6
    • Fix Version/s: None
    • Component/s: Examples
    • Labels: None

    Description

       
      I have Apache Spark v2.4.6 installed on my mini cluster of 1 driver and 2 worker nodes. To test the installation, I ran the $SPARK_HOME/examples/src/main/python/pi.py script included with Spark-2.4.6. Three runs produced the following output:
       
      m4-nn:~:spark-submit  --master spark://10.0.0.20:7077  /usr/local/spark/examples/src/main/python/pi.py
      Pi is roughly 3.149880
      m4-nn:~:spark-submit  --master spark://10.0.0.20:7077  /usr/local/spark/examples/src/main/python/pi.py
      Pi is roughly 3.137760
      m4-nn:~:spark-submit  --master spark://10.0.0.20:7077  /usr/local/spark/examples/src/main/python/pi.py
      Pi is roughly 3.155640
       
      I noted that the computed value of Pi varies with each run.
      Next, I ran the same script three more times with a higher number of partitions (16) and observed the following output:

      m4-nn:~:spark-submit  --master spark://10.0.0.20:7077  /usr/local/spark/examples/src/main/python/pi.py 16
      Pi is roughly 3.141100
      m4-nn:~:spark-submit  --master spark://10.0.0.20:7077  /usr/local/spark/examples/src/main/python/pi.py 16
      Pi is roughly 3.137720
      m4-nn:~:spark-submit  --master spark://10.0.0.20:7077  /usr/local/spark/examples/src/main/python/pi.py 16
      Pi is roughly 3.145660
       
      Again, I noted that the computed value of Pi varies with each run. 
       
      IMO, there are 2 issues with this example script:
      1. The output (value of pi) is non-deterministic because the script uses random.random(). 
      2. Specifying the number of partitions (accepted as a command-line argument) has no observable positive impact on the accuracy or precision. 
       
      It may be argued that the intent of these example scripts is simply to demonstrate how to use Spark and to offer a quick way to verify an installation. However, we can achieve that objective without compromising the accuracy or determinism of the computed value. Unless the user examines the script and understands that the use of random.random() (to generate random points within the top right quadrant of the circle) is the reason for the non-determinism, it is confusing at first that the value varies per run and is also inaccurate. Someone may (incorrectly) infer that this is a limitation of the framework!
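      For illustration, here is a minimal local sketch of the same Monte Carlo pattern pi.py uses (this is not the actual pi.py script). The optional seed parameter, which pi.py does not expose, is what would make the estimate repeatable across runs:

      ```python
      import random

      def estimate_pi(n, seed=None):
          # Monte Carlo: sample n points in the unit square and count the
          # fraction that fall inside the inscribed quarter circle; that
          # fraction approximates pi/4.
          rng = random.Random(seed)
          inside = sum(
              1 for _ in range(n)
              if rng.random() ** 2 + rng.random() ** 2 <= 1.0
          )
          return 4.0 * inside / n

      # Unseeded: a different value on each run, as observed above.
      print("Pi is roughly %f" % estimate_pi(100_000))
      # Seeded: the same value on every run.
      print("Pi is roughly %f" % estimate_pi(100_000, seed=42))
      ```

      Note that increasing the number of partitions only spreads the same number of samples across more tasks; it does not reduce the statistical error, which is why the 16-partition runs above are no more accurate.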
       
      To mitigate this, I wrote an alternate version that computes pi using a partial sum of terms from an infinite series. This script is deterministic and can produce more accurate output if the user configures it to use more terms. To me, that behavior feels intuitive and logical. I will be happy to share it if that is appropriate.
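      The reporter's alternate script is not attached to this issue, but one deterministic formulation along these lines is a partial sum of the Leibniz series, pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...; a hypothetical sketch:

      ```python
      def leibniz_pi(n_terms):
          # Partial sum of the Leibniz series: pi/4 = 1 - 1/3 + 1/5 - ...
          # The truncation error after n terms is bounded by roughly
          # 1/(2n), so using more terms deterministically yields a more
          # accurate result -- unlike the Monte Carlo approach.
          return 4.0 * sum((-1) ** k / (2 * k + 1) for k in range(n_terms))

      print("Pi is roughly %f" % leibniz_pi(1_000_000))
      ```

      Because each term is independent, the sum also parallelizes naturally in Spark in the same style as pi.py, e.g. sc.parallelize(range(n_terms), partitions).map(term).sum(), where the partition count affects only parallelism, never the result.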
       
      Best regards,
      Milind
       

      People

        Assignee: Unassigned
        Reporter: Milind V Damle (milindvdamle)
        Votes: 0
        Watchers: 3
