Spark / SPARK-26906

Pyspark RDD Replication Potentially Not Working

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.3.2
    • Fix Version/s: None
    • Component/s: PySpark, Web UI
    • Labels: None

    Description

      PySpark RDD replication does not appear to work. Even in a simple example, the UI reports only 1x replication although the RDD is persisted with a 2x-replicated storage level (DISK_ONLY_2):

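      import pyspark  # `sc` is the SparkContext provided by the pyspark shell

      rdd = sc.range(10**9)
      mapped = rdd.map(lambda x: x)
      # Persist to disk only, requesting two replicas of each partition
      mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)  # PythonRDD[1] at RDD at PythonRDD.scala:52

      mapped.count()  # action that triggers computation and persistence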

       

      Interestingly, if you load the UI page at just the right moment, the RDD is initially shown as 2x replicated but ends up shown as 1x replicated. It may be that the RDD is in fact replicated and only the UI fails to report it.
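One way to tell a real replication problem from a display problem is to read the reported storage level outside the web page. The following is a minimal sketch, not part of the original report. It assumes the monitoring REST API endpoint `/api/v1/applications/[app-id]/storage/rdd` returns one JSON object per cached RDD, with an `id` field and a `storageLevel` description string such as "Disk Serialized 2x Replicated". The helper names and the base URL are hypothetical.

```python
import json
import re
from urllib.request import urlopen


def replication_factor(description):
    """Extract N from a description such as 'Disk Serialized 2x Replicated'."""
    match = re.search(r"(\d+)x Replicated", description)
    if match is None:
        raise ValueError(f"no replication factor in {description!r}")
    return int(match.group(1))


def reported_replication(base_url, app_id):
    """Return {rdd_id: replication factor} as reported by the REST API.

    base_url is the driver UI address, e.g. 'http://localhost:4040'
    (assumed here; adjust to the actual deployment).
    """
    url = f"{base_url}/api/v1/applications/{app_id}/storage/rdd"
    with urlopen(url) as response:
        rdds = json.load(response)
    # Field names 'id' and 'storageLevel' are assumptions about the payload.
    return {rdd["id"]: replication_factor(rdd["storageLevel"]) for rdd in rdds}
```

In the pyspark shell this could be called as `reported_replication("http://localhost:4040", sc.applicationId)` after `mapped.count()` finishes, and compared with `mapped.getStorageLevel().replication`, which should reflect the level requested at persist time. If the two disagree, that would suggest the issue lies in what is reported rather than in what was requested.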

      Attachments

        1. spark_ui.png
          333 kB
          Han Altae-Tran

          People

            Assignee: Unassigned
            Reporter: Han Altae-Tran (altaeth)
            Votes: 0
            Watchers: 2
