Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Cannot Reproduce
- Affects Version: 2.3.2
- Fix Version: None
- Component: None
Environment: Google Cloud Dataproc 1.3.19-deb9 (2018/12/14), which ships Spark 2.3.2 and Hadoop 2.9.0, on Debian 9 with Python 3.7. The PySpark shell is started with pyspark --num-executors=100
Description
PySpark RDD replication does not seem to be working correctly. Even in a simple example, the Spark UI reports only 1x replication despite the RDD being persisted with a 2x storage level:
rdd = sc.range(10**9)
mapped = rdd.map(lambda x: x)
mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)
# PythonRDD[1] at RDD at PythonRDD.scala:52
mapped.count()
Interestingly, if you load the UI page at just the right moment, you can see that the RDD starts out shown as 2x replicated but ends up shown as 1x replicated afterward. Perhaps the RDD really is replicated, and it is only the UI that fails to register this.
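One way to cross-check the UI is to read the storage level back from the RDD itself: on a cluster, `mapped.getStorageLevel()` returns the level the RDD was persisted with, including its replication factor. The sketch below mirrors `pyspark.StorageLevel`'s constructor fields (useDisk, useMemory, useOffHeap, deserialized, replication) as a plain namedtuple so it runs without a Spark cluster; the `DISK_ONLY_2` value matches PySpark's definition.

```python
from collections import namedtuple

# Mirror of pyspark.StorageLevel's fields so this runs without Spark.
StorageLevel = namedtuple(
    "StorageLevel",
    ["useDisk", "useMemory", "useOffHeap", "deserialized", "replication"],
)

# DISK_ONLY_2 as PySpark defines it: disk only, two replicas requested.
DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)

# On a real cluster, mapped.getStorageLevel() after persist() would
# report replication == 2, confirming that 2x replication was at least
# *requested*, regardless of what the Storage tab displays.
print(DISK_ONLY_2.replication)  # 2
```

If `getStorageLevel()` reports replication 2 while the Storage tab shows 1x, that points at a UI/reporting discrepancy rather than persist() ignoring the flag.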