Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Not A Problem
- Affects Version(s): 2.2.0, 2.2.1
- Fix Version(s): None
- Component(s): None
Description
Running in (py)spark 2.2.
Marking this as PySpark, but I have not confirmed whether it is Spark-wide; I've observed it in PySpark, which is my preferred API.
df = spark.sql( """ select <redacted> from <inputtbl> where <conditions """ ) df.coalesce(32).write.parquet(...)
The above code will use only 32 tasks to read and process all of the original input data. Compare this with
df = spark.sql( """ select <redacted> from <inputtbl> where <conditions """ ).cache() df.count() df.coalesce(32).write.parquet(...)
where this will use the full complement of tasks available to the cluster for the initial filter, followed by a shuffle to coalesce and write. The latter execution path is far more efficient, particularly at large volumes where filtering removes most records, and it should be the default.

Note that in the real setting in which I am running this, I'm operating a 20-node cluster with 16 cores and 56 GB RAM per machine, and processing well over a TB of raw data in <inputtbl>. At the scale I am testing, the latter version of the code generates approximately 300,000 read tasks when not constrained by the former's execution plan.