[SPARK-35066] Spark 3.1.1 is slower than 3.0.2 by 4-5 times


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.1
    • Fix Version/s: None
    • Component/s: ML, SQL
    • Labels: None
    • Environment:

      Spark/PySpark: 3.1.1

      Language: Python 3.7.x / Scala 2.12

      OS: macOS, Linux, and Windows

      Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1

    Description

      Hi,

The following code snippet runs 4-5 times slower on Apache Spark/PySpark 3.1.1 compared to Apache Spark/PySpark 3.0.2:

       

from pyspark.sql import SparkSession
      from pyspark.sql.functions import col, count, explode
      from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

      spark = SparkSession.builder \
          .master("local[*]") \
          .config("spark.driver.memory", "16G") \
          .config("spark.driver.maxResultSize", "0") \
          .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
          .config("spark.kryoserializer.buffer.max", "2000m") \
          .getOrCreate()

      # read the parquet dataset and repartition it into 12 partitions
      Toys = spark.read.parquet('./toys-cleaned').repartition(12)

      # tokenize the text
      regexTokenizer = RegexTokenizer(inputCol="reviewText",
                                      outputCol="all_words", pattern="\\W")
      toys_with_words = regexTokenizer.transform(Toys)

      # remove stop words
      remover = StopWordsRemover(inputCol="all_words", outputCol="words")
      toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

      # explode to one word per row
      all_words = toys_with_tokens.select(explode("words").alias("word"))

      # group by, sort and limit to 50k
      top50k = all_words.groupBy("word").agg(count("*").alias("total")) \
          .sort(col("total").desc()).limit(50000)

      top50k.show()
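
      The 4-5x figure comes from comparing wall-clock time of this job on both versions; the exact harness used is not part of this report, but a minimal sketch of how one might measure it:

      import time

      # Rough wall-clock timing of the final action; run the same script
      # on 3.0.2 and 3.1.1 and compare (a sketch, not the exact harness
      # behind the 4-5x numbers above).
      start = time.perf_counter()
      top50k.show()
      elapsed = time.perf_counter() - start
      print(f"Spark {spark.version}: {elapsed:.1f}s")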
      

       

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 partitions are respected: all 12 tasks are processed in parallel. However, in Spark/PySpark 3.1.1, even though there are still 12 tasks, 10 of them finish immediately and only 2 do the actual work. (I have tried disabling a couple of configs that looked related, but none of them helped.)
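
      The report does not name which configs were tried; as a sketch only, these are the kinds of settings one might inspect or disable when 12 tasks collapse to 2, assuming adaptive query execution's shuffle-partition coalescing is a plausible suspect (not confirmed as the cause here):

      # Sketch: inspect AQE-related settings that can merge shuffle
      # partitions (assumed suspects only; not confirmed as the cause).
      print(spark.version)
      print(spark.conf.get("spark.sql.adaptive.enabled", "unset"))
      print(spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled", "unset"))
      print(spark.conf.get("spark.sql.shuffle.partitions", "unset"))

      # Disable AQE entirely and re-run to see whether all 12 tasks
      # stay busy again:
      spark.conf.set("spark.sql.adaptive.enabled", "false")

      # Confirm the explicit .repartition(12) actually produced
      # 12 partitions before the groupBy:
      print(all_words.rdd.getNumPartitions())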

Screenshot of Spark 3.1.1 tasks (see attachments):

      Screenshot of Spark 3.0.2 tasks (see attachments):

      For a longer discussion: Spark User List

       

You can reproduce this large performance difference between Spark 3.1.1 and Spark 3.0.2 by running the shared code against any dataset large enough for the job to take longer than a minute. I am not sure whether this is related to SQL, to some Spark config that already existed in 3.x but only came into effect in 3.1.1, or to .transform in Spark ML.
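
      The toys-cleaned parquet is not attached, so as a hypothetical stand-in one could generate a synthetic dataset with a reviewText column, which is the only column the snippet above depends on:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.master("local[*]").getOrCreate()

      # Hypothetical stand-in for ./toys-cleaned: a few million rows
      # with a reviewText column; any sufficiently large text dataset
      # should reproduce the difference.
      spark.range(5_000_000).selectExpr(
          "concat('this toy is great fun for kids and adults row ', id) "
          "AS reviewText"
      ).write.mode("overwrite").parquet("./toys-cleaned")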

      Attachments

        1. image-2022-03-17-17-18-36-793.png
          139 kB
          Maziyar PANAHI
        2. image-2022-03-17-17-19-11-655.png
          132 kB
          Maziyar PANAHI
        3. image-2022-03-17-17-19-34-906.png
          124 kB
          Maziyar PANAHI
        4. Screenshot 2021-04-07 at 11.15.48.png
          139 kB
          Maziyar PANAHI
        5. Screenshot 2021-04-08 at 15.08.09.png
          124 kB
          Maziyar PANAHI
        6. Screenshot 2021-04-08 at 15.13.19.png
          132 kB
          Maziyar PANAHI
        7. Screenshot 2021-04-08 at 15.13.19-1.png
          132 kB
          Maziyar PANAHI


People

            Assignee: Unassigned
            Reporter: Maziyar PANAHI (maziyar)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated: