[SPARK-35066] Spark 3.1.1 is slower than 3.0.2 by 4-5 times


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.1
    • Fix Version/s: None
    • Component/s: ML, SQL
    • Labels: None
    • Environment:

      Spark/PySpark: 3.1.1

      Language: Python 3.7.x / Scala 2.12

      OS: macOS, Linux, and Windows

      Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1

    Description

      Hi,

The following code snippet runs 4-5 times slower on Apache Spark/PySpark 3.1.1 compared to Apache Spark/PySpark 3.0.2:

       

from pyspark.sql import SparkSession
      from pyspark.sql.functions import col, count, explode
      from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

      spark = SparkSession.builder \
          .master("local[*]") \
          .config("spark.driver.memory", "16G") \
          .config("spark.driver.maxResultSize", "0") \
          .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
          .config("spark.kryoserializer.buffer.max", "2000m") \
          .getOrCreate()

      # read the parquet dataset and repartition it into 12 partitions
      Toys = spark.read.parquet('./toys-cleaned').repartition(12)

      # tokenize the text
      regexTokenizer = RegexTokenizer(inputCol="reviewText",
                                      outputCol="all_words", pattern="\\W")
      toys_with_words = regexTokenizer.transform(Toys)

      # remove stop words
      remover = StopWordsRemover(inputCol="all_words", outputCol="words")
      toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

      # explode to one word per row
      all_words = toys_with_tokens.select(explode("words").alias("word"))

      # group by, sort and limit to 50k
      top50k = all_words.groupBy("word").agg(count("*").alias("total")) \
          .sort(col("total").desc()).limit(50000)

      top50k.show()
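
      The 4-5x figure comes from comparing wall-clock time of this job on both versions; the exact harness used is not part of this report, but a minimal sketch of how one might measure it:

      import time

      # Rough wall-clock timing of the final action; run the same script
      # on 3.0.2 and 3.1.1 and compare (a sketch, not the exact harness
      # behind the 4-5x numbers above).
      start = time.perf_counter()
      top50k.show()
      elapsed = time.perf_counter() - start
      print(f"Spark {spark.version}: {elapsed:.1f}s")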
      

       

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 partitions are respected: all 12 tasks are processed in parallel. However, in Spark/PySpark 3.1.1, even though there are still 12 tasks, 10 of them finish immediately and only 2 do the actual work. (I have tried disabling a couple of configs that looked related, but none of them helped.)
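
      The report does not name which configs were tried; as a sketch only, these are the kinds of settings one might inspect or disable when 12 tasks collapse to 2, assuming adaptive query execution's shuffle-partition coalescing is a plausible suspect (not confirmed as the cause here):

      # Sketch: inspect AQE-related settings that can merge shuffle
      # partitions (assumed suspects only; not confirmed as the cause).
      print(spark.version)
      print(spark.conf.get("spark.sql.adaptive.enabled", "unset"))
      print(spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled", "unset"))
      print(spark.conf.get("spark.sql.shuffle.partitions", "unset"))

      # Disable AQE entirely and re-run to see whether all 12 tasks
      # stay busy again:
      spark.conf.set("spark.sql.adaptive.enabled", "false")

      # Confirm the explicit .repartition(12) actually produced
      # 12 partitions before the groupBy:
      print(all_words.rdd.getNumPartitions())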

Screenshot of Spark 3.1.1 tasks (see attachments):

      Screenshot of Spark 3.0.2 tasks (see attachments):

      For a longer discussion: Spark User List

       

You can reproduce this large performance difference between Spark 3.1.1 and Spark 3.0.2 by running the shared code against any dataset large enough for the job to take longer than a minute. I am not sure whether this is related to SQL, to some Spark config that already existed in 3.x but only came into effect in 3.1.1, or to .transform in Spark ML.
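
      The toys-cleaned parquet is not attached, so as a hypothetical stand-in one could generate a synthetic dataset with a reviewText column, which is the only column the snippet above depends on:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.master("local[*]").getOrCreate()

      # Hypothetical stand-in for ./toys-cleaned: a few million rows
      # with a reviewText column; any sufficiently large text dataset
      # should reproduce the difference.
      spark.range(5_000_000).selectExpr(
          "concat('this toy is great fun for kids and adults row ', id) "
          "AS reviewText"
      ).write.mode("overwrite").parquet("./toys-cleaned")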

      Attachments

        1. image-2022-03-17-17-18-36-793.png
          139 kB
          Maziyar PANAHI
        2. image-2022-03-17-17-19-11-655.png
          132 kB
          Maziyar PANAHI
        3. image-2022-03-17-17-19-34-906.png
          124 kB
          Maziyar PANAHI
        4. Screenshot 2021-04-07 at 11.15.48.png
          139 kB
          Maziyar PANAHI
        5. Screenshot 2021-04-08 at 15.08.09.png
          124 kB
          Maziyar PANAHI
        6. Screenshot 2021-04-08 at 15.13.19.png
          132 kB
          Maziyar PANAHI
        7. Screenshot 2021-04-08 at 15.13.19-1.png
          132 kB
          Maziyar PANAHI


People

            Assignee: Unassigned
            Reporter: Maziyar PANAHI (maziyar)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated: