Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-42760

The partition of result data frame of join is always 1

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Not A Problem
    • 3.3.2
    • None
    • PySpark, SQL
    • None
    • standard spark 3.0.3/3.3.2, using in jupyter notebook, local mode

    Description

      I am using pyspark. The partition of result data frame of join is always 1.

      Here is my code from https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join

       

      print(spark.version)

      def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
          spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
          spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
          df1 = spark.range(1, 1000).repartition(data_partitions)
          df2 = spark.range(1, 2000).repartition(data_partitions)
          df3 = spark.range(1, 3000).repartition(data_partitions)

          print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
          print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))

          df = (df1.join(df2, df1.id == df2.id)
                .join(df3, df1.id == df3.id))

          print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))

      example_shuffle_partitions()

       

      In Spark 3.0.3, it prints out:
      3.0.3
      Data partitions is: 10. Shuffle partitions is 4
      Data partitions before join: 10
      Data partitions after join : 4

      However, it prints out the following in the latest 3.3.2
      3.3.2
      Data partitions is: 10. Shuffle partitions is 4
      Data partitions before join: 10
      Data partitions after join : 1

      Attachments

        Activity

          People

            Unassigned Unassigned
            bdaiab binyang
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: