Details
Description
I am using pyspark. The partition of result data frame of join is always 1.
Here is my code from https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
print(spark.version)
def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
df1 = spark.range(1, 1000).repartition(data_partitions)
df2 = spark.range(1, 2000).repartition(data_partitions)
df3 = spark.range(1, 3000).repartition(data_partitions)
print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
df = (df1.join(df2, df1.id == df2.id)
.join(df3, df1.id == df3.id))
print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
example_shuffle_partitions()
In Spark 3.0.3, it prints out:
3.0.3
Data partitions is: 10. Shuffle partitions is 4
Data partitions before join: 10
Data partitions after join : 4
However, it prints out the following in the latest 3.3.2
3.3.2
Data partitions is: 10. Shuffle partitions is 4
Data partitions before join: 10
Data partitions after join : 1