Description
When `assert_true` is used after a `left_outer` join, the assertion exception is raised even though every row meets the condition. Using an `inner` join does not expose this issue.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

session = SparkSession.builder.getOrCreate()

entries = session.createDataFrame(
    [
        ("a", 1),
        ("b", 2),
        ("c", 3),
    ],
    ["id", "outcome_id"],
)

outcomes = session.createDataFrame(
    [
        (1, 12),
        (2, 34),
        (3, 32),
    ],
    ["outcome_id", "outcome_value"],
)

# Inner join works as expected
(
    entries.join(outcomes, on="outcome_id", how="inner")
    .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10))
    .filter(sf.col("valid").isNull())
    .show()
)

# Left join fails with «'('outcome_value > 10)' is not true!» even though it is the case
(
    entries.join(outcomes, on="outcome_id", how="left_outer")
    .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10))
    .filter(sf.col("valid").isNull())
    .show()
)
```
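For what it is worth, inspecting the joined data directly (reusing the `entries` and `outcomes` dataframes from the snippet above) suggests the data itself is fine, i.e. there is no row that actually violates the predicate and no null introduced by the outer join:

```python
# Sanity check: look at the left_outer join output without assert_true.
joined = entries.join(outcomes, on="outcome_id", how="left_outer")

# Rows that explicitly fail the predicate -- expected to be empty.
joined.filter(~(sf.col("outcome_value") > 10)).show()

# Rows where the right side did not match (null outcome_value) -- also expected to be empty.
joined.filter(sf.col("outcome_value").isNull()).show()
```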
Reproduced on `pyspark` versions `3.2.1`, `3.2.0`, `3.1.2` and `3.1.1`. I am not sure whether "native" Spark exposes this issue as well; I don't have the knowledge/setup to test that.
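As a possible workaround, one thing a reader might try is guarding the assertion with an explicit null check so `assert_true` is only evaluated on rows where the right side matched. This is only a sketch on my part; I have not verified that it actually avoids the failure on the versions listed above:

```python
# Unverified workaround sketch: only evaluate assert_true when outcome_value is present.
# sf.when(...) returns null for non-matching rows, and assert_true also returns null when
# the predicate holds, so the subsequent isNull() filter keeps all rows either way.
(
    entries.join(outcomes, on="outcome_id", how="left_outer")
    .withColumn(
        "valid",
        sf.when(
            sf.col("outcome_value").isNotNull(),
            sf.assert_true(sf.col("outcome_value") > 10),
        ),
    )
    .filter(sf.col("valid").isNull())
    .show()
)
```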