Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.5.0, 3.5.1, 3.5.2
- Fix Version/s: None
- Component/s: None
- Environment:
  - Mac M2
  - Python 3.11
  - PySpark 3.5.2 running in local mode, installed via pip
Description
Caching an observed DataFrame blocks metric retrieval when the DataFrame is empty. This issue started in PySpark 3.5.0 and can be reproduced by running the following script, which never completes:
```python
from pyspark.sql import SparkSession, Observation, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([], "f: double")
observation = Observation("count")
observed_df = df.observe(observation, F.count(F.lit(1)))
observed_df.cache().collect()
print(observation.get)
```
The issue can also be reproduced when reading a CSV with 0 records, or when applying additional `select` statements to the observed DataFrame. Removing `cache()` or downgrading to Spark 3.4.3 prints the expected result: `{'count(1)': 0}`.