Spark / SPARK-49218

Caching observed dataframes blocks metric retrieval when the dataframe is empty


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0, 3.5.1, 3.5.2
    • Fix Version/s: None
    • Component/s: PySpark, Spark Core
    • Labels: None
    • Environment:
      • Mac M2
      • Python 3.11
      • PySpark 3.5.2 running in local mode, installed via pip

    Description

      Caching observed dataframes blocks metric retrieval when the dataframe is empty. This issue started in PySpark 3.5.0 and can be reproduced by running the following script, which hangs and never completes:

      from pyspark.sql import SparkSession, Observation, functions as F

      spark = SparkSession.builder.getOrCreate()

      # Empty dataframe with a single double column
      df = spark.createDataFrame([], "f: double")
      observation = Observation("count")
      observed_df = df.observe(observation, F.count(F.lit(1)))

      # Hangs here: with cache(), the observed metric for the empty
      # dataframe is never delivered, so observation.get blocks forever
      observed_df.cache().collect()
      print(observation.get)

       

      The issue can also be reproduced by reading a CSV with 0 records, or by applying additional select statements to the observed dataframe. Removing `cache()`, or downgrading to Spark 3.4.3, prints the expected result: `{'count(1)': 0}`.
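      For reference, the same script with the `cache()` call removed completes and returns the metric, per the behavior described above. A minimal sketch, assuming a local-mode session (the `master("local[1]")` setting is illustrative, not from the report):

      ```python
      from pyspark.sql import SparkSession, Observation, functions as F

      spark = SparkSession.builder.master("local[1]").getOrCreate()

      # Same empty dataframe and observation as in the failing script
      df = spark.createDataFrame([], "f: double")
      observation = Observation("count")
      observed_df = df.observe(observation, F.count(F.lit(1)))

      # Without cache(), collect() triggers the observation normally
      observed_df.collect()

      # Observation.get is a property; it blocks until metrics arrive
      metrics = observation.get  # {'count(1)': 0}
      print(metrics)
      ```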

       

          People

            Assignee: Unassigned
            Reporter: Max Payson (mpayson)
