Spark / SPARK-40549

PYSPARK: Observation computes the wrong results when using `corr` function


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: Important

    Description

      A minimal reproduction of the odd computation results.

      When creating a new `Observation` and computing a simple correlation between two columns, the observed result appears to be non-deterministic: the same query over the same data yields different values across runs.

      # Init
      from pyspark.sql import SparkSession, Observation
      import pyspark.sql.functions as F
      
      spark = SparkSession.builder.getOrCreate()
      
      df = spark.createDataFrame([(float(i), float(i * 10)) for i in range(10)], schema="id double, id2 double")
      
      for i in range(10):
          o = Observation(f"test_{i}")
          df_o = df.observe(o, F.corr("id", "id2").eqNullSafe(1.0))
          df_o.count()  # trigger an action so the observation is computed
          print(o.get)
      
      # Results
      {'(corr(id, id2) <=> 1.0)': False}
      {'(corr(id, id2) <=> 1.0)': False}
      {'(corr(id, id2) <=> 1.0)': False}
      {'(corr(id, id2) <=> 1.0)': True}
      {'(corr(id, id2) <=> 1.0)': True}
      {'(corr(id, id2) <=> 1.0)': True}
      {'(corr(id, id2) <=> 1.0)': True}
      {'(corr(id, id2) <=> 1.0)': True}
      {'(corr(id, id2) <=> 1.0)': True}
      {'(corr(id, id2) <=> 1.0)': False}
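
      A plausible cause (my speculation, not confirmed in this report) is that floating-point addition is not associative, so the partial aggregates Spark combines for `corr` can differ slightly depending on partition/task ordering, making an exact comparison against 1.0 flaky. A minimal pure-Python sketch of the underlying effect (no Spark involved):

      ```python
      # Floating-point addition is not associative: summing the same values
      # in a different order can produce a different result. This mirrors how
      # per-partition partial aggregates combined in varying order could make
      # corr() land infinitesimally off 1.0 on some runs.
      a = [1e16, 1.0, -1e16]

      left_to_right = (a[0] + a[1]) + a[2]  # 1e16 + 1.0 rounds back to 1e16 -> 0.0
      reordered = (a[0] + a[2]) + a[1]      # cancellation first keeps the 1.0 -> 1.0

      print(left_to_right, reordered)  # 0.0 1.0
      ```

      If this is indeed the cause, comparing with a tolerance instead of exact null-safe equality, e.g. `F.abs(F.corr("id", "id2") - F.lit(1.0)) < F.lit(1e-12)`, could be a workaround.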

       


      People

        Assignee: Unassigned
        Reporter: Herminio Vazquez (canimus)
