Spark / SPARK-40549

PYSPARK: Observation computes the wrong results when using `corr` function


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: Important

    Description

      A minimal reproduction of the odd computation results.

      When creating a new `Observation` and computing a simple correlation between two columns, the observed result appears to be non-deterministic: the same query over the same data yields different values across runs.

      # Init
      from pyspark.sql import SparkSession, Observation
      import pyspark.sql.functions as F
      
      spark = SparkSession.builder.getOrCreate()
      
      df = spark.createDataFrame([(float(i), float(i * 10)) for i in range(10)], schema="id double, id2 double")
      
      for i in range(10):
          o = Observation(f"test_{i}")
          df_o = df.observe(o, F.corr("id", "id2").eqNullSafe(1.0))
          df_o.count()  # trigger an action so the observation is computed
          print(o.get)
      
      # Results
      {'(corr(id, id2) <=> 1.0)': False}
      {'(corr(id, id2) <=> 1.0)': False}
      {'(corr(id, id2) <=> 1.0)': False}
      {'(corr(id, id2) <=> 1.0)': True}
      {'(corr(id, id2) <=> 1.0)': True}
      {'(corr(id, id2) <=> 1.0)': True}
      {'(corr(id, id2) <=> 1.0)': True}
      {'(corr(id, id2) <=> 1.0)': True}
      {'(corr(id, id2) <=> 1.0)': True}
      {'(corr(id, id2) <=> 1.0)': False}
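
      A plausible cause (my speculation, not confirmed in this report) is that floating-point addition is not associative, so the partial aggregates Spark combines for `corr` can differ slightly depending on partition/task ordering, making an exact comparison against 1.0 flaky. A minimal pure-Python sketch of the underlying effect (no Spark involved):

      ```python
      # Floating-point addition is not associative: summing the same values
      # in a different order can produce a different result. This mirrors how
      # per-partition partial aggregates combined in varying order could make
      # corr() land infinitesimally off 1.0 on some runs.
      a = [1e16, 1.0, -1e16]

      left_to_right = (a[0] + a[1]) + a[2]  # 1e16 + 1.0 rounds back to 1e16 -> 0.0
      reordered = (a[0] + a[2]) + a[1]      # cancellation first keeps the 1.0 -> 1.0

      print(left_to_right, reordered)  # 0.0 1.0
      ```

      If this is indeed the cause, comparing with a tolerance instead of exact null-safe equality, e.g. `F.abs(F.corr("id", "id2") - F.lit(1.0)) < F.lit(1e-12)`, could be a workaround.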

       


      People

        Assignee: Unassigned
        Reporter: Herminio Vazquez (canimus)
