Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Version: 3.4.0
Description
When calculating the slope and intercept with the regr_slope and regr_intercept aggregates (using the Java API):

spark.sql("drop table if exists tab");
spark.sql("CREATE TABLE tab(y int, x int) using parquet");
spark.sql("INSERT INTO tab VALUES (1, 1)");
spark.sql("INSERT INTO tab VALUES (2, 3)");
spark.sql("INSERT INTO tab VALUES (3, 5)");
spark.sql("INSERT INTO tab VALUES (NULL, 3)");
spark.sql("INSERT INTO tab VALUES (3, NULL)");
spark.sql("SELECT " +
    "regr_slope(x, y), " +
    "regr_intercept(x, y) " +
    "FROM tab").show();
Spark result:
+------------------+--------------------+
|  regr_slope(x, y)|regr_intercept(x, y)|
+------------------+--------------------+
|1.4545454545454546| 0.09090909090909083|
+------------------+--------------------+
The correct answers are obviously 2.0 and -1.0: after discarding the two rows that contain a NULL, the remaining points (y, x) = (1, 1), (2, 3), (3, 5) lie exactly on the line x = 2y - 1.
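The expected values can be checked by hand with the population least-squares formulas over the three fully non-NULL rows. This is a minimal, self-contained sketch; the class and method names are illustrative, not Spark's:

```java
public class RegrByHand {
    // Population least-squares fit: returns {slope, intercept} for the
    // regression of x (dependent) on y (independent), i.e. what
    // regr_slope(x, y) and regr_intercept(x, y) should compute.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;  // 3.0
        my /= n;  // 2.0

        double cov = 0, var = 0;  // population covariance / variance numerators
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            var += (y[i] - my) * (y[i] - my);
        }

        double slope = cov / var;                     // 4 / 2 = 2.0
        return new double[] { slope, mx - slope * my }; // 3 - 2*2 = -1.0
    }

    public static void main(String[] args) {
        // The three rows of the example table with no NULL in either column.
        double[] x = {1, 3, 5};
        double[] y = {1, 2, 3};
        double[] r = fit(x, y);
        System.out.println(r[0] + " " + r[1]);  // 2.0 -1.0
    }
}
```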
Reason:
In sql/catalyst/expressions/aggregate/linearRegression.scala,
case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate
  with ImplicitCastInputTypes with BinaryLike[Expression] {

  private val covarPop = new CovPopulation(right, left)

  private val varPop = new VariancePop(right)
  ......
CovPopulation filters out tuples in which either the right OR the left expression is NULL, but VariancePop only filters out tuples whose right expression is NULL. The two aggregates therefore run over different subsets of the rows, which produces a wrong result whenever some tuple's left value is NULL while its right value is not.
RegrIntercept is wrong for the same reason.
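The mismatch can be reproduced by hand. Assuming (consistently with the numbers reported above) that the slope is evaluated as the covariance accumulator divided by the variance accumulator, the extra row (y=3, x=NULL) inflates only the variance side. The sketch below uses illustrative names, not Spark's internals:

```java
public class RegrMismatch {
    // Rows seen by CovPopulation: both columns must be non-NULL.
    static final double[] X_BOTH = {1, 3, 5};
    static final double[] Y_BOTH = {1, 2, 3};
    // Rows seen by VariancePop(y): only y must be non-NULL -- the extra
    // value 3 comes from the row (y=3, x=NULL).
    static final double[] Y_ONLY = {1, 2, 3, 3};

    static double mean(double[] v) {
        double s = 0;
        for (double d : v) s += d;
        return s / v.length;
    }

    static double buggySlope() {
        double mx = mean(X_BOTH), myBoth = mean(Y_BOTH), myOnly = mean(Y_ONLY);
        double ck = 0;  // covariance numerator over the 3 both-non-NULL rows
        for (int i = 0; i < Y_BOTH.length; i++)
            ck += (X_BOTH[i] - mx) * (Y_BOTH[i] - myBoth);   // 4.0
        double m2 = 0;  // variance numerator over the 4 y-non-NULL rows
        for (double yv : Y_ONLY)
            m2 += (yv - myOnly) * (yv - myOnly);             // 2.75
        return ck / m2;  // 4 / 2.75 = 1.4545454545454546
    }

    public static void main(String[] args) {
        System.out.println(buggySlope());  // matches Spark's wrong answer
    }
}
```

Restricting the variance to the same 3 rows as the covariance gives 4 / 2 = 2.0, the correct slope.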
A possible fix:
case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate
  with ImplicitCastInputTypes with BinaryLike[Expression] {

  private val covarPop = new CovPopulation(right, left)

  private val varPop = new VariancePop(If(And(IsNotNull(left), IsNotNull(right)),
    right, Literal.create(null, right.dataType)))
  .....
The same fix applies to RegrIntercept.