  Spark / SPARK-37055

Apply 'compute.eager_check' across all the codebase

Details

    • Type: Umbrella
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      Following hyukjin.kwon's guidance:

      1. Make every input validation like this covered by the new configuration (a runnable sketch of the utility follows this list). For example:

      - a == b
      + def eager_check(f):  # Utility function
      +     return not config.compute.eager_check or f()
      +
      + eager_check(lambda: a == b)
      

      2. We should check whether the output still makes sense even though the behaviour does not match pandas'. If the output does not make sense, we should not cover it with this configuration.

      3. Make this configuration enabled by default so that we match pandas' behaviour by default.
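
      A minimal runnable sketch of the proposed utility (the option name comes from this issue and is assumed to be registered; get_option is the existing pyspark.pandas config helper; the call site is illustrative only):

      from pyspark.pandas.config import get_option

      def eager_check(f):
          # Run the (potentially expensive) validation `f` only when
          # 'compute.eager_check' is enabled; otherwise assume it passes.
          return not get_option("compute.eager_check") or f()

      # Illustrative call site:
      # if not eager_check(lambda: self_sorted.equals(other_sorted)):
      #     raise ValueError("inputs must contain the same values")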


      We have to make sure to list which APIs are affected in the description of 'compute.eager_check'.

          Activity

            gurwls223 Hyukjin Kwon added a comment - - edited

            dchvn please feel free to create JIRAs as sub-tasks, and proceed. This will be one of the large items in Spark 3.3.

            gurwls223 Hyukjin Kwon added a comment - - edited

            Oh, just for extra clarification: we should only do this when the check triggers Spark jobs (e.g., on a Series or DataFrame). For scalar or primitive values, we don't need to guard it.
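
            To illustrate the distinction (hypothetical snippets; psser1 and psser2 are pandas-on-Spark Series, and eager_check is the utility proposed in the description):

            # Needs the guard: comparing two Series triggers Spark jobs.
            eager_check(lambda: (psser1 == psser2).all())

            # No guard needed: a plain Python scalar check is free.
            assert axis in (0, 1)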

            yikunkero Yikun Jiang added a comment -

            Looks like it's a method-level runtime validation of results between pandas and pandas-on-Spark?

            What's the next step when someone finds that the eager check failed? Report a bug, or fall back to returning the pandas results?

            gurwls223 Hyukjin Kwon added a comment - - edited

            Actually, it's more to prevent running Spark jobs only for the sake of input validation. For example, assume a pandas API requires its input to have the same values:

            def abc(self, df):
                if self.sort_values() != df.sort_values():
                    raise Exception("all values have to be the same")
            

            and assume that the input df contains a very complicated computation chain. For example:

            df = spark.read.csv().sort().repartition().sort().agg(...)
            
            another_df.abc(df)  # would result in computing `df` two times (+ sort each df).
            

            So, this JIRA aims to have the eager check (enabled by default to match with pandas' behaviour) but provide an option to avoid such expensive computation.
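
            For example, a user who already knows the inputs are valid could opt out of the validation (a sketch, assuming the option is exposed through the standard pyspark.pandas options API):

            import pyspark.pandas as ps

            # Skip the eager input validation to avoid the extra Spark jobs.
            ps.set_option("compute.eager_check", False)
            another_df.abc(df)  # `df` is computed once, with no validation pass
            ps.reset_option("compute.eager_check")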

            gurwls223 Hyukjin Kwon added a comment - - edited

            dchvn, just checking - are you working on this?

            dchvn dch nguyen added a comment -

            hyukjin.kwon, no, not at the moment. I could not find anywhere else to apply this conf.

            gurwls223 Hyukjin Kwon added a comment - - edited

            You can, for example, find some instances relying on is_monotonic_increasing (https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/pandas/base.py#L703-L758), which is super expensive, e.g. https://github.com/apache/spark/blob/master/python/pyspark/pandas/series.py#L5219
            gurwls223 Hyukjin Kwon added a comment - - edited

            equals is the same too: https://github.com/apache/spark/blob/master/python/pyspark/pandas/series.py#L5842
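
            A sketch of what the guarded form of such a check could look like (hypothetical call site; get_option is the pyspark.pandas config helper):

            from pyspark.pandas.config import get_option

            # Only pay for the monotonicity check when eager validation is
            # enabled; computing it can launch several Spark jobs.
            if get_option("compute.eager_check") and not index.is_monotonic_increasing:
                raise ValueError("index must be monotonically increasing")
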
            dchvn dch nguyen added a comment -

            Thanks! I will try to address them.


            People

              Assignee: Unassigned
              Reporter: dchvn dch nguyen
              Shepherd: hyukjin.kwon
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated: