  Spark / SPARK-37055

Apply 'compute.eager_check' across all the codebase

Details

    • Type: Umbrella
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      Following hyukjin.kwon's guidance:

      1. Make every input validation like this covered by the new configuration (a runnable sketch of the utility follows this list). For example:

      - a == b
      + def eager_check(f):  # Utility function
      +     return not config.compute.eager_check or f()
      +
      + eager_check(lambda: a == b)
      

      2. We should check whether the output still makes sense even though the behaviour does not match pandas'. If the output does not make sense, we should not cover it with this configuration.

      3. Make this configuration enabled by default so that we match pandas' behaviour by default.
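
      A minimal runnable sketch of the proposed utility (the option name comes from this issue and is assumed to be registered; get_option is the existing pyspark.pandas config helper; the call site is illustrative only):

      from pyspark.pandas.config import get_option

      def eager_check(f):
          # Run the (potentially expensive) validation `f` only when
          # 'compute.eager_check' is enabled; otherwise assume it passes.
          return not get_option("compute.eager_check") or f()

      # Illustrative call site:
      # if not eager_check(lambda: self_sorted.equals(other_sorted)):
      #     raise ValueError("inputs must contain the same values")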


      We have to make sure to list which APIs are affected in the description of 'compute.eager_check'.

          Activity

            gurwls223 Hyukjin Kwon added a comment - - edited

            dchvn please feel free to create JIRAs as sub-tasks, and proceed. This will be one of the large items in Spark 3.3.

            gurwls223 Hyukjin Kwon added a comment - - edited

            Oh, just for extra clarification: we should only do this when the check triggers Spark jobs (e.g., on a Series or DataFrame). For scalar or primitive values, we don't need to guard it.
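
            To illustrate the distinction (hypothetical snippets; psser1 and psser2 are pandas-on-Spark Series, and eager_check is the utility proposed in the description):

            # Needs the guard: comparing two Series triggers Spark jobs.
            eager_check(lambda: (psser1 == psser2).all())

            # No guard needed: a plain Python scalar check is free.
            assert axis in (0, 1)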

            yikunkero Yikun Jiang added a comment -

            Looks like it's a method-level runtime validation of results between pandas and pandas-on-Spark?

            What's the next step when someone finds that the eager check failed? Report a bug, or fall back to returning the pandas results?

            gurwls223 Hyukjin Kwon added a comment - - edited

            Actually, it's more to prevent running Spark jobs only for the sake of input validation. For example, assume a pandas API requires its input to have the same values:

            def abc(self, df):
                if self.sort_values() != df.sort_values():
                    raise Exception("all values have to be the same")
            

            and assume that the input df contains a very complicated computation chain. For example:

            df = spark.read.csv().sort().repartition().sort().agg(...)
            
            another_df.abc(df)  # would result in computing `df` two times (+ sort each df).
            

            So, this JIRA aims to have the eager check (enabled by default to match with pandas' behaviour) but provide an option to avoid such expensive computation.
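
            For example, a user who already knows the inputs are valid could opt out of the validation (a sketch, assuming the option is exposed through the standard pyspark.pandas options API):

            import pyspark.pandas as ps

            # Skip the eager input validation to avoid the extra Spark jobs.
            ps.set_option("compute.eager_check", False)
            another_df.abc(df)  # `df` is computed once, with no validation pass
            ps.reset_option("compute.eager_check")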

            gurwls223 Hyukjin Kwon added a comment - - edited

            dchvn, just checking - are you working on this?

            dchvn dch nguyen added a comment -

            hyukjin.kwon, no, not at the moment. I could not find anywhere else to apply this conf.

            gurwls223 Hyukjin Kwon added a comment - - edited

            You can, for example, find some instances relying on is_monotonic_increasing (https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/pandas/base.py#L703-L758), which is super expensive, e.g. https://github.com/apache/spark/blob/master/python/pyspark/pandas/series.py#L5219
            gurwls223 Hyukjin Kwon added a comment - - edited

            equals is the same too: https://github.com/apache/spark/blob/master/python/pyspark/pandas/series.py#L5842
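
            A sketch of what the guarded form of such a check could look like (hypothetical call site; get_option is the pyspark.pandas config helper):

            from pyspark.pandas.config import get_option

            # Only pay for the monotonicity check when eager validation is
            # enabled; computing it can launch several Spark jobs.
            if get_option("compute.eager_check") and not index.is_monotonic_increasing:
                raise ValueError("index must be monotonically increasing")
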
            dchvn dch nguyen added a comment -

            Thanks! I will try to address them.


            People

              Assignee: Unassigned
              Reporter: dchvn dch nguyen
              Shepherd: hyukjin.kwon
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated: