
SPARK-31945: Make the cache work in more cases for Python UDFs


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.1.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      Currently the cache manager does not use the cached data for a Python UDF if the UDF is created again, even when the underlying function is the same.

      >>> from pyspark.sql.functions import udf
      >>> func = lambda x: x
      >>> df = spark.range(1)
      >>> df.select(udf(func)("id")).cache()
      >>> df.select(udf(func)("id")).explain()
      == Physical Plan ==
      *(2) Project [pythonUDF0#14 AS <lambda>(id)#12]
      +- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#14]
         +- *(1) Range (0, 1, step=1, splits=12)

      Note that the physical plan has no InMemoryTableScan node: the second query re-evaluates the UDF with BatchEvalPython instead of reading the cached data.
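      For contrast, here is a minimal sketch (not part of the original report) of the case where the cache lookup is expected to succeed: wrap the function once and reuse the same UDF object for both the cached query and the later one. This assumes an active PySpark shell with the session bound to spark.

      >>> from pyspark.sql.functions import udf
      >>> func = lambda x: x
      >>> my_udf = udf(func)  # wrap the function once and reuse this object
      >>> df = spark.range(1)
      >>> df.select(my_udf("id")).cache()
      >>> df.select(my_udf("id")).explain()  # expected: the plan reads from InMemoryTableScan instead of re-running BatchEvalPython

      The fix tracked by this issue is meant to make this reuse unnecessary, so that a UDF created again from the same function also matches the cached plan.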


            People

            • Assignee: Takuya Ueshin (ueshin)
            • Reporter: Takuya Ueshin (ueshin)
            • Votes: 0
            • Watchers: 3
