Spark / SPARK-31945

Enable caching for more Python UDFs.


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 3.0.0
    • Fix Version: 3.1.0
    • Component: PySpark
    • Labels: None

    Description

      Currently the cache manager doesn't use the cache for a UDF if the UDF is created again, even if the underlying function is the same.

      >>> func = lambda x: x
      >>> df = spark.range(1)
      >>> df.select(udf(func)("id")).cache()
      >>> df.select(udf(func)("id")).explain()
      == Physical Plan ==
      *(2) Project [pythonUDF0#14 AS <lambda>(id)#12]
      +- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#14]
         +- *(1) Range (0, 1, step=1, splits=12)
      

       

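      The `explain()` output above shows a `BatchEvalPython` node re-evaluating the UDF rather than an `InMemoryTableScan` reading the cached data: each call to `udf(func)` produces a new wrapper object, so the new plan fails to match the cached one. A minimal, Spark-free sketch of why wrapper identity defeats such a cache (`SimpleUDF` is a hypothetical stand-in, not PySpark's actual wrapper class):

      ```python
      class SimpleUDF:
          """Hypothetical stand-in for the wrapper that udf() returns."""
          def __init__(self, func):
              self.func = func

      func = lambda x: x

      # Wrapping the same function twice yields two distinct wrappers.
      u1 = SimpleUDF(func)
      u2 = SimpleUDF(func)

      # Default equality is identity, so a cache keyed on the wrapper misses ...
      assert u1 != u2

      # ... even though both wrappers hold the very same Python function.
      assert u1.func is u2.func
      ```

      One plausible direction for a fix, sketched here rather than taken from the actual patch, is to make the comparison look through the wrapper at the wrapped function (here `u1.func is u2.func`), so that plans built from the same function can match the cached plan.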
      Attachments

        Activity


          People

            Assignee: ueshin Takuya Ueshin
            Reporter: ueshin Takuya Ueshin
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:
