SPARK-17728

UDFs are run too many times

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None
    • Environment: Databricks Cloud / Spark 2.0.0

    Description

      Background

      Longer-running processes might run analytics or contact external services from UDFs. The response might not be a single field, but instead a structure of information. When breaking this structure out into separate columns, it is critical that the query is optimized correctly.

      Steps to Reproduce

      1. Create some sample data.
      2. Create a UDF that returns multiple attributes.
      3. Run UDF over some data.
      4. Create new columns from the multiple attributes.
      5. Observe run time.
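
      The steps above can be sketched without a cluster. What follows is a toy model in plain Python, not Spark code and not the code from the attachment. It assumes (the report does not state this) that the extra calls come from the UDF expression being copied into every column derived from it. The names `lookup`, `run_inlined`, and `CALLS` are hypothetical.

```python
from collections import Counter

CALLS = Counter()  # number of UDF invocations per input row


def lookup(value):
    """Hypothetical UDF: stands in for a slow analytics or service call
    and returns a structure with several attributes."""
    CALLS[value] += 1
    return {"length": len(value), "upper": value.upper()}


def run_inlined(rows, fields):
    """Assumed model of the plan: each derived column carries its own copy
    of the UDF expression, so the UDF is re-evaluated once per column."""
    return [{f: lookup(row)[f] for f in fields} for row in rows]


rows = ["spark", "udf"]                          # step 1: sample data
result = run_inlined(rows, ["length", "upper"])  # steps 2-4: UDF, then new columns
print(result)
print(dict(CALLS))                               # step 5: invocations per row
```

      In this model the invocation count per row equals the number of derived columns, which is the behavior described under Actual Results.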

      Actual Results

      The UDF is executed multiple times per row.

      Expected Results

      The UDF should only be executed once per row.

      Workaround

      Cache the Dataset after UDF execution.
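
      The idea behind the workaround is to materialize the UDF result once and have the derived columns read from that stored result. A toy model in plain Python (again not Spark code; `lookup`, `run_materialized`, and `CALLS` are hypothetical names, and the list `cached` only stands in for a cached Dataset):

```python
from collections import Counter

CALLS = Counter()  # number of UDF invocations per input row


def lookup(value):
    """Hypothetical UDF returning a structure with several attributes."""
    CALLS[value] += 1
    return {"length": len(value), "upper": value.upper()}


def run_materialized(rows, fields):
    """Evaluate the UDF once per row, keep the result, then derive columns
    from the stored structure instead of calling the UDF again."""
    cached = [lookup(row) for row in rows]  # stand-in for caching after the UDF
    return [{f: struct[f] for f in fields} for struct in cached]


result = run_materialized(["spark", "udf"], ["length", "upper"])
print(result)
print(dict(CALLS))
```

      Here each row triggers a single invocation regardless of how many columns are derived, at the cost of holding the intermediate result in memory.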

      Details

      For code and more details, see over_optimized_udf.html

      Attachments

        1. over_optimized_udf.html
          43 kB
          Jacob Eisinger

            People

              Assignee: Unassigned
              Reporter: Jacob Eisinger (jeisinge)
              Votes: 0
              Watchers: 6
