  Spark / SPARK-33005 Kubernetes GA Preparation / SPARK-34115

Long runtime on many environment variables


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0, 2.4.7, 3.0.1
    • Fix Version/s: 3.0.2, 3.1.1
    • Component/s: Spark Core, SQL
    • Labels: None
    • Environment: Spark 2.4.0 local[2] on a Kubernetes Pod

    Description

      I am not sure whether this is a bug report or a feature request. The code is the same in current versions of Spark, so maybe this ticket saves someone some debugging time.

      We migrated some older code to Spark 2.4.0, and suddenly the integration tests on our build machine were much slower than expected.

      On local machines it was running perfectly.

      In the end it turned out that Spark was wasting CPU cycles during DataFrame analysis in the following functions:

      • AnalysisHelper.assertNotAnalysisRule, which calls
      • Utils.isTesting

      Utils.isTesting traverses all environment variables on every call.

      The offending build machine was a Kubernetes Pod which automatically exposed all services as environment variables, so it had more than 3000 environment variables.

      Utils.isTesting is called very often through AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and transformUp), so this overhead adds up quickly.
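
      For illustration, here is a minimal standalone benchmark (our own sketch, not part of Spark; timings will vary by machine). In Scala 2.x, sys.env builds a fresh immutable Map from System.getenv() on every access, so each lookup pays for the whole environment:

      // Micro-benchmark sketch: repeated sys.env lookups vs. one cached map.
      object EnvLookupBench {
        def time[A](label: String)(body: => A): A = {
          val start = System.nanoTime()
          val result = body
          println(f"$label%-18s ${(System.nanoTime() - start) / 1e6}%.1f ms")
          result
        }

        def main(args: Array[String]): Unit = {
          val iterations = 100000
          // Uncached: copies the entire environment map on every call,
          // which is effectively what Utils.isTesting does today.
          time("uncached sys.env") {
            (1 to iterations).count(_ => sys.env.contains("SPARK_TESTING"))
          }
          // Cached: one copy up front, then constant-time lookups.
          val cached = sys.env
          time("cached map") {
            (1 to iterations).count(_ => cached.contains("SPARK_TESTING"))
          }
        }
      }

      The gap between the two timings grows with the number of environment variables, which is why a Pod with 3000+ of them hurts so much.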

       

      Of course we will restrict the number of environment variables on our side, but Utils.isTesting could also cache the result of

      sys.env.contains("SPARK_TESTING")

      in a lazy val so that the check is not that expensive.
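
      For reference, a sketch of the suggested change in org.apache.spark.util.Utils (the commented "before" follows the shape of the Spark 2.4 code; the lazy val is only our suggestion, not necessarily the final patch):

      // Before (Spark 2.4): rebuilds the environment map on every call.
      // def isTesting: Boolean = {
      //   sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")
      // }

      // Suggested: evaluate once on first use. Environment variables
      // cannot change within a running JVM, so caching that half is safe.
      lazy val isTesting: Boolean = {
        sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")
      }

      Caching does freeze the sys.props half as well; if a test toggles spark.testing at runtime, a direct System.getenv("SPARK_TESTING") != null check would avoid the map copy without caching anything.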

      Attachments

        1. spark-bug-34115.tar.gz (1 kB, Norbert Schultz)


          People

            Assignee: nob13 Norbert Schultz
            Reporter: nob13 Norbert Schultz
            Votes: 0
            Watchers: 4
