Description
We are currently forced to use pyspark/daemon.py and pyspark/worker.py in PySpark tests. This does not allow custom modifications to them, and it is sometimes hard to debug what happens inside Python worker processes.
This is also related to SPARK-7721, as coverage is somehow unable to track code executed in the worker processes created via os.fork. With some custom fixes to force coverage collection, it works fine.
This is also related to SPARK-20368. That JIRA describes Sentry support, which (roughly) needs some changes on the worker side. With this simple mechanism, advanced users would be able to implement many pluggable workarounds.
As an example, suppose I configure the module coverage_daemon and have a coverage_daemon.py on the Python path:
import os

from pyspark import daemon

if "COVERAGE_PROCESS_START" in os.environ:
    from pyspark.worker import main

    def _cov_wrapped(*args, **kwargs):
        # Wrap the regular worker entry point so coverage is started before
        # the worker runs and saved once it finishes.
        import coverage
        cov = coverage.coverage(
            config_file=os.environ["COVERAGE_PROCESS_START"])
        cov.start()
        try:
            main(*args, **kwargs)
        finally:
            cov.stop()
            cov.save()

    daemon.worker_main = _cov_wrapped

if __name__ == '__main__':
    daemon.manager()
This way I can leave the main code intact but still plug in such workarounds.
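For context, a minimal sketch of how an application might select such a module, assuming a configuration key along the lines of spark.python.daemon.module is introduced by this proposal (the key name, app name, and .coveragerc path below are placeholders; coverage_daemon.py must also be importable on the executors' PYTHONPATH):

# Sketch only: "spark.python.daemon.module" is the configuration key proposed
# here, not an existing option; names and paths are placeholders.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("coverage-daemon-example")
        # Point the Python daemon at the custom module shown above.
        .set("spark.python.daemon.module", "coverage_daemon")
        # Matches the COVERAGE_PROCESS_START check in coverage_daemon.py.
        .setExecutorEnv("COVERAGE_PROCESS_START", "/path/to/.coveragerc"))

sc = SparkContext(conf=conf)
# Any job that runs Python workers would now go through the wrapped entry point.
print(sc.parallelize(range(10)).map(lambda x: x + 1).sum())
sc.stop()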
Issue Links
- is duplicated by SPARK-20368 Support Sentry on PySpark workers (Resolved)