Description
Creating this ticket because instrumenting the __enter__ and __exit__ magic methods for pandas API on Spark can improve the accuracy of the usage data. In addition, we are interested in extending the pandas-on-Spark usage logger to other PySpark modules in the future, so this will also improve the accuracy of the usage data for those modules.
For example, for the following code:
pdf = pd.DataFrame(
    [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)],
    columns=["dogs", "cats"],
)
psdf = ps.from_pandas(pdf)
with psdf.spark.cache() as cached_df:
    self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
    self.assert_eq(
        repr(cached_df.spark.storage_level),
        repr(StorageLevel(True, True, False, True)),
    )
the pandas-on-Spark usage logger records only the internal call self.spark.unpersist(), because the __enter__ and __exit__ methods of CachedDataFrame are not instrumented.
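The idea can be illustrated with a minimal sketch. This is not the real pandas-on-Spark usage logger; the class name CachedFrame, the log_call decorator, and the USAGE_LOG list are all hypothetical stand-ins. It only shows that once __enter__ and __exit__ are wrapped, a with-statement is attributed to the context-manager entry/exit rather than surfacing only the internal unpersist() call.

```python
import functools

# Hypothetical usage log; the real logger has its own recording machinery.
USAGE_LOG = []

def log_call(func):
    """Record each invocation of the wrapped method (illustrative only)."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        USAGE_LOG.append(f"{type(self).__name__}.{func.__name__}")
        return func(self, *args, **kwargs)
    return wrapper

class CachedFrame:
    """Toy stand-in for CachedDataFrame, used only to show the idea."""

    @log_call
    def __enter__(self):
        return self

    @log_call
    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()  # internal call triggered on context exit
        return False

    @log_call
    def unpersist(self):
        pass

with CachedFrame() as cached_df:
    pass

# With the magic methods instrumented, the log captures __enter__ and
# __exit__ themselves, not just the unpersist() call they trigger.
print(USAGE_LOG)
```

Without the decorators on __enter__ and __exit__, the same with-block would leave only CachedFrame.unpersist in the log, which is the mis-attribution this ticket describes.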