Description
Creating this ticket because instrumenting the __enter__ and __exit__ magic methods for pandas API on Spark can improve the accuracy of the usage data. In addition, we are interested in extending the pandas-on-Spark usage logger to other PySpark modules in the future, so this will also improve the accuracy of the usage data for those modules.
For example, for the following code:
pdf = pd.DataFrame(
    [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)],
    columns=["dogs", "cats"],
)
psdf = ps.from_pandas(pdf)
with psdf.spark.cache() as cached_df:
    self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
    self.assert_eq(
        repr(cached_df.spark.storage_level),
        repr(StorageLevel(True, True, False, True)),
    )
the pandas-on-Spark usage logger records only the internal call self.spark.unpersist(), because the __enter__ and __exit__ methods of CachedDataFrame are not instrumented.
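The idea can be illustrated with a minimal sketch. This is not the real pandas-on-Spark usage logger; the class name CachedFrame, the log_call decorator, and the USAGE_LOG list are all hypothetical stand-ins. It only shows that once __enter__ and __exit__ are wrapped, a with-statement is attributed to the context-manager entry/exit rather than surfacing only the internal unpersist() call.

```python
import functools

# Hypothetical usage log; the real logger has its own recording machinery.
USAGE_LOG = []

def log_call(func):
    """Record each invocation of the wrapped method (illustrative only)."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        USAGE_LOG.append(f"{type(self).__name__}.{func.__name__}")
        return func(self, *args, **kwargs)
    return wrapper

class CachedFrame:
    """Toy stand-in for CachedDataFrame, used only to show the idea."""

    @log_call
    def __enter__(self):
        return self

    @log_call
    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()  # internal call triggered on context exit
        return False

    @log_call
    def unpersist(self):
        pass

with CachedFrame() as cached_df:
    pass

# With the magic methods instrumented, the log captures __enter__ and
# __exit__ themselves, not just the unpersist() call they trigger.
print(USAGE_LOG)
```

Without the decorators on __enter__ and __exit__, the same with-block would leave only CachedFrame.unpersist in the log, which is the mis-attribution this ticket describes.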