Spark / SPARK-38353

Instrument __enter__ and __exit__ magic methods for pandas API on Spark

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.2.1
    • Fix Version/s: 3.3.0, 3.2.2
    • Component/s: PySpark
    • Labels: None

    Description

      This ticket proposes instrumenting the __enter__ and __exit__ magic methods for pandas API on Spark, which can improve the accuracy of the usage data. We are also interested in extending the pandas-on-Spark usage logger to other PySpark modules in the future, so this change will help improve the accuracy of their usage data as well.

      For example, for the following code:

       

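      import pandas as pd
      import pyspark.pandas as ps
      from pyspark import StorageLevel
      from pyspark.pandas.frame import CachedDataFrame

      # Excerpt from a test method: `self` is the test case instance.
      pdf = pd.DataFrame(
          [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
      )
      psdf = ps.from_pandas(pdf)
      
      with psdf.spark.cache() as cached_df:
          self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
          self.assert_eq(
              repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True))
          )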

       

      The pandas-on-Spark usage logger records the internal call self.spark.unpersist() as though it were a user call, because the __enter__ and __exit__ methods of CachedDataFrame are not instrumented.
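
The effect described above can be reproduced with a small toy model. This is only a sketch, assuming the logger records an instrumented call when it is the outermost one on the stack. The names `instrument`, `recorded`, `Cached`, `InstrumentedCached` and `calls_recorded` are hypothetical and are not part of the PySpark API.

```python
import functools
import threading

_state = threading.local()
recorded = []  # names of calls the toy logger treats as user-facing


def instrument(func):
    """Record func only when it is the outermost instrumented call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if getattr(_state, "inside", False):
            # Nested call made from instrumented library code: not recorded.
            return func(*args, **kwargs)
        _state.inside = True
        try:
            recorded.append(func.__qualname__)
            return func(*args, **kwargs)
        finally:
            _state.inside = False

    return wrapper


class Cached:
    """Hypothetical stand-in for CachedDataFrame (not the real class)."""

    @instrument
    def unpersist(self):
        pass

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()  # internal call made on behalf of the user


class InstrumentedCached(Cached):
    """Same class, but with the magic methods instrumented as well."""

    __enter__ = instrument(Cached.__enter__)
    __exit__ = instrument(Cached.__exit__)


def calls_recorded(cls):
    """Run an empty `with` block on cls() and return what was recorded."""
    recorded.clear()
    with cls():
        pass
    return list(recorded)
```

In this model, `calls_recorded(Cached)` contains only the internal `unpersist` call, while `calls_recorded(InstrumentedCached)` contains the `__enter__` and `__exit__` calls the user actually triggered, and the nested `unpersist` call is no longer counted.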

          People

            Assignee: Yihong He (heyihong)
            Reporter: Yihong He (heyihong)