Spark / SPARK-44042

SPIP: PySpark Test Framework


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      PySpark currently has no official test framework; users rely on assorted open-source repositories and blog posts instead. Several of these resources are very popular, which demonstrates user demand for PySpark testing capabilities: spark-testing-base has 1.4k stars, and chispa has 532k downloads/month. However, piecing together disparate resources to write PySpark tests can be confusing (see "The Elephant in the Room: How to Write PySpark Tests"). We can streamline and simplify the testing process by adding test features to PySpark itself, such as a PySpark test base class (which lets tests share a Spark session) and test utility functions (for example, for asserting DataFrame and schema equality). Please see the full SPIP document: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v.

      Sub-Tasks

        1. Add assertDataFrameEqual util function (Sub-task, Resolved, Amanda Liu)
        2. Display percent of unequal rows in DataFrame comparison (Sub-task, Resolved, Amanda Liu)
        3. Add pyspark_testing module for GHA tests (Sub-task, Resolved, Amanda Liu)
        4. Expose assertDataFrameEqual in pyspark.testing.utils (Sub-task, Resolved, Amanda Liu)
        5. Make assertSchemaEqual API public (Sub-task, Resolved, Amanda Liu)
        6. Allow custom precision for fp approx equality (Sub-task, Resolved, Amanda Liu)
        7. Support List[Row] data type for expected DataFrame argument (Sub-task, Resolved, Amanda Liu)
        8. Add checks for expected list type special cases (Sub-task, Resolved, Amanda Liu)
        9. Use difflib to display errors in assertDataFrameEqual (Sub-task, Resolved, Amanda Liu)
        10. Clarify error for unsupported arg data type in assertDataFrameEqual (Sub-task, Resolved, Amanda Liu)
        11. Add support for pandas-on-Spark DataFrame assertDataFrameEqual (Sub-task, Resolved, Amanda Liu)
        12. Fix pandas-on-Spark type checks for assertDataFrameEqual (Sub-task, Resolved, Amanda Liu)
        13. Customize diff log in assertDataFrameEqual error message format (Sub-task, Resolved, Amanda Liu)
        14. Update assertDataFrameEqual docs error example output (Sub-task, Resolved, Amanda Liu)
        15. Add PySparkTestBase unit test class (Sub-task, Open, Unassigned)
        16. Add pyspark.testing to setup.py (Sub-task, Resolved, Amanda Liu)
        17. Support comparison between lists of Rows (Sub-task, Resolved, Amanda Liu)
        18. Publish PySpark Test Guidelines webpage (Sub-task, Resolved, Amanda Liu)
        19. Raise error when only one df is None (Sub-task, Resolved, Amanda Liu)
        20. Add support for pandas DataFrame assertDataFrameEqual (Sub-task, Resolved, Amanda Liu)
        21. Make pandas error class message_parameters strings (Sub-task, Resolved, Amanda Liu)
        22. Fix PySpark testing guide links (Sub-task, Resolved, Amanda Liu)
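
        Several of the sub-tasks above (percent of unequal rows, custom floating-point precision, difflib-based error output, comparison between lists of rows) describe behaviour that can be illustrated without Spark. The following is a plain-Python sketch of those ideas only; it is not PySpark's implementation. The function names, the tuple-based row representation, and the default tolerances are assumptions made for this example.

```python
import difflib
import math


def values_match(a, b, rtol, atol):
    # Floats are compared approximately; everything else must be exactly equal.
    if isinstance(a, float) and isinstance(b, float):
        return math.isclose(a, b, rel_tol=rtol, abs_tol=atol)
    return a == b


def rows_match(r1, r2, rtol, atol):
    return len(r1) == len(r2) and all(
        values_match(a, b, rtol, atol) for a, b in zip(r1, r2)
    )


def compare_rows(actual, expected, rtol=1e-5, atol=1e-8):
    """Return (percent_unequal, diff_text) for two lists of tuple rows.

    Rows are compared position by position; a row missing on either side
    counts as unequal. The tolerances are illustrative defaults.
    """
    total = max(len(actual), len(expected))
    if total == 0:
        return 0.0, ""

    unequal = 0
    for i in range(total):
        missing = i >= len(actual) or i >= len(expected)
        if missing or not rows_match(actual[i], expected[i], rtol, atol):
            unequal += 1

    if unequal == 0:
        return 0.0, ""

    # Render both sides as text and let difflib produce a readable diff.
    diff = "\n".join(
        difflib.unified_diff(
            [repr(r) for r in actual],
            [repr(r) for r in expected],
            fromfile="actual",
            tofile="expected",
            lineterm="",
        )
    )
    return 100.0 * unequal / total, diff
```

        For example, compare_rows([("a", 1.0), ("b", 2.0)], [("a", 1.0000001), ("b", 3.0)]) treats the first pair of rows as equal within tolerance and reports that half of the rows differ, along with a textual diff of the two lists.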

          People

            Unassigned
            Amanda Liu (asl3)
            Holden Karau
