Spark / SPARK-44042

SPIP: PySpark Test Framework

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      Currently, there is no official PySpark test framework, only a scattering of open-source repositories and blog posts. Several of these resources are very popular, which demonstrates user demand for PySpark testing capabilities: spark-testing-base has 1.4k stars, and chispa has 532k downloads per month. However, it can be confusing for users to piece together disparate resources to write their own PySpark tests (see "The Elephant in the Room: How to Write PySpark Tests"). We can streamline and simplify the testing process by incorporating test features such as a PySpark test base class (which allows tests to share Spark sessions) and test utility functions (for example, for asserting DataFrame and schema equality). Please see the full SPIP document attached: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v.

      Sub-Tasks

        1. Add assertDataFrameEqual util function (Sub-task, Resolved, Amanda Liu)
        2. Display percent of unequal rows in DataFrame comparison (Sub-task, Resolved, Amanda Liu)
        3. Add pyspark_testing module for GHA tests (Sub-task, Resolved, Amanda Liu)
        4. Expose assertDataFrameEqual in pyspark.testing.utils (Sub-task, Resolved, Amanda Liu)
        5. Make assertSchemaEqual API public (Sub-task, Resolved, Amanda Liu)
        6. Allow custom precision for fp approx equality (Sub-task, Resolved, Amanda Liu)
        7. Support List[Row] data type for expected DataFrame argument (Sub-task, Resolved, Amanda Liu)
        8. Add checks for expected list type special cases (Sub-task, Resolved, Amanda Liu)
        9. Use difflib to display errors in assertDataFrameEqual (Sub-task, Resolved, Amanda Liu)
        10. Clarify error for unsupported arg data type in assertDataFrameEqual (Sub-task, Resolved, Amanda Liu)
        11. Add support for pandas-on-Spark DataFrame assertDataFrameEqual (Sub-task, Resolved, Amanda Liu)
        12. Fix pandas-on-Spark type checks for assertDataFrameEqual (Sub-task, Resolved, Amanda Liu)
        13. Customize diff log in assertDataFrameEqual error message format (Sub-task, Resolved, Amanda Liu)
        14. Update assertDataFrameEqual docs error example output (Sub-task, Resolved, Amanda Liu)
        15. Add PySparkTestBase unit test class (Sub-task, Open, Unassigned)
        16. Add pyspark.testing to setup.py (Sub-task, Resolved, Amanda Liu)
        17. Support comparison between lists of Rows (Sub-task, Resolved, Amanda Liu)
        18. Publish PySpark Test Guidelines webpage (Sub-task, Resolved, Amanda Liu)
        19. Raise error when only one df is None (Sub-task, Resolved, Amanda Liu)
        20. Add support for pandas DataFrame assertDataFrameEqual (Sub-task, Resolved, Amanda Liu)
        21. Make pandas error class message_parameters strings (Sub-task, Resolved, Amanda Liu)
        22. Fix PySpark testing guide links (Sub-task, Resolved, Amanda Liu)
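
Several of the sub-tasks above concern approximate floating-point equality and reporting the percentage of unequal rows. The sketch below illustrates the idea in plain Python. It is not PySpark's implementation: `rows_close` and `percent_unequal` are hypothetical helper names, rows are modelled as plain tuples, and the default tolerances are assumptions chosen for the example.

```python
import math
from typing import Sequence, Tuple

Row = Tuple[object, ...]


def rows_close(a: Row, b: Row, rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    """Compare two rows field by field, using a tolerance for floats."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if isinstance(x, float) and isinstance(y, float):
            if not math.isclose(x, y, rel_tol=rtol, abs_tol=atol):
                return False
        elif x != y:
            return False
    return True


def percent_unequal(actual: Sequence[Row], expected: Sequence[Row], **tol: float) -> float:
    """Percentage of positions whose rows differ; a missing row counts as unequal."""
    total = max(len(actual), len(expected))
    if total == 0:
        return 0.0
    matches = sum(rows_close(a, e, **tol) for a, e in zip(actual, expected))
    return 100.0 * (total - matches) / total


actual = [(1, 0.1 + 0.2), (2, 5.0), (3, 7.5)]
expected = [(1, 0.3), (2, 5.1), (3, 7.5)]
print(percent_unequal(actual, expected))
print(percent_unequal(actual, expected, rtol=0.05))
```

With the default tolerances only the second row differs, so one of the three rows is reported as unequal. Loosening `rtol` to 0.05 makes all three rows match.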

          People

            Assignee: Unassigned
            Reporter: Amanda Liu (asl3)
            Shepherd: Holden Karau
            Votes: 2
            Watchers: 5

            Dates

              Created:
              Updated: