Description
In the PySpark unit tests, ReusedSQLTestCase.assertPandasEqual in sqlutils is meant to check whether two pandas.DataFrames are equal, but with later versions of pandas it can fail when a DataFrame has an array column. The method can be replaced by assert_frame_equal from pandas.util.testing, which is designed for this purpose and also gives a better assertion message.
The test failure I have seen is:
======================================================================
ERROR: test_supported_types (pyspark.sql.tests.test_pandas_udf_grouped_map.GroupedMapPandasUDFTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/bryan/git/spark/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py", line 128, in test_supported_types
    self.assertPandasEqual(expected1, result1)
  File "/home/bryan/git/spark/python/pyspark/testing/sqlutils.py", line 268, in assertPandasEqual
    self.assertTrue(expected.equals(result), msg=msg)
  File "/home/bryan/miniconda2/envs/pa012/lib/python3.6/site-packages/pandas ...
  File "pandas/_libs/lib.pyx", line 523, in pandas._libs.lib.array_equivalent_object
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
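The replacement proposed in the description can be sketched as follows. This is a minimal illustration, not the actual Spark patch: the make_frame helper and the column names are hypothetical, and the import uses pandas.testing, which is assumed here because it is the public location of assert_frame_equal in newer pandas releases (the description refers to pandas.util.testing).

```python
import numpy as np
import pandas as pd
# Assumption: newer pandas exposes assert_frame_equal under pandas.testing.
from pandas.testing import assert_frame_equal


def make_frame(values):
    # Hypothetical helper: builds a DataFrame whose "arr" column holds
    # one numpy array per row, similar to a Spark ArrayType column.
    return pd.DataFrame({"id": [1, 2], "arr": [np.array(v) for v in values]})


expected = make_frame([[1, 2], [3, 4]])
result = make_frame([[1, 2], [3, 4]])

# Passes silently when the frames match, including the array column.
assert_frame_equal(expected, result)

# A frame that differs in one array element should raise AssertionError
# with a message describing the mismatch.
mismatched = make_frame([[1, 2], [3, 5]])
try:
    assert_frame_equal(expected, mismatched)
    mismatch_detected = False
except AssertionError as exc:
    mismatch_detected = True
    print(exc)
```

Inside assertPandasEqual, the same call could stand in for self.assertTrue(expected.equals(result), msg=msg), since assert_frame_equal raises AssertionError itself and produces its own failure message.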
Issue Links
- is related to SPARK-27276 Increase the minimum pyarrow version to 0.12.1 (Resolved)