Resolution: Not A Problem
Affects Version/s: 2.3.3, 2.4.3
Fix Version/s: None
Environment:
pyarrow 0.10.0 / 0.14.0
Python 2.7 / 3.5 / 3.6
Since Spark 2.3.x, pandas UDFs have been supported, using Arrow as the serialization/deserialization method between the JVM and the Python worker. However, an issue arises with Python >= 3.5.
We use pandas UDFs to process batches of data, but we found that the data is incomplete under Python 3.x. At first we thought our processing logic might be wrong, so we reduced the code to a very simple case, and it showed the same problem. After investigating for a week, we found it is related to pyarrow.
1. Prepare data
The data has seven columns, a, b, c, d, e, f and g, all of type Integer.
Produce 1,000,000 rows, name the file test.csv, upload it to HDFS, then load it and repartition it to 1 partition.
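The actual script and data are in the attachment; a minimal sketch of this step (the HDFS path and the generated integer values here are placeholders, not our real data) would look roughly like:
{code:python}
import csv

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType

# Generate 1,000,000 rows with 7 integer columns (placeholder values).
with open("test.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(list("abcdefg"))
    for i in range(1000000):
        writer.writerow([i] * 7)

# After `hdfs dfs -put test.csv /tmp/test.csv`, load and repartition.
spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField(c, IntegerType()) for c in "abcdefg"])
df = spark.read.csv("hdfs:///tmp/test.csv", header=True, schema=schema)
df = df.repartition(1)
{code}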
2. Register a pandas UDF
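The real UDF is in the attachment; a hypothetical stand-in that is enough to count batches looks like this (Spark 2.3/2.4 pandas_udf API):
{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

# A SCALAR pandas UDF is invoked once per Arrow record batch, so with
# maxRecordsPerBatch=100000 and 1,000,000 rows it should run 10 times.
@pandas_udf("integer", PandasUDFType.SCALAR)
def simple_udf(a):
    print("iterator one time")  # printed in the Python worker output
    return a + 1
{code}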
3. Apply the pandas UDF
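For example, continuing the sketch above (the column choice is arbitrary):
{code:python}
result = df.select(simple_udf(df["a"]).alias("a_plus_one"))
result.collect()  # force full evaluation so every batch passes through the UDF
{code}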
4. Execute it in pyspark (local or yarn)
Run it with the conf spark.sql.execution.arrow.maxRecordsPerBatch=100000. As mentioned before, the total row count is 1,000,000, so it should print "iterator one time" 10 times.
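Since spark.sql.execution.arrow.maxRecordsPerBatch is a runtime SQL conf, it can be set either on the command line or from the session, e.g.:
{code:python}
# Equivalent to launching with
#   pyspark --conf spark.sql.execution.arrow.maxRecordsPerBatch=100000
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "100000")
{code}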
(1) Python 2.7 environment:
The result is correct: "iterator one time" is printed 10 times.
(2) Python 3.5 or 3.6 environment:
The data is incomplete. The exception is printed by JVM-side Spark code that we added; I will explain it below.
The “process done” message was added by us in worker.py.
To capture the exception, we changed the Spark code under core/src/main/scala/org/apache/spark/util/Utils.scala, adding code there to print the exception.
It seems pyspark receives the data from the JVM, but pyarrow reads it incompletely. The pyarrow side thinks the data is finished and shuts down the socket; meanwhile the JVM side is still writing to the same socket and gets a socket-closed exception.
The relevant pyarrow code is in ipc.pxi:
The read_next_batch function gets NULL and concludes the iterator is over.
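The same end-of-stream behavior is visible from Python through the public stream-reader API; the following is only an illustration of the mechanism, not our repro code:
{code:python}
import pyarrow as pa

# Write one batch followed by the end-of-stream marker.
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])] * 7,
                                   names=list("abcdefg"))
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()  # emits the end-of-stream marker

# read_next_batch() raises StopIteration as soon as it reads that marker;
# in the bug, the worker hits this path while the JVM is still writing.
reader = pa.RecordBatchStreamReader(pa.BufferReader(sink.getvalue()))
while True:
    try:
        print("got batch with", reader.read_next_batch().num_rows, "rows")
    except StopIteration:
        break
{code}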
Our environment is Spark 2.4.3; we have tried pyarrow 0.10.0 and 0.14.0 with Python 2.7, 3.5 and 3.6.
With Python 2.7 everything is fine, but after switching to Python 3.5 or 3.6 the data is wrong.
The number of columns is critical to triggering this bug: with fewer than 5 columns it probably will not happen, but with a larger number, for example 7 or above, it happens every time.
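To vary the column count in the repro, only the data generation needs to change; a hypothetical helper (make_df is our placeholder name, not part of the attached code):
{code:python}
from pyspark.sql.types import IntegerType, StructField, StructType

def make_df(spark, num_cols, num_rows=1000000):
    # Build an all-Integer DataFrame with num_cols columns named c0..cN-1.
    schema = StructType([StructField("c%d" % i, IntegerType())
                         for i in range(num_cols)])
    rows = [tuple([i] * num_cols) for i in range(num_rows)]
    return spark.createDataFrame(rows, schema).repartition(1)
{code}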
So we wonder: is there some conflict between Python 3.x and these pyarrow versions?
I have attached the code and data.