Description
PySpark uses java.io.DataOutputStream.writeUTF to send data to the Python worker, which is a problem because writeUTF throws a java.io.UTFDataFormatException for any string whose encoded form exceeds 65535 bytes (64K). Furthermore, a fix is not straightforward, because the Java-to-Python protocol relies on the 2-byte length prefix that writeUTF emits and uses it as the separator between items. The offending write happens in
core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:220
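The limit is easy to reproduce outside of Spark; a minimal Scala sketch (any string longer than 65535 bytes once encoded will do):

    import java.io.{ByteArrayOutputStream, DataOutputStream}

    object WriteUTFLimit {
      def main(args: Array[String]): Unit = {
        val out = new DataOutputStream(new ByteArrayOutputStream())
        val big = "a" * 70000   // encodes to more than 65535 bytes
        out.writeUTF(big)       // throws java.io.UTFDataFormatException: encoded string too long
      }
    }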
and the reliance on that length prefix to separate items can be found in the MUTF8Deserializer class in
python/pyspark/serializers.py:264
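For reference, the framing that writeUTF produces, and that the Python side depends on, is a 2-byte unsigned big-endian length followed by the string bytes. A sketch of reading one item back, written in Scala here for consistency with the snippet above (the Python deserializer does the equivalent with struct on its side of the pipe):

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

    object ReadOneItem {
      def main(args: Array[String]): Unit = {
        val buf = new ByteArrayOutputStream()
        new DataOutputStream(buf).writeUTF("hello")

        val in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray))
        val length = in.readUnsignedShort()   // the 2-byte length prefix, capped at 65535
        val bytes = new Array[Byte](length)
        in.readFully(bytes)                   // the item's payload; the next item starts right after
        println(new String(bytes, "UTF-8"))
      }
    }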
The only solutions I currently have in mind are to change the protocol, either by extending the number of bytes used to specify each item's length, or by adding a flag to every "packet" indicating whether the item is split across multiple parts (although the second option could produce corrupted data if multiple writers share these streams).
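To make the first option concrete, here is a rough sketch of a writer that uses a 4-byte length prefix instead of relying on writeUTF; the method name and the 4-byte width are illustrative assumptions, not an agreed design. The Python deserializer would then read 4 bytes instead of 2 before reading the payload.

    import java.io.DataOutputStream
    import java.nio.charset.StandardCharsets

    object LargeItemProtocol {
      // Illustrative only: prefix each item with a 4-byte length, lifting the
      // 65535-byte cap imposed by writeUTF's 2-byte prefix. The Python side
      // would read the 4-byte length before reading the payload.
      def writeLargeUTF(s: String, out: DataOutputStream): Unit = {
        val bytes = s.getBytes(StandardCharsets.UTF_8)
        out.writeInt(bytes.length)   // 4-byte signed big-endian length
        out.write(bytes)
      }
    }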