Description
PySpark uses java.io.DataOutputStream.writeUTF to send data to the Python worker, which is a problem because writeUTF throws a java.io.UTFDataFormatException for any string whose encoded form exceeds 65535 bytes (64K). Furthermore, a fix is not straightforward, because the Java-to-Python protocol relies on the 2-byte length prefix that writeUTF emits and uses it as the separator between items. The offending write happens in
core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:220
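The limit is easy to reproduce outside of Spark; a minimal Scala sketch (any string longer than 65535 bytes once encoded will do):

    import java.io.{ByteArrayOutputStream, DataOutputStream}

    object WriteUTFLimit {
      def main(args: Array[String]): Unit = {
        val out = new DataOutputStream(new ByteArrayOutputStream())
        val big = "a" * 70000   // encodes to more than 65535 bytes
        out.writeUTF(big)       // throws java.io.UTFDataFormatException: encoded string too long
      }
    }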
and the reliance on that length prefix to separate items can be found in the MUTF8Deserializer class in
python/pyspark/serializers.py:264
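For reference, the framing that writeUTF produces, and that the Python side depends on, is a 2-byte unsigned big-endian length followed by the string bytes. A sketch of reading one item back, written in Scala here for consistency with the snippet above (the Python deserializer does the equivalent with struct on its side of the pipe):

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

    object ReadOneItem {
      def main(args: Array[String]): Unit = {
        val buf = new ByteArrayOutputStream()
        new DataOutputStream(buf).writeUTF("hello")

        val in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray))
        val length = in.readUnsignedShort()   // the 2-byte length prefix, capped at 65535
        val bytes = new Array[Byte](length)
        in.readFully(bytes)                   // the item's payload; the next item starts right after
        println(new String(bytes, "UTF-8"))
      }
    }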
The only solutions I currently have in mind are to change the protocol, either by extending the number of bytes used to specify each item's length, or by adding a flag to every "packet" indicating whether the item is split across multiple parts (although the second option could produce corrupted data if multiple writers share these streams).
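To make the first option concrete, here is a rough sketch of a writer that uses a 4-byte length prefix instead of relying on writeUTF; the method name and the 4-byte width are illustrative assumptions, not an agreed design. The Python deserializer would then read 4 bytes instead of 2 before reading the payload.

    import java.io.DataOutputStream
    import java.nio.charset.StandardCharsets

    object LargeItemProtocol {
      // Illustrative only: prefix each item with a 4-byte length, lifting the
      // 65535-byte cap imposed by writeUTF's 2-byte prefix. The Python side
      // would read the 4-byte length before reading the payload.
      def writeLargeUTF(s: String, out: DataOutputStream): Unit = {
        val bytes = s.getBytes(StandardCharsets.UTF_8)
        out.writeInt(bytes.length)   // 4-byte signed big-endian length
        out.write(bytes)
      }
    }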