SPARK-1043

PySpark RDDs cannot deal with strings greater than 64K bytes.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.1
    • Component/s: PySpark
    • Labels: None

    Description

      PySpark uses java.io.DataOutputStream.writeUTF to send data to the Python worker, which is a problem because writeUTF fails for any string whose encoded form exceeds 64K bytes. Worse, a fix is not straightforward, since the Java-to-Python protocol actually relies on this framing and uses it as the separator between items. The offending write happens in

      core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:220

      and the reliance on this framing to separate items can be found in the MUTF8Deserializer class in

      python/pyspark/serializers.py:264
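
      For context, the 64K limit is inherent to writeUTF's wire format: it emits a two-byte unsigned length prefix before the modified-UTF-8 payload, so any string that encodes to more than 65535 bytes throws java.io.UTFDataFormatException. A minimal standalone Scala snippet reproducing the failure (an illustration, not code from the Spark tree):

        import java.io.{ByteArrayOutputStream, DataOutputStream}

        object WriteUTFLimit {
          def main(args: Array[String]): Unit = {
            val out = new DataOutputStream(new ByteArrayOutputStream())
            // 65535 ASCII chars encode to exactly 65535 bytes: the maximum, succeeds
            out.writeUTF("x" * 65535)
            // One encoded byte more: throws java.io.UTFDataFormatException
            out.writeUTF("x" * 65536)
          }
        }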

      The only solution I currently have in mind would be to change the protocol: either extend the number of bytes used to specify the length of an item, or add a boolean flag to every "packet" indicating whether the item is split into multiple parts (although the second option might result in bad data if multiple things are writing to these streams).
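
      As a sketch of the first option, the framing could use a 4-byte signed length prefix instead of writeUTF's 2-byte one, raising the per-item limit from 64K to roughly 2 GB. The helper names writeItem/readItem below are hypothetical, not part of any Spark API:

        import java.io.{DataInputStream, DataOutputStream}

        object Framing {
          // Write one item: 4-byte big-endian length prefix, then the raw UTF-8 bytes.
          // The length prefix doubles as the separator between items on the stream.
          def writeItem(s: String, out: DataOutputStream): Unit = {
            val bytes = s.getBytes("UTF-8")
            out.writeInt(bytes.length)
            out.write(bytes)
          }

          // Read one item back: length prefix first, then exactly that many bytes.
          def readItem(in: DataInputStream): String = {
            val length = in.readInt()
            val bytes = new Array[Byte](length)
            in.readFully(bytes)
            new String(bytes, "UTF-8")
          }
        }

      The Python side would then read a 4-byte prefix (e.g. struct.unpack('>i', stream.read(4))) instead of the 2-byte one that MUTF8Deserializer expects, which is why the change has to touch both sides of the protocol at once.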


          People

            Assignee: Josh Rosen (joshrosen)
            Reporter: Erik Selin (tyro89)
            Votes: 0
            Watchers: 3
