Description
Steps to reproduce in Python
st = ''.join(['1' for i in range(65537)])
sc.parallelize([st]).saveAsTextFile("testfile")
sc.textFile('testfile').count()
The last line never completes. Looking at the logs (with DEBUG enabled) reveals the exception; here is the stack trace:
14/03/25 15:03:34 INFO PythonRDD: stdin writer to Python finished early
14/03/25 15:03:34 DEBUG PythonRDD: stdin writer to Python finished early
java.io.UTFDataFormatException: encoded string too long: 65537 bytes
	at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
	at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
	at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:222)
	at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:221)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:221)
	at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:81)
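The exception comes from java.io.DataOutputStream.writeUTF, which prefixes the encoded payload with an unsigned 16-bit length and therefore cannot write strings whose encoding exceeds 65535 bytes. A minimal Python sketch of that framing limit (write_utf_frame is a hypothetical illustration, not Spark code):

```python
import struct

def write_utf_frame(s: str) -> bytes:
    """Mimic DataOutputStream.writeUTF framing: a 2-byte big-endian
    length prefix followed by the encoded payload. The 16-bit prefix
    caps the payload at 65535 bytes."""
    payload = s.encode("utf-8")  # modified UTF-8 matches this for ASCII input
    if len(payload) > 0xFFFF:
        # writeUTF raises UTFDataFormatException at this point
        raise ValueError("encoded string too long: %d bytes" % len(payload))
    return struct.pack(">H", len(payload)) + payload

# A 65535-byte string still fits (2-byte header + payload):
print(len(write_utf_frame("1" * 65535)))  # 65537

# The 65537-byte string from the repro above cannot be framed:
try:
    write_utf_frame("1" * 65537)
except ValueError as e:
    print(e)  # encoded string too long: 65537 bytes
```

This is why the repro uses range(65537): any string of 65536 bytes or more triggers the same failure, while shorter strings round-trip fine.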