Description
Steps to reproduce in Python
st = ''.join(['1' for i in range(65537)])
sc.parallelize([st]).saveAsTextFile("testfile")
sc.textFile('testfile').count()
The last line never completes. Looking at the logs (with DEBUG enabled) reveals the exception; here is the stack trace:
14/03/25 15:03:34 INFO PythonRDD: stdin writer to Python finished early
14/03/25 15:03:34 DEBUG PythonRDD: stdin writer to Python finished early
java.io.UTFDataFormatException: encoded string too long: 65537 bytes
	at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
	at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
	at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:222)
	at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:221)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:221)
	at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:81)
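The exception comes from java.io.DataOutputStream.writeUTF, which prefixes the encoded payload with an unsigned 16-bit length and therefore cannot write strings whose encoding exceeds 65535 bytes. A minimal Python sketch of that framing limit (write_utf_frame is a hypothetical illustration, not Spark code):

```python
import struct

def write_utf_frame(s: str) -> bytes:
    """Mimic DataOutputStream.writeUTF framing: a 2-byte big-endian
    length prefix followed by the encoded payload. The 16-bit prefix
    caps the payload at 65535 bytes."""
    payload = s.encode("utf-8")  # modified UTF-8 matches this for ASCII input
    if len(payload) > 0xFFFF:
        # writeUTF raises UTFDataFormatException at this point
        raise ValueError("encoded string too long: %d bytes" % len(payload))
    return struct.pack(">H", len(payload)) + payload

# A 65535-byte string still fits (2-byte header + payload):
print(len(write_utf_frame("1" * 65535)))  # 65537

# The 65537-byte string from the repro above cannot be framed:
try:
    write_utf_frame("1" * 65537)
except ValueError as e:
    print(e)  # encoded string too long: 65537 bytes
```

This is why the repro uses range(65537): any string of 65536 bytes or more triggers the same failure, while shorter strings round-trip fine.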