  1. Spark
  2. SPARK-1323

Job hangs with java.io.UTFDataFormatException when reading strings longer than 65535 bytes.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.1, 1.0.0
    • Component/s: PySpark

    Description

      Steps to reproduce in Python:

      st = '1' * 65537  # one byte more than 65536; encodes to 65537 bytes
      sc.parallelize([st]).saveAsTextFile("testfile")
      sc.textFile('testfile').count()
      

      The last line never completes. The logs (with DEBUG enabled) show the following exception and stack trace:

      14/03/25 15:03:34 INFO PythonRDD: stdin writer to Python finished early
      14/03/25 15:03:34 DEBUG PythonRDD: stdin writer to Python finished early
      java.io.UTFDataFormatException: encoded string too long: 65537 bytes
              at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
              at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
              at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:222)
              at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:221)
              at scala.collection.Iterator$class.foreach(Iterator.scala:727)
              at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
              at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:221)
              at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:81)
      


          People

            Assignee: Unassigned
            Reporter: Karthik (karthikk)
            Votes: 0
            Watchers: 3
