SPARK-19872

UnicodeDecodeError in PySpark on sc.textFile read with repartition

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.1.1, 2.2.0
    • Component/s: PySpark
    • Labels: None
    • Environment: Mac and EC2

    Description

      I'm receiving the following traceback:

      >>> sc.textFile('test.txt').repartition(10).collect()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect
          return list(_load_from_socket(port, self._jrdd_deserializer))
        File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket
          for item in serializer.load_stream(rf):
        File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream
          yield self.loads(stream)
        File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads
          return s.decode("utf-8") if self.use_unicode else s
        File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode
          return codecs.utf_8_decode(input, errors, True)
      UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
      

      I created a text file (test.txt) with standard Linux newlines:

      a
      b
      c
      d
      e
      f
      g
      h
      i
      j
      k
      l
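
      For reference, a minimal sketch to recreate this file (it assumes exactly
      the contents shown above, one letter per line with a trailing newline):

      # Recreate test.txt: letters 'a' through 'l', one per line, Unix newlines.
      with open('test.txt', 'w') as f:
          f.write('\n'.join('abcdefghijkl') + '\n')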
      
      

      I then ran pyspark:

      $ pyspark
      Python 2.7.13 (default, Dec 18 2016, 07:03:39)
      [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
      Setting default log level to "WARN".
      To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
      17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
            /_/
      
      Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
      SparkSession available as 'spark'.
      >>> sc.textFile('test.txt').collect()
      [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
      >>> sc.textFile('test.txt', use_unicode=False).collect()
      ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
      >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
      ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
      >>> sc.textFile('test.txt').repartition(10).collect()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect
          return list(_load_from_socket(port, self._jrdd_deserializer))
        File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket
          for item in serializer.load_stream(rf):
        File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream
          yield self.loads(stream)
        File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads
          return s.decode("utf-8") if self.use_unicode else s
        File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode
          return codecs.utf_8_decode(input, errors, True)
      UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
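
      Note that the elements returned in the use_unicode=False case above are
      pickled batches: the 0x80 byte at position 0 is the pickle protocol-2
      marker. As a quick check (a sketch; the byte string below is copied from
      the output above):

      import pickle

      # Each element that collect() returned is itself a pickled batch of lines.
      batch = b'\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.'
      print(pickle.loads(batch))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']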
      

      This really looks like a bug in the `serializers.py` code: the 0x80 byte at position 0 is the pickle protocol-2 header, which suggests the repartitioned stream contains pickled batches that are then decoded as if they were UTF-8 text.
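
      A possible workaround (untested here, and an assumption on my part) is to
      move the RDD onto the default batched pickle serializer before the shuffle,
      for example with an identity map(), so the serializer used to write the
      shuffled data matches the deserializer collect() uses to read it:

      # Hypothetical workaround: map() re-serializes with the default pickle
      # serializer before repartition() shuffles the data.
      sc.textFile('test.txt').map(lambda line: line).repartition(10).collect()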


    People

    • Assignee: Hyukjin Kwon (hyukjin.kwon)
    • Reporter: Brian Bruggeman (bbruggeman)
    • Votes: 1
    • Watchers: 4
