Spark / SPARK-32961

PySpark CSV read with UTF-16 encoding is not working correctly


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.4.4, 3.0.1
    • Fix Version/s: None
    • Component/s: SQL
    • Environment: both Spark local and cluster mode

    Description

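      When a UTF-16 encoded CSV file is read with PySpark, the output contains garbled characters, both when the DataFrame is printed to the console and when it is written to files.

      See the attached screenshots to compare how the same data looks in a Spark DataFrame and in a pandas DataFrame.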
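
      The following is a minimal sketch, not taken from the report, of one way stray characters can appear when UTF-16 data is decoded. It uses only the Python standard library to show that decoding a BOM-prefixed UTF-16 file as "utf-16-le" leaves the byte order mark (U+FEFF) in the first field. Whether this is what happens in the attached screenshots is an assumption and has not been checked against sendo_sample.csv. The file name sample_utf16.csv, the sample rows, and the helper names are hypothetical; read_with_spark only illustrates the kind of PySpark call the report describes and requires pyspark to run.

```python
import codecs
import csv
import io
import os
import tempfile


def write_utf16_csv(path, rows):
    """Write rows as little-endian UTF-16 CSV with an explicit BOM."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF16_LE + buf.getvalue().encode("utf-16-le"))


def read_csv_as(path, encoding):
    """Decode the file with the given codec and return the parsed rows."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f))


def read_with_spark(path):
    """Hypothetical reproduction of the reported read (needs pyspark)."""
    from pyspark.sql import SparkSession  # lazy import: optional dependency

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    # encoding and multiLine are options of DataFrameReader.csv
    return spark.read.csv(path, header=True, encoding="UTF-16", multiLine=True)


# Hypothetical sample data, not the contents of the attached file.
rows = [["id", "name"], ["1", "Sài Gòn"]]
path = os.path.join(tempfile.mkdtemp(), "sample_utf16.csv")
write_utf16_csv(path, rows)

good = read_csv_as(path, "utf-16")    # the codec consumes the BOM
bad = read_csv_as(path, "utf-16-le")  # the BOM survives as U+FEFF

print(repr(good[0][0]))
print(repr(bad[0][0]))
```

      Comparing good and bad shows that only the first field differs, so a leftover BOM would explain a stray character at the start of the first column name but not garbling throughout the data.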

       

      Attachments

        1. pandas df.png
          620 kB
          Bui Bao Anh
        2. pyspark df.png
          314 kB
          Bui Bao Anh
        3. pyspark utf-16le.png
          683 kB
          Bui Bao Anh
        4. pyspark utf-16 with multiline csv.png
          564 kB
          Bui Bao Anh
        5. sendo_sample.csv
          346 kB
          Bui Bao Anh

            People

              Assignee: Unassigned
              Reporter: Bui Bao Anh (bbanh)
              Votes: 0
              Watchers: 3
