Spark / SPARK-23649

CSV schema inferring fails on some UTF-8 chars

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.2.2, 2.3.1, 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      Schema inference for CSV files fails if the file contains a character that starts with the byte 0xFF.

      spark.read.option("header", "true").csv("utf8xFF.csv")
      
      java.lang.ArrayIndexOutOfBoundsException: 63
        at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
        at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
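
      The reported index can be checked by hand: 0xFF is 255 as an unsigned value, and 255 - 192 (0xC0) = 63. The sketch below assumes, based only on the frame name numBytesForFirstByte and that index, that the byte count is looked up in a table whose first entry corresponds to 0xC0 and which stops at 0xFD. The class name, the table, and both methods are hypothetical and are not copied from Spark. The checked variant shows one possible defensive approach, not necessarily the fix that was merged.

```java
public class FirstByteLookup {
    // Hypothetical table for lead bytes 0xC0..0xFD (62 entries, indices 0..61):
    // 32 two-byte, 16 three-byte, 8 four-byte, 4 five-byte, 2 six-byte leads.
    // Assumption for illustration only; 0xFE and 0xFF have no entry.
    static final int[] BYTES_FOR_LEAD = new int[62];

    static {
        for (int i = 0; i < BYTES_FOR_LEAD.length; i++) {
            BYTES_FOR_LEAD[i] = i < 32 ? 2 : i < 48 ? 3 : i < 56 ? 4 : i < 60 ? 5 : 6;
        }
    }

    // Unchecked lookup: any byte >= 0xC0 is used as an index without a bounds test.
    static int uncheckedNumBytes(byte b) {
        int offset = (b & 0xFF) - 192; // 0xFF -> 63, past the last index (61)
        return offset >= 0 ? BYTES_FOR_LEAD[offset] : 1;
    }

    // Defensive lookup: bytes outside the table are treated as one-byte characters.
    static int checkedNumBytes(byte b) {
        int offset = (b & 0xFF) - 192;
        boolean inTable = offset >= 0 && offset < BYTES_FOR_LEAD.length;
        return inTable ? BYTES_FOR_LEAD[offset] : 1;
    }

    public static void main(String[] args) {
        System.out.println("checked(0xFF) = " + checkedNumBytes((byte) 0xFF));
        try {
            uncheckedNumBytes((byte) 0xFF);
        } catch (ArrayIndexOutOfBoundsException e) {
            // The message format differs between JDK versions.
            System.out.println("unchecked(0xFF) threw: " + e.getMessage());
        }
    }
}
```

      Under these assumptions, any input containing 0xFE or 0xFF would trigger the same exception wherever the character count of a string is computed.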
      

      Here is the content of the file:

      hexdump -C ~/tmp/utf8xFF.csv
      00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
      00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
      00000020  2c 34 35 36 0d                                    |,456.|
      00000025
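
      For anyone without the attachment, the file can be rebuilt from the hexdump above. This is a minimal sketch: the class name WriteUtf8xFF and the default output path are illustrative choices, and the bytes are transcribed from the dump (37 bytes, with 0xFF at offset 0x1f and a trailing lone CR).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WriteUtf8xFF {
    // Builds the exact byte sequence shown in the hexdump.
    static byte[] fileBytes() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] head = "channel,code\r\nUnited,123\r\nABGUN".getBytes(StandardCharsets.US_ASCII);
        out.write(head, 0, head.length);
        out.write(0xFF); // the byte that is never valid in UTF-8
        byte[] tail = ",456\r".getBytes(StandardCharsets.US_ASCII);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The output path is an assumption; pass another path as the first argument.
        Path target = Paths.get(args.length > 0 ? args[0] : "utf8xFF.csv");
        Files.write(target, fileBytes());
        System.out.println("Wrote " + fileBytes().length + " bytes to " + target);
    }
}
```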
      

      Schema inference doesn't fail in multiline mode:

      spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")
      
      +-------+-----+
      |channel|code |
      +-------+-----+
      | United| 123 |
      | ABGUN�| 456 |
      +-------+-----+
      

      Spark is also able to read the CSV file if the schema is specified explicitly:

      import org.apache.spark.sql.types._
      val schema = new StructType().add("channel", StringType).add("code", StringType)
      spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
      
      +-------+----+
      |channel|code|
      +-------+----+
      | United| 123|
      | ABGUN�| 456|
      +-------+----+
      

      Attachments

        1. utf8xFF.csv
          0.0 kB
          Max Gekk

          People

            Assignee: Unassigned
            Reporter: Max Gekk (maxgekk)
            Herman van Hövell
            Votes: 0
            Watchers: 4
