  1. Spark
  2. SPARK-23649

CSV schema inferring fails on some UTF-8 chars


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.2.2, 2.3.1, 2.4.0
    • Component/s: SQL
    • Labels:
      None

      Description

      Schema inference for CSV files fails if the file contains a character that starts with the byte 0xFF.

      spark.read.option("header", "true").csv("utf8xFF.csv")
      
      java.lang.ArrayIndexOutOfBoundsException: 63
        at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
        at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
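
The index 63 in the exception matches (0xFF & 0xFF) - 192 = 63, which suggests the sequence length is looked up in a table indexed by the first byte minus 0xC0. The sketch below is only an illustration of that failure mode, not the actual UTF8String source: the object name, the table contents (62 entries covering lead bytes 0xC0 to 0xFD), and the bounds-checked variant are all assumptions, and the guard shown is not necessarily the fix that was merged.

```scala
object FirstByteLookupSketch {
  // Hypothetical table: sequence length for lead bytes 0xC0..0xFD (62 entries).
  val bytesForLeadByte: Array[Int] =
    Array.fill(32)(2) ++ Array.fill(16)(3) ++ Array.fill(8)(4) ++
      Array.fill(4)(5) ++ Array.fill(2)(6)

  // Unchecked lookup: offsets 62 (0xFE) and 63 (0xFF) fall outside the table.
  def numBytesUnchecked(b: Byte): Int = {
    val offset = (b & 0xFF) - 192
    if (offset >= 0) bytesForLeadByte(offset) else 1
  }

  // One possible guard (an assumption): treat bytes outside the table
  // as single-byte units instead of indexing past the end.
  def numBytesChecked(b: Byte): Int = {
    val offset = (b & 0xFF) - 192
    if (offset >= 0 && offset < bytesForLeadByte.length) bytesForLeadByte(offset) else 1
  }
}
```

With this table, numBytesUnchecked(0xFF.toByte) throws ArrayIndexOutOfBoundsException, while numBytesChecked(0xFF.toByte) returns 1.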
      

      Here is the content of the file:

      hexdump -C ~/tmp/utf8xFF.csv
      00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
      00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
      00000020  2c 34 35 36 0d                                    |,456.|
      00000025
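
For anyone without the attachment, the file can probably be recreated from the hexdump. This is a minimal sketch: the byte sequence is copied from the dump above (CRLF line endings, a single 0xFF after "ABGUN", and a trailing CR without LF), while the object name and the output path in the working directory are assumptions.

```scala
import java.nio.file.{Files, Paths}

object WriteUtf8xFF {
  // 37 bytes (0x25), matching the final offset reported by hexdump.
  val content: Array[Byte] =
    "channel,code\r\nUnited,123\r\nABGUN".getBytes("US-ASCII") ++
      Array(0xFF.toByte) ++
      ",456\r".getBytes("US-ASCII")

  def main(args: Array[String]): Unit = {
    // Writes the file into the current working directory (assumed location).
    Files.write(Paths.get("utf8xFF.csv"), content)
    println(s"wrote ${content.length} bytes")
  }
}
```

The bytes 0xFE and 0xFF never occur in well-formed UTF-8, so this file is not valid UTF-8 even though it is read as such.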
      

      Schema inference does not fail in multiline mode:

      spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv").show
      
      +-------+-----+
      |channel|code
      +-------+-----+
      | United| 123
      | ABGUN�| 456
      +-------+-----+
      

      Spark can also read the CSV file if the schema is specified explicitly:

      import org.apache.spark.sql.types._
      val schema = new StructType().add("channel", StringType).add("code", StringType)
      spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
      
      +-------+----+
      |channel|code|
      +-------+----+
      | United| 123|
      | ABGUN�| 456|
      +-------+----+
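
The � shown after ABGUN is most likely U+FFFD, the Unicode replacement character. The sketch below assumes the value is turned into a JVM String with the standard UTF-8 decoder, which substitutes U+FFFD for malformed input such as a lone 0xFF; the object and value names are hypothetical.

```scala
import java.nio.charset.StandardCharsets

object ReplacementCharSketch {
  // "ABGUN" followed by the stray 0xFF byte from the file.
  val raw: Array[Byte] = "ABGUN".getBytes(StandardCharsets.US_ASCII) :+ 0xFF.toByte

  // String(byte[], Charset) replaces malformed sequences rather than throwing.
  val decoded: String = new String(raw, StandardCharsets.UTF_8)
}
```

Here decoded has six characters, and the last one is '\uFFFD'.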
      

        Attachments

        1. utf8xFF.csv
          0.0 kB
          Maxim Gekk


            People

            • Assignee:
              Unassigned
              Reporter:
              maxgekk Maxim Gekk
              Shepherd:
              Herman van Hövell
            • Votes:
              0
              Watchers:
              4
