Spark / SPARK-26972

Issue with CSV import and inferSchema set to true


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.1.3, 2.3.3, 2.4.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None
    • Environment: Java 8/Scala 2.11/macOS

    Description

       

      I found a few discrepancies while working with inferSchema set to true in CSV ingestion.

      Given the following CSV in the attached books.csv:

      id;authorId;title;releaseDate;link
      1;1;Fantastic Beasts and Where to Find Them: The Original Screenplay;11/18/16;http://amzn.to/2kup94P
      2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
      3;1;*The Tales of Beedle the Bard, Standard Edition (Harry Potter)*;12/4/08;http://amzn.to/2kYezqr
      4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
      5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
      6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
      An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1
      7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
      8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
      10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
      11;4;Diderot Encyclopedia: The Complete Illustrations 1762-1777;;http://amzn.to/2i2zo3I
      12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
      13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
      14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
      15;7;Soft Skills: The software developer's life manual;12/29/14;http://amzn.to/2zNnSyn
      16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
      17;9;*Java 8 in Action: Lambdas; Streams; and functional-style programming*;8/28/14;http://amzn.to/2isdqoL
      18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
      19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
      20;14;*Fables choisies; mises en vers par M. de La Fontaine*;9/1/1999;http://amzn.to/2yRH10W
      21;15;Discourse on Method and Meditations on First Philosophy;6/15/1999;http://amzn.to/2hwB8zc
      22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
      23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo

      And this Java code:

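      // Reads books.csv and lets Spark infer the column types
      Dataset<Row> df = spark.read().format("csv")
          .option("header", "true")      // first line holds the column names
          .option("multiline", true)     // a quoted field may span several lines
          .option("sep", ";")
          .option("quote", "*")
          .option("dateFormat", "M/d/y")
          .option("inferSchema", true)
          .load("data/books.csv");
      df.show(7);
      df.printSchema();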
      

      Using Apache Spark v2.0.1

      Output: 

      +---+--------+--------------------+-----------+--------------------+
      | id|authorId|               title|releaseDate|                link|
      +---+--------+--------------------+-----------+--------------------+
      |  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
      |  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
      |  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
      |  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
      |  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
      |  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|
      |  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|
      +---+--------+--------------------+-----------+--------------------+
      only showing top 7 rows
      
      Dataframe's schema:
      root
      |-- id: integer (nullable = true)
      |-- authorId: integer (nullable = true)
      |-- title: string (nullable = true)
      |-- releaseDate: string (nullable = true)
      |-- link: string (nullable = true)
      

      This is fine and the expected output.

      Using Apache Spark v2.1.3

      Excerpt of the dataframe content: 

      +--------------------+--------+--------------------+-----------+--------------------+
      |                  id|authorId|               title|releaseDate|                link|
      +--------------------+--------+--------------------+-----------+--------------------+
      |                   1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
      |                   2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
      |                   3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
      |                   4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
      |                   5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
      |                   6|       2|Development Tools...|       null|                null|
      |An independent st...|12/28/16|http://amzn.to/2v...|       null|                null|
      +--------------------+--------+--------------------+-----------+--------------------+
      only showing top 7 rows
      
      Dataframe's schema:
      root
      |-- id: string (nullable = true)
      |-- authorId: string (nullable = true)
      |-- title: string (nullable = true)
      |-- releaseDate: string (nullable = true)
      |-- link: string (nullable = true)

      The multiline option is not recognized: the record with id 6 is split into two rows. As a consequence, the schema is wrong, with every column inferred as a string.
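The split seen in the v2.1.3 output can be reproduced outside Spark. The following is a minimal sketch in plain Java, not Spark's CSV parser: the class and method names are hypothetical, and the quote handling is deliberately simplistic (no escaping). It only contrasts a reader that ends a record at every line break with one that ignores line breaks inside a field quoted with `*`.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical helper, not Spark's CSV parser. */
final class RecordSplitSketch {

    /** Naive approach: every line break ends a record. */
    static List<String> splitOnLineBreaks(String text) {
        List<String> records = new ArrayList<>();
        for (String line : text.split("\n")) {
            records.add(line);
        }
        return records;
    }

    /** Quote-aware approach: a line break inside a quoted field is kept. */
    static List<String> splitOutsideQuotes(String text, char quote) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : text.toCharArray()) {
            if (c == quote) {
                inQuotes = !inQuotes; // toggle on each quote character
            }
            if (c == '\n' && !inQuotes) {
                records.add(current.toString()); // record is complete
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // last record without trailing line break
        }
        return records;
    }

    public static void main(String[] args) {
        // Header plus records 6 and 7 from books.csv
        String sample = "id;authorId;title;releaseDate;link\n"
            + "6;2;*Development Tools in 2006: any Room for a 4GL-style Language?\n"
            + "An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1\n"
            + "7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav\n";
        System.out.println(splitOnLineBreaks(sample).size());       // prints 4
        System.out.println(splitOutsideQuotes(sample, '*').size()); // prints 3
    }
}
```

With the naive split, the second half of record 6 becomes a row of its own that starts with text, which is consistent with the id column no longer being inferred as an integer.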

      Using Apache Spark v2.2.3

      Excerpt of the dataframe content: 

      +---+--------+--------------------+-----------+--------------------+
      | id|authorId|               title|releaseDate|               link
      |
      +---+--------+--------------------+-----------+--------------------+
      |  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
      |  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
      |  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
      |  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
      |  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
      |  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|
      |  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|
      +---+--------+--------------------+-----------+--------------------+
      only showing top 7 rows
      
      Dataframe's schema:
      root
      |-- id: integer (nullable = true)
      |-- authorId: integer (nullable = true)
      |-- title: string (nullable = true)
      |-- releaseDate: string (nullable = true)
      |-- link
      : string (nullable = true)
      

      The link column has a carriage return at the end of its name. If I then call:

      df.show(7, 90);
      

      I get: 

      +---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
      | id|authorId| title|releaseDate| link
      |
      +---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
      | 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/16|http://amzn.to/2kup94P
      |
      | 2| 1| Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)| 10/6/15|http://amzn.to/2l2lSwP
      |
      | 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/4/08|http://amzn.to/2kYezqr
      |
      | 4| 1| Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)| 10/4/16|http://amzn.to/2kYhL5n
      |
      | 5| 2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; a...| 4/23/17|http://amzn.to/2i3mthT
      |
      | 6| 2|Development Tools in 2006: any Room for a 4GL-style Language?
      An independent study by...| 12/28/16|http://amzn.to/2vBxOe1
      |
      | 7| 3| Adventures of Huckleberry Finn| 5/26/94|http://amzn.to/2wOeOav
      |
      +---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
      

      The carriage return is also appended to the value in the last cell of each row.

      Same behavior in v2.3.3 and v2.4.0.
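A trailing carriage return on the last column name and the last cell is what one would expect if books.csv uses Windows-style CRLF line endings and the reader treats only `\n` as the record separator. The issue does not confirm this, so the following plain-Java sketch is an assumption-based check, not a statement about Spark internals. The class and method names are hypothetical, and UTF-8 is assumed as the file encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

/** Illustrative only: hypothetical helper for checking line endings. */
final class LineEndingSketch {

    /** Returns true if the text contains at least one CRLF sequence. */
    static boolean hasCrlf(String text) {
        return text.contains("\r\n");
    }

    /** Converts CRLF line endings to LF; everything else is unchanged. */
    static String toLf(String text) {
        return text.replace("\r\n", "\n");
    }

    /** Last field of the first line when lines are split on '\n' only. */
    static String lastHeaderField(String text, String sep) {
        String firstLine = text.split("\n", -1)[0];
        String[] fields = firstLine.split(Pattern.quote(sep), -1);
        return fields[fields.length - 1];
    }

    /** Writes an LF-only copy of a file (assumes UTF-8 content). */
    static void normalizeFile(Path source, Path target) throws IOException {
        String text = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        Files.write(target, toLf(text).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String crlfCsv = "id;authorId;title;releaseDate;link\r\n"
            + "7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav\r\n";
        System.out.println(hasCrlf(crlfCsv));                                   // prints true
        System.out.println(lastHeaderField(crlfCsv, ";").equals("link\r"));     // prints true
        System.out.println(lastHeaderField(toLf(crlfCsv), ";").equals("link")); // prints true
    }
}
```

If `hasCrlf` returns true for the real file, writing an LF-only copy with `normalizeFile` and loading that copy is one way to test whether the line endings explain the stray carriage return.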

      If I specify the schema explicitly, as in:

          // Creates the schema
          StructType schema = DataTypes.createStructType(new StructField[] {
              DataTypes.createStructField(
                  "id",
                  DataTypes.IntegerType,
                  false),
              DataTypes.createStructField(
                  "authorId",
                  DataTypes.IntegerType,
                  true),
              DataTypes.createStructField(
                  "bookTitle",
                  DataTypes.StringType,
                  false),
              DataTypes.createStructField(
                  "releaseDate",
                  DataTypes.DateType,
                  true), // nullable, but this will be ignored
              DataTypes.createStructField(
                  "url",
                  DataTypes.StringType,
                  false) });
      
          // Reads a CSV file with header, called books.csv, stores it in a dataframe
          Dataset<Row> df = spark.read().format("csv")
              .option("header", "true")
              .option("multiline", true)
              .option("sep", ";")
              .option("dateFormat", "M/d/y")
              .option("quote", "*")
              .schema(schema)
              .load("data/books.csv");
      

      The output matches what is expected in every version except 2.1.3, where Spark simply crashes.
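The explicit schema declares releaseDate as a date with the pattern `M/d/y`, while books.csv mixes two-digit years (11/18/16) and four-digit years (12/31/1670). The sketch below uses `java.text.SimpleDateFormat` as a stand-in to show how that pattern reads both forms. Spark's own date parsing may differ by version, so treat this as an approximation rather than a description of Spark's behavior. The class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

/** Illustrative only: SimpleDateFormat used as a stand-in, not Spark's parser. */
final class DatePatternSketch {

    /** Parses a value with the pattern M/d/y and renders it as yyyy-MM-dd. */
    static String toIso(String value) {
        SimpleDateFormat input = new SimpleDateFormat("M/d/y", Locale.ROOT);
        SimpleDateFormat output = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        try {
            Date parsed = input.parse(value);
            return output.format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse date: " + value, e);
        }
    }

    public static void main(String[] args) {
        // Exactly two year digits: resolved relative to the current century window
        System.out.println(toIso("11/18/16"));
        // Four year digits: taken literally
        System.out.println(toIso("12/31/1670")); // prints 1670-12-31
        System.out.println(toIso("9/1/1999"));   // prints 1999-09-01
    }
}
```

With `SimpleDateFormat`, a year written with exactly two digits is placed in a window from 80 years before to 20 years after the time the formatter is created, and any other digit count is taken literally. That matters for values such as 7/1/4 in the sample file, which would be read as year 4.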

      All the code can be downloaded from GitHub at: https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch07.

       

       

      Attachments

        1. books.csv
          2 kB
          Jean Georges Perrin
        2. ComplexCsvToDataframeApp.java
          1 kB
          Jean Georges Perrin
        3. ComplexCsvToDataframeWithSchemaApp.java
          2 kB
          Jean Georges Perrin
        4. issue.txt
          8 kB
          Jean Georges Perrin
        5. pom.xml
          2 kB
          Jean Georges Perrin

          People

            Assignee: Unassigned
            Reporter: Jean Georges Perrin (jgp)