Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.1.3, 2.3.3, 2.4.0
None
None
Environment: Java 8/Scala 2.11/MacOs
Description
I found a few discrepancies while working with inferSchema set to true in CSV ingestion.
Given the following CSV in the attached books.csv:
id;authorId;title;releaseDate;link
1;1;Fantastic Beasts and Where to Find Them: The Original Screenplay;11/18/16;http://amzn.to/2kup94P
2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
3;1;*The Tales of Beedle the Bard, Standard Edition (Harry Potter)*;12/4/08;http://amzn.to/2kYezqr
4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1
7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
11;4;Diderot Encyclopedia: The Complete Illustrations 1762-1777;;http://amzn.to/2i2zo3I
12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
15;7;Soft Skills: The software developer's life manual;12/29/14;http://amzn.to/2zNnSyn
16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
17;9;*Java 8 in Action: Lambdas; Streams; and functional-style programming*;8/28/14;http://amzn.to/2isdqoL
18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
20;14;*Fables choisies; mises en vers par M. de La Fontaine*;9/1/1999;http://amzn.to/2yRH10W
21;15;Discourse on Method and Meditations on First Philosophy;6/15/1999;http://amzn.to/2hwB8zc
22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo
And this Java code:
Dataset<Row> df = spark.read().format("csv")
    .option("header", "true")
    .option("multiline", true)
    .option("sep", ";")
    .option("quote", "*")
    .option("dateFormat", "M/d/y")
    .option("inferSchema", true)
    .load("data/books.csv");
df.show(7);
df.printSchema();
In Spark v2.0.1
Output:
+---+--------+--------------------+-----------+--------------------+
| id|authorId|               title|releaseDate|                link|
+---+--------+--------------------+-----------+--------------------+
|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|
|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|
+---+--------+--------------------+-----------+--------------------+
only showing top 7 rows

Dataframe's schema:
root
 |-- id: integer (nullable = true)
 |-- authorId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- link: string (nullable = true)
This is fine and the expected output.
Using Apache Spark v2.1.3
Excerpt of the dataframe content:
+--------------------+--------+--------------------+-----------+--------------------+
|                  id|authorId|               title|releaseDate|                link|
+--------------------+--------+--------------------+-----------+--------------------+
|                   1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|                   2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|                   3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|                   4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|                   5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
|                   6|       2|Development Tools...|       null|                null|
|An independent st...|12/28/16|http://amzn.to/2v...|       null|                null|
+--------------------+--------+--------------------+-----------+--------------------+
only showing top 7 rows

Dataframe's schema:
root
 |-- id: string (nullable = true)
 |-- authorId: string (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- link: string (nullable = true)
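The multiline option is not recognized: the record with id 6 is split in two at the line break inside its title. As a consequence, the schema is wrong as well, with every column inferred as string.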
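To illustrate why the record with id 6 ends up as two rows, here is a minimal sketch in plain Java (no Spark involved) that compares a line-based split with a quote-aware split. It is not Spark's implementation; the class `QuoteAwareRecords` and the method `splitRecords` are hypothetical names used for illustration only, and the position of the line break inside the title is inferred from the 2.1.3 output above.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareRecords {

    /**
     * Splits raw CSV text into records, treating a line feed as a record
     * boundary only when it occurs outside a quoted field.
     * Hypothetical helper for illustration, not part of any Spark API.
     */
    public static List<String> splitRecords(String text, char quote) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes; // every quote character toggles the state
            }
            if (c == '\n' && !inQuotes) {
                records.add(current.toString()); // end of a complete record
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // last record without trailing newline
        }
        return records;
    }

    public static void main(String[] args) {
        // Assumption: the title of record 6 contains a line break after "Language?"
        String csv = "5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; "
            + "the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT\n"
            + "6;2;*Development Tools in 2006: any Room for a 4GL-style Language?\n"
            + "An independent study by Jean Georges Perrin, IIUG Board Member*;"
            + "12/28/16;http://amzn.to/2vBxOe1\n";

        // A reader that ignores quotes sees three lines, hence three rows
        System.out.println("Line-based split: " + csv.split("\n").length);          // prints 3
        // A reader that honors the * quote sees two records
        System.out.println("Quote-aware split: " + splitRecords(csv, '*').size()); // prints 2
    }
}
```

The line-based count matches the broken 2.1.3 output, where the second half of the title becomes a row of its own and shifts the remaining values to the left.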
Using Apache Spark v2.2.3
Excerpt of the dataframe content:
+---+--------+--------------------+-----------+--------------------+
| id|authorId|               title|releaseDate|               link |
+---+--------+--------------------+-----------+--------------------+
|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|
|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|
+---+--------+--------------------+-----------+--------------------+
only showing top 7 rows

Dataframe's schema:
root
 |-- id: integer (nullable = true)
 |-- authorId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- link : string (nullable = true)
The link column has a carriage return at the end of its name. If I display wider columns with:
df.show(7, 90);
I get:
+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
| id|authorId|                                                                                     title|releaseDate|                  link |
+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
|  1|       1|                          Fantastic Beasts and Where to Find Them: The Original Screenplay|   11/18/16|http://amzn.to/2kup94P |
|  2|       1|     Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)|    10/6/15|http://amzn.to/2l2lSwP |
|  3|       1|                             The Tales of Beedle the Bard, Standard Edition (Harry Potter)|    12/4/08|http://amzn.to/2kYezqr |
|  4|       1|   Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)|    10/4/16|http://amzn.to/2kYhL5n |
|  5|       2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; a...|    4/23/17|http://amzn.to/2i3mthT |
|  6|       2|  Development Tools in 2006: any Room for a 4GL-style Language? An independent study by...|   12/28/16|http://amzn.to/2vBxOe1 |
|  7|       3|                                                            Adventures of Huckleberry Finn|    5/26/94|http://amzn.to/2wOeOav |
+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
The carriage return is also appended to the value in the last cell of each row.
Same behavior in v2.3.3 and v2.4.0.
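One possible explanation, which this report does not confirm, is that books.csv uses CRLF ("\r\n") line terminators and that only the line feed is consumed as the record separator. The plain-Java sketch below shows how such a file could be checked and how a trailing carriage return could be stripped from a value. The class `CarriageReturnCheck` and its methods are hypothetical helpers, and the sample text assumes CRLF endings.

```java
public class CarriageReturnCheck {

    /** Counts CRLF ("\r\n") sequences in the given text. */
    public static int countCrLf(String text) {
        int count = 0;
        int index = text.indexOf("\r\n");
        while (index >= 0) {
            count++;
            index = text.indexOf("\r\n", index + 2);
        }
        return count;
    }

    /** Removes a single trailing carriage return, if there is one. */
    public static String stripTrailingCr(String value) {
        return value.endsWith("\r")
            ? value.substring(0, value.length() - 1)
            : value;
    }

    public static void main(String[] args) {
        // Assumption: the file uses CRLF line terminators
        String sample = "id;authorId;title;releaseDate;link\r\n"
            + "7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav\r\n";

        System.out.println("CRLF terminators: " + countCrLf(sample)); // prints 2

        // Splitting on "\n" alone leaves the "\r" attached to the last field
        String[] header = sample.split("\n")[0].split(";");
        String last = header[header.length - 1];
        System.out.println("Raw length: " + last.length());                      // prints 5
        System.out.println("Cleaned length: " + stripTrailingCr(last).length()); // prints 4
    }
}
```

To run the same check against the real file, its content could be loaded with `new String(Files.readAllBytes(Paths.get("data/books.csv")), StandardCharsets.UTF_8)` and passed to `countCrLf`.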
If I add the schema, like in:
// Creates the schema
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("id", DataTypes.IntegerType, false),
    DataTypes.createStructField("authordId", DataTypes.IntegerType, true),
    DataTypes.createStructField("bookTitle", DataTypes.StringType, false),
    DataTypes.createStructField("releaseDate", DataTypes.DateType, true), // nullable, but this will be ignored
    DataTypes.createStructField("url", DataTypes.StringType, false) });

// Reads a CSV file with header, called books.csv, stores it in a dataframe
Dataset<Row> df = spark.read().format("csv")
    .option("header", "true")
    .option("multiline", true)
    .option("sep", ";")
    .option("dateFormat", "M/d/y")
    .option("quote", "*")
    .schema(schema)
    .load("data/books.csv");
The output matches what is expected in every version except 2.1.3, where Spark simply crashes.
All the code can be downloaded from GitHub at: https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch07.