Description
When reading the data below with a user-defined schema, the expected exception is not thrown for a malformed numeric value. Details:
Data:
'PatientID','PatientName','TotalBill'
'1000','Patient1','10u000'
'1001','Patient2','30000'
'1002','Patient3','40000'
'1003','Patient4','50000'
'1004','Patient5','60000'
Source code:
Dataset<Row> dataset = sparkSession.read()
    .schema(schema)
    .option("inferSchema", "true")  // ignored when an explicit schema is supplied
    .option("delimiter", ",")
    .option("quote", "\"")
    .option("mode", "PERMISSIVE")
    .csv(sourceFile);
When we collect the dataset:
dataset.collectAsList();
Schema1:
[StructField(PatientID,IntegerType,true), StructField(PatientName,StringType,true), StructField(TotalBill,IntegerType,true)]
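For reference, Schema1 can be constructed in Java roughly as follows (a minimal sketch; the variable name `schema` is assumed to be the one passed to `schema(...)` above):

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Schema1Example {
    public static StructType buildSchema1() {
        // Schema1: TotalBill declared as IntegerType
        return DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("PatientID", DataTypes.IntegerType, true),
            DataTypes.createStructField("PatientName", DataTypes.StringType, true),
            DataTypes.createStructField("TotalBill", DataTypes.IntegerType, true)
        });
    }
}
```

Schema2 differs only in declaring `TotalBill` as `DataTypes.DoubleType`.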
*Result*: Throws NumberFormatException
Caused by: java.lang.NumberFormatException: For input string: "10u000"
Schema2:
[StructField(PatientID,IntegerType,true), StructField(PatientName,StringType,true), StructField(TotalBill,DoubleType,true)]
Actual Result: no exception is thrown; the first row is read as
"PatientID": 1000,
"NumberOfVisits": "400",
"TotalBill": 10
Expected Result: A NumberFormatException should be thrown for input string "10u000", as with Schema1.
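As a sanity check (a plain-Java sketch, not Spark's internal CSV parser), both standard numeric parsers reject the malformed value, which is why the reporter expects the same exception regardless of whether the column is IntegerType or DoubleType:

```java
public class MalformedNumberCheck {
    // Returns true if the standard Java integer parser rejects the value.
    static boolean rejectedByInteger(String s) {
        try { Integer.parseInt(s); return false; }
        catch (NumberFormatException e) { return true; }
    }

    // Returns true if the standard Java double parser rejects the value.
    static boolean rejectedByDouble(String s) {
        try { Double.parseDouble(s); return false; }
        catch (NumberFormatException e) { return true; }
    }

    public static void main(String[] args) {
        // "10u000" is not a valid number in either representation.
        System.out.println(rejectedByInteger("10u000")); // true
        System.out.println(rejectedByDouble("10u000"));  // true
        // A well-formed value such as "30000" parses fine.
        System.out.println(rejectedByDouble("30000"));   // false
    }
}
```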