Description
df = sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
df.printSchema()
produces
root
 |-- should_be_int: string (nullable = true)
 |-- should_be_str: string (nullable = true)
Notice that `should_be_int` ends up with a `string` datatype, even though the documentation states:
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table, and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
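For reference, a minimal sketch of the Row-based path the documentation describes (assuming an existing SparkContext `sc` and SparkSession `spark`). Inference here only reflects the Python types of the values themselves, so str values stay strings and never get parsed as numbers:

from pyspark.sql import Row

# Inference picks up the Python value types: int -> long, str -> string.
# It does not inspect string contents the way CSV/JSON inference does.
rows = sc.parallelize([Row(should_be_int=1, should_be_str='a'),
                       Row(should_be_int=2, should_be_str='b'),
                       Row(should_be_int=3, should_be_str='c')])
spark.createDataFrame(rows).printSchema()
# root
#  |-- should_be_int: long (nullable = true)
#  |-- should_be_str: string (nullable = true)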
Schema inference works as expected when reading delimited files like
spark.read.format('csv').option('inferSchema', True)...
but not when using toDF() / createDataFrame() API calls.
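For comparison, a sketch of the CSV path, where inference does convert '1' to an integer. The file path /tmp/data.csv is hypothetical; assume it contains the same rows, e.g. "1,a":

# Hypothetical file /tmp/data.csv with lines like "1,a", "2,b", "3,c"
df = spark.read.format('csv') \
    .option('inferSchema', True) \
    .load('/tmp/data.csv')
df.printSchema()
# root
#  |-- _c0: integer (nullable = true)
#  |-- _c1: string (nullable = true)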
Observed on Spark 2.2.
Issue Links
- relates to SPARK-15463: Support for creating a dataframe from CSV in Dataset[String] (Resolved)
- relates to SPARK-22112: Add missing method to pyspark api: spark.read.csv(Dataset<String>) (Resolved)
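With the linked tickets resolved, PySpark's spark.read.csv also accepts an RDD of strings, which lets the CSV reader's inference run over in-memory data. A sketch, assuming the same `sc`/`spark` as above:

# Route in-memory CSV strings through the CSV reader to get its inference:
rdd = sc.parallelize(['1,a', '2,b', '3,c'])
df = spark.read.csv(rdd, inferSchema=True)
df.printSchema()
# root
#  |-- _c0: integer (nullable = true)
#  |-- _c1: string (nullable = true)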