SPARK-22505: toDF() / createDataFrame() type inference doesn't work as expected

      Description

      df = sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
      df.printSchema()
      

      produces

      root
       |-- should_be_int: string (nullable = true)
       |-- should_be_str: string (nullable = true)
      

      Notice that `should_be_int` has the `string` datatype. According to the documentation:
      https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection

      Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table, and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
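      For context on the quoted passage: toDF() / createDataFrame() infer each column's type from the Python objects themselves, not from the contents of string values. A rough pure-Python mimic of that rule (infer_type is an illustrative stand-in here, not Spark's actual implementation):

      ```python
      def infer_type(value):
          """Map a Python value to a Spark SQL type name, roughly mirroring
          reflection-based inference. (Illustrative sketch only.)"""
          mapping = {bool: 'boolean', int: 'long', float: 'double', str: 'string'}
          return mapping.get(type(value), 'unknown')

      # The values in the reported example are all Python str objects
      # ('1', '2', '3'), so every column is inferred as string:
      print(infer_type('1'))  # string
      print(infer_type(1))    # long
      ```

      Under this rule, the reported schema follows from the input tuples holding str values; the reporter's expectation is that string contents would be parsed, as with csv's inferSchema.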

      Schema inference works as expected when reading delimited files like

      spark.read.format('csv').option('inferSchema', True)...
      

      but not when using toDF() / createDataFrame() API calls.

      Affected version: Spark 2.2.
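      A minimal workaround sketch, assuming the goal is an integer column: since inference operates on the Python objects rather than parsing strings, casting the values before building the DataFrame yields the expected schema. The coerce helper below is hypothetical, not part of any Spark API:

      ```python
      def coerce(value):
          """Convert numeric-looking strings to int; leave everything else as-is.
          (Hypothetical helper, not part of Spark.)"""
          try:
              return int(value)
          except (TypeError, ValueError):
              return value

      raw = [('1', 'a'), ('2', 'b'), ('3', 'c')]
      typed = [tuple(coerce(v) for v in row) for row in raw]
      # typed == [(1, 'a'), (2, 'b'), (3, 'c')]

      # With a SparkSession in scope (assumed), the int values should now
      # infer as long rather than string:
      # spark.createDataFrame(typed, ['should_be_int', 'should_be_str']).printSchema()
      ```

      An explicit StructType schema passed to createDataFrame() is the other standard alternative when the input values must stay as strings.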

        People

        Assignee: Unassigned
        Reporter: Ruslan Dautkhanov (Tagar)