SPARK-22505: toDF() / createDataFrame() type inference doesn't work as expected

      Description

      df = sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
      df.printSchema()
      

      produces

      root
       |-- should_be_int: string (nullable = true)
       |-- should_be_str: string (nullable = true)
      

      Notice that `should_be_int` has the `string` datatype. According to the documentation:
      https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection

      Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table, and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
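      For context on the quoted passage: toDF() / createDataFrame() infer each column's type from the Python objects themselves, not from the contents of string values. A rough pure-Python mimic of that rule (infer_type is an illustrative stand-in here, not Spark's actual implementation):

      ```python
      def infer_type(value):
          """Map a Python value to a Spark SQL type name, roughly mirroring
          reflection-based inference. (Illustrative sketch only.)"""
          mapping = {bool: 'boolean', int: 'long', float: 'double', str: 'string'}
          return mapping.get(type(value), 'unknown')

      # The values in the reported example are all Python str objects
      # ('1', '2', '3'), so every column is inferred as string:
      print(infer_type('1'))  # string
      print(infer_type(1))    # long
      ```

      Under this rule, the reported schema follows from the input tuples holding str values; the reporter's expectation is that string contents would be parsed, as with csv's inferSchema.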

      Schema inference works as expected when reading delimited files like

      spark.read.format('csv').option('inferSchema', True)...
      

      but not when using toDF() / createDataFrame() API calls.

      Affected version: Spark 2.2.
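      A minimal workaround sketch, assuming the goal is an integer column: since inference operates on the Python objects rather than parsing strings, casting the values before building the DataFrame yields the expected schema. The coerce helper below is hypothetical, not part of any Spark API:

      ```python
      def coerce(value):
          """Convert numeric-looking strings to int; leave everything else as-is.
          (Hypothetical helper, not part of Spark.)"""
          try:
              return int(value)
          except (TypeError, ValueError):
              return value

      raw = [('1', 'a'), ('2', 'b'), ('3', 'c')]
      typed = [tuple(coerce(v) for v in row) for row in raw]
      # typed == [(1, 'a'), (2, 'b'), (3, 'c')]

      # With a SparkSession in scope (assumed), the int values should now
      # infer as long rather than string:
      # spark.createDataFrame(typed, ['should_be_int', 'should_be_str']).printSchema()
      ```

      An explicit StructType schema passed to createDataFrame() is the other standard alternative when the input values must stay as strings.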

        People

        Assignee: Unassigned
        Reporter: Ruslan Dautkhanov (Tagar)