Description
When converting an RDD with a `float` type field to a Spark DataFrame whose schema declares that field as `IntegerType` / `LongType`, Spark 1.6.2 and 1.6.3 silently convert the field values to nulls instead of throwing an error like `LongType can not accept object ___ in type <type 'float'>`. However, this seems to be fixed in Spark 2.0.2.
The following example should make the problem clear:
```python
from pyspark.sql.types import StructField, StructType, LongType, DoubleType

nan = float("nan")  # the original snippet uses a bare `nan`; define it so the example runs

schema = StructType([
    StructField("0", LongType(), True),
    StructField("1", DoubleType(), True),
])
data = [[1.0, 1.0], [nan, 2.0]]
spark_df = sqlContext.createDataFrame(sc.parallelize(data), schema)
spark_df.show()
```
Instead of throwing an error like:
`LongType can not accept object 1.0 in type <type 'float'>`
Spark converts all the values in the first column to nulls.
Running `spark_df.show()` gives:
```
+----+---+
|   0|  1|
+----+---+
|null|1.0|
|null|2.0|
+----+---+
```
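For comparison, a quick sanity check (not part of the original report) suggests the float values are what trigger the silent nulls: with genuine Python ints in the first column, the same schema is accepted.

```python
# Sanity check (not from the original report): with genuine Python ints in
# the first column, the same LongType schema is accepted and no nulls appear,
# which points to the float values as what triggers the silent conversion.
good_data = [[1, 1.0], [2, 2.0]]
sqlContext.createDataFrame(sc.parallelize(good_data), schema).show()
```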
For the purposes of my computation, I run `mapPartitions` on a Spark DataFrame. For each partition, I convert the rows into a pandas DataFrame, do a few computations on it, and return the results as a list of lists, so `mapPartitions` over all partitions yields an RDD. That RDD is then converted back into a Spark DataFrame, as in the example above, using `sqlContext.createDataFrame(rdd, schema)`. The RDD has a column that should become `LongType` in the Spark DataFrame, but since it has missing values, it comes back as a `float` type. When Spark tries to create the DataFrame, it converts all the values in that column to nulls instead of throwing an error that there is a type mismatch.
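For illustration, here is a minimal sketch of that pipeline, assuming a source DataFrame `source_df`, hypothetical column names (`id`, `score`), and a placeholder per-partition computation; none of these names are from the actual job.

```python
import pandas as pd
from pyspark.sql.types import StructField, StructType, LongType, DoubleType

# Hypothetical result schema: "id" should be a LongType, "score" a DoubleType.
result_schema = StructType([
    StructField("id", LongType(), True),
    StructField("score", DoubleType(), True),
])

def compute_partition(rows):
    # Convert the partition's rows to a pandas DataFrame, do some work,
    # and return the results as a list of lists. If "id" has any missing
    # values, pandas upcasts the whole column to float64, so the returned
    # values are Python floats rather than ints.
    pdf = pd.DataFrame([row.asDict() for row in rows])
    if pdf.empty:
        return []
    pdf["score"] = pdf["score"] * 2.0  # placeholder computation
    return pdf[["id", "score"]].values.tolist()

result_rdd = source_df.rdd.mapPartitions(compute_partition)

# On Spark 1.6.2/1.6.3 the "id" column comes out as all nulls instead of an
# error, because its values are floats but the schema declares LongType.
result_df = sqlContext.createDataFrame(result_rdd, result_schema)
```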