Spark / SPARK-18709

Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.2, 1.6.3
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: Important

    Description

      When converting an RDD with a `float`-type field to a Spark DataFrame whose schema declares that field as `IntegerType` / `LongType`, Spark 1.6.2 and 1.6.3 silently convert the field values to null instead of throwing an error like `LongType can not accept object ___ in type <type 'float'>`. This appears to be fixed in Spark 2.0.2.

      The following example should make the problem clear:

      from numpy import nan  # 'nan' is not defined in the original snippet; numpy's nan (or float('nan')) reproduces the issue
      from pyspark.sql.types import StructField, StructType, LongType, DoubleType
      
      # sqlContext and sc are the SQLContext / SparkContext of the PySpark shell
      schema = StructType([
              StructField("0", LongType(), True),
              StructField("1", DoubleType(), True),
          ])
      
      data = [[1.0, 1.0], [nan, 2.0]]  # column "0" holds Python floats, not longs
      spark_df = sqlContext.createDataFrame(sc.parallelize(data), schema)
      spark_df.show()
      

      Instead of throwing an error like:

      LongType can not accept object 1.0 in type <type 'float'>
      

      Spark silently converts all the values in the first column to null.

      Running `spark_df.show()` gives:

      +----+---+
      |   0|  1|
      +----+---+
      |null|1.0|
      |null|2.0|
      +----+---+
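
      For comparison, a minimal check (a sketch, assuming the same shell session and the `data` list from the example above; not part of the original report): declaring the first field as `DoubleType` keeps the values intact, which suggests the nulls come purely from the silent long/float mismatch rather than from the data itself.

      from pyspark.sql.types import StructField, StructType, DoubleType
      
      # Same data, but column "0" is declared as DoubleType instead of LongType:
      # the values come through as 1.0 and NaN rather than null.
      double_schema = StructType([
              StructField("0", DoubleType(), True),
              StructField("1", DoubleType(), True),
          ])
      sqlContext.createDataFrame(sc.parallelize(data), double_schema).show()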
      

      For the purposes of my computation, I'm doing a `mapPartitions` on a Spark DataFrame. Each partition is converted into a pandas DataFrame, a few computations are run on it, and the result is returned as a list of lists, so `mapPartitions` produces an RDD over all partitions. This RDD is then converted into a Spark DataFrame, as in the example above, using `sqlContext.createDataFrame(rdd, schema)`. The RDD has a column that should become `LongType` in the Spark DataFrame, but because it contains missing values, pandas stores it as `float`. When Spark creates the DataFrame, it converts all the values in that column to null instead of throwing an error about the type mismatch.
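
      A minimal sketch of that pipeline (the column names, the toy `compute` function, and the use of pandas/NumPy here are illustrative assumptions, not the actual job):

      import pandas as pd
      from pyspark.sql.types import StructField, StructType, LongType, DoubleType
      
      schema = StructType([
              StructField("id", LongType(), True),      # should hold longs
              StructField("score", DoubleType(), True),
          ])
      
      def compute(partition):
          # Collect the partition into a pandas DataFrame.
          pdf = pd.DataFrame(list(partition), columns=["id", "score"])
          if pdf.empty:
              return []
          # Any step that leaves missing values in "id" forces the column to
          # float64, since pandas has no nullable integer dtype at these versions.
          pdf.loc[pdf.index[0], "id"] = None
          # Rows come back as lists of Python floats (including NaN for "id").
          return pdf.values.tolist()
      
      source_df = sqlContext.createDataFrame([(1, 0.5), (2, 0.7), (3, 0.9)], ["id", "score"])
      rdd = source_df.rdd.mapPartitions(compute)
      # On 1.6.2/1.6.3 the float-valued "id" column silently becomes null here,
      # instead of raising "LongType can not accept object ... in type <type 'float'>".
      result_df = sqlContext.createDataFrame(rdd, schema)
      result_df.show()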


          People

            Assignee: Andrew Or (andrewor14)
            Reporter: Amogh Param (amogh.91)
            Votes: 0
            Watchers: 3
