SPARK-16542

Bugs about types that result in an array of null values when creating a DataFrame using Python


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      This is a bug about types that results in an array of null values when creating a DataFrame using Python.

      Python's array.array offers richer element types than Python itself, e.g. we can have array('f', [1, 2, 3]) and array('d', [1, 2, 3]). The code in Spark SQL did not take this into consideration, which can result in an array of null values when a row contains an array('f').
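
      For reference, the typecode selects the underlying C element type, so array('f') and array('d') store the same Python numbers at different precisions (a plain-Python illustration, independent of Spark):

      from array import array

      a_f = array('f', [1, 2, 3])   # C float, typically 4 bytes per element
      a_d = array('d', [1, 2, 3])   # C double, typically 8 bytes per element
      print(a_f.typecode, a_f.itemsize)   # f 4
      print(a_d.typecode, a_d.itemsize)   # d 8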

      A simple snippet to reproduce this is:

      from array import array

      from pyspark import SparkContext
      from pyspark.sql import Row, SQLContext

      sc = SparkContext()
      sqlContext = SQLContext(sc)

      # One row with a single-precision ('f') and a double-precision ('d') array
      row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
      rows = sc.parallelize([row1])
      df = sqlContext.createDataFrame(rows)
      df.show()

      which produces the following output:

      +---------------+------------------+
      |    doublearray|        floatarray|
      +---------------+------------------+
      |[1.0, 2.0, 3.0]|[null, null, null]|
      +---------------+------------------+
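
      Until a build with the fix is available, one possible workaround (just a sketch, not part of the resolution of this issue) is to convert the 'f'-typed array into a plain Python list of floats before building the Row, so the values match the inferred double element type:

      from array import array

      from pyspark import SparkContext
      from pyspark.sql import Row, SQLContext

      sc = SparkContext()
      sqlContext = SQLContext(sc)

      # Converting array('f', ...) to a plain list sidesteps the typecode handling
      row1 = Row(floatarray=list(array('f', [1, 2, 3])),
                 doublearray=array('d', [1, 2, 3]))
      df = sqlContext.createDataFrame(sc.parallelize([row1]))
      df.show()   # both columns should now show [1.0, 2.0, 3.0]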
      

          People

            Assignee: zasdfgbnm Xiang Gao
            Reporter: zasdfgbnm Xiang Gao
            Votes: 1
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved: