Spark / SPARK-16542

Type-handling bug that results in an array of nulls when creating a DataFrame using Python

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: PySpark, SQL
    • Labels:
      None

      Description

      This bug concerns element types: creating a DataFrame from Python can produce an array of null values.

      Python's array.array supports a richer set of element types than Python's built-in numeric types, e.g. we can have array('f',[1,2,3]) and array('d',[1,2,3]). The code in spark-sql did not take this into account, so you can get an array of null values when a row contains an array('f').
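
      To make the type distinction concrete, here is a minimal sketch using only the standard library (no Spark involved). The variable names are illustrative. It shows that the two arrays hold the same values but carry different type codes, which is the information the description says was not taken into account.

```python
from array import array

# Same values, different element type codes.
float_arr = array('f', [1, 2, 3])    # 'f': C float (single precision)
double_arr = array('d', [1, 2, 3])   # 'd': C double (double precision)

for arr in (float_arr, double_arr):
    # typecode identifies the element type; itemsize is bytes per element
    print(arr.typecode, arr.itemsize, list(arr))
```

      Converted to a plain list, both arrays yield ordinary Python floats, so the difference lives only in the array's type code.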

      A simple script to reproduce this:

      from array import array

      from pyspark import SparkContext
      from pyspark.sql import SQLContext, Row

      sc = SparkContext()
      sqlContext = SQLContext(sc)

      # One row holding a single-precision ('f') and a double-precision ('d') array
      row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
      rows = sc.parallelize([row1])
      df = sqlContext.createDataFrame(rows)
      df.show()  # the floatarray column comes back as nulls
      

      which produces the following output:

      +---------------+------------------+
      |    doublearray|        floatarray|
      +---------------+------------------+
      |[1.0, 2.0, 3.0]|[null, null, null]|
      +---------------+------------------+
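
      On affected versions, one possible way to sidestep the problem is to convert array.array values to plain Python lists before building the rows. The sketch below is an assumption on my part, not the fix that was applied for this issue, and arrays_to_lists is a hypothetical helper name. It runs without Spark.

```python
from array import array

def arrays_to_lists(record):
    """Return a copy of a dict record with array.array values replaced by lists.

    Hypothetical helper for illustration; not part of PySpark.
    """
    return {
        key: list(value) if isinstance(value, array) else value
        for key, value in record.items()
    }

record = {
    "floatarray": array('f', [1, 2, 3]),
    "doublearray": array('d', [1, 2, 3]),
}
plain = arrays_to_lists(record)
print(plain)
```

      The resulting dict could then be passed on as Row(**plain), so that both columns are built from plain Python floats; I would expect this to avoid the nulls, but it has not been verified against the affected versions.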
      

            People

            • Assignee: Xiang Gao (zasdfgbnm)
            • Reporter: Xiang Gao (zasdfgbnm)