[SPARK-16542] bugs about types that result an array of null when creating dataframe using python - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.3.0
Component/s: PySpark, SQL
Labels: None

Description

This is a type-handling bug that produces an array of nulls when creating a DataFrame from Python.

Python's array.array supports a richer set of element types than Python's built-in numeric types; for example, both array('f',[1,2,3]) and array('d',[1,2,3]) are valid. The code in spark-sql did not take this into account, so a row that contains an array('f') can come back as an array of null values.
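
To illustrate the distinction described above, here is a minimal sketch in plain Python (no Spark involved). It only shows that the two arrays carry different type codes and element sizes even though their elements read back as ordinary Python floats; the variable names are illustrative and not taken from the Spark code base.

```python
from array import array

# Two arrays with the same values but different C-level element types.
float_arr = array('f', [1, 2, 3])   # single-precision elements
double_arr = array('d', [1, 2, 3])  # double-precision elements

# Both yield Python float objects when read, but the type code and the
# per-element storage size differ, which is information a plain list lacks.
summary = {
    arr.typecode: (arr.itemsize, [type(x).__name__ for x in arr])
    for arr in (float_arr, double_arr)
}
print(summary)
```

Any code that inspects only the Python type of the elements sees float in both cases, so the type code has to be consulted separately to tell the two apart.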

A simple script to reproduce this:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

# One row holding a single-precision ('f') and a double-precision ('d') array
row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
rows = sc.parallelize([row1])
df = sqlContext.createDataFrame(rows)
df.show()

which produces the following output:

+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
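
On affected versions, one possible workaround is to convert each array.array value to a plain Python list before building the rows. The sketch below shows only the conversion step in plain Python; the helper name arrays_to_lists is hypothetical, and the assumption that a list of Python floats is inferred as an array of doubles should be checked against the Spark version in use.

```python
from array import array


def arrays_to_lists(record):
    """Return a copy of `record` with every array.array value replaced by a list.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    return {
        key: (value.tolist() if isinstance(value, array) else value)
        for key, value in record.items()
    }


record = {
    "floatarray": array('f', [1, 2, 3]),
    "doublearray": array('d', [1, 2, 3]),
}
clean = arrays_to_lists(record)
print(clean)
```

The cleaned dictionary could then be passed as Row(**clean) in place of the original row. Note that this discards the single-precision type code, so the values would be treated as ordinary Python floats.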

Issue Links

links to

[Github] Pull Request #14198 (zasdfgbnm)

[Github] Pull Request #18444 (zasdfgbnm)

People

Assignee: Xiang Gao

Reporter: Xiang Gao

Votes: 1

Watchers: 3

Dates

Created: 14/Jul/16 09:00

Updated: 20/Jul/17 03:46

Resolved: 20/Jul/17 03:46