Spark / SPARK-27612

Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark, SQL

    Description

      This seems to only affect Python 3.

      When creating a DataFrame with the schema ArrayType(IntegerType(), True), some of the resulting rows end up filled with None.

       

      In [1]: from pyspark.sql.types import ArrayType, IntegerType                                                                 
      
      In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), True))                                     
      
      In [3]: df.distinct().collect()                                                                                              
      Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
      

       

      In this example, the corrupted rows consistently appear at elements 97 and 98:

      In [5]: df.collect()[-5:]                                                                                                    
      Out[5]: 
      [Row(value=[1, 2, 3, 4]),
       Row(value=[1, 2, 3, 4]),
       Row(value=[None, None, None, None]),
       Row(value=[None, None, None, None]),
       Row(value=[1, 2, 3, 4])]
      

      This also happens with a schema of ArrayType(ArrayType(IntegerType(), True)).
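      A side note on the reproduction itself, offered as context only: the snippet below is plain Python and does not establish the root cause of this bug. Because the repro builds its data with list repetition ([[1, 2, 3, 4]] * 100), all 100 "rows" are references to one and the same list object, which is standard Python sequence-repetition semantics and worth keeping in mind when reasoning about how the rows are serialized.

      ```python
      # The repro multiplies a single list literal, so every "row" is a
      # reference to the same Python list object (standard sequence-repetition
      # semantics; shown only as context, not as the established root cause).
      data = [[1, 2, 3, 4]] * 100
      print(len(data))                            # → 100
      print(all(row is data[0] for row in data))  # → True (one shared object)

      # A comprehension, by contrast, yields 100 independent list objects.
      independent = [[1, 2, 3, 4] for _ in range(100)]
      print(any(row is independent[0] for row in independent[1:]))  # → False
      ```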


      People

        Assignee: Hyukjin Kwon (gurwls223)
        Reporter: Bryan Cutler (bryanc)
        Votes: 0
        Watchers: 4
