Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Fix Version: 3.0.0
Description
This seems to affect only Python 3.
When creating a DataFrame with type ArrayType(IntegerType(), True), some rows end up filled with None:
In [1]: from pyspark.sql.types import ArrayType, IntegerType
In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), True))
In [3]: df.distinct().collect()
Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
In this example, the corrupted rows consistently appear at elements 97 and 98:
In [5]: df.collect()[-5:]
Out[5]: [Row(value=[1, 2, 3, 4]), Row(value=[1, 2, 3, 4]), Row(value=[None, None, None, None]), Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
This also happens with a nested type such as ArrayType(ArrayType(IntegerType(), True)).
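The linked SPARK-27629 ("Prevent Unpickler from intervening each unpickling") suggests the deserializer's shared memo is involved: when many records pass through one unpickler, memo back-references can make distinct records come back as the same object. As a minimal, standalone sketch of that pickle behavior (plain CPython pickle only; this is not Spark's actual Pyrolite/serializer code path):

```python
import io
import pickle

# Pickle the same list twice through ONE Pickler: the second dump is
# written as a memo back-reference, not a fresh copy of the data.
buf = io.BytesIO()
pickler = pickle.Pickler(buf)
data = [1, 2, 3, 4]
pickler.dump(data)
pickler.dump(data)

# Read both records back through ONE Unpickler. Its memo also persists
# across load() calls, so the two "records" resolve to the same object.
buf.seek(0)
unpickler = pickle.Unpickler(buf)
a = unpickler.load()
b = unpickler.load()
print(a is b)  # True: both records alias one list
```

If aliased records are later mutated in place, one record's changes show up in the other, which is consistent with the symptom above of whole rows collapsing to the same corrupted values.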
Issue Links
- is related to
  - SPARK-27629 Prevent Unpickler from intervening each unpickling (Resolved)
- relates to
  - SPARK-18161 Default PickleSerializer pickle protocol doesn't handle > 4GB objects (Resolved)