Spark / SPARK-27612

Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark, SQL

      Description

      This seems to only affect Python 3.

      When creating a DataFrame with type ArrayType(IntegerType(), True), some rows end up filled with None.

       

      In [1]: from pyspark.sql.types import ArrayType, IntegerType                                                                 
      
      In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), True))                                     
      
      In [3]: df.distinct().collect()                                                                                              
      Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
      

       

      In this example, the affected rows consistently appear at indices 97 and 98:

      In [5]: df.collect()[-5:]                                                                                                    
      Out[5]: 
      [Row(value=[1, 2, 3, 4]),
       Row(value=[1, 2, 3, 4]),
       Row(value=[None, None, None, None]),
       Row(value=[None, None, None, None]),
       Row(value=[1, 2, 3, 4])]
      

      The same failure also occurs with a nested type, ArrayType(ArrayType(IntegerType(), True)).
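
      The nested case can be reproduced with a sketch like the one below. This is a hypothetical standalone repro, assuming a local SparkSession (the original report was run from an existing `spark` shell session); on affected builds some rows collect back as `[[None, None], [None, None]]`:

      ```python
      # Hypothetical standalone repro for the nested ArrayType case (assumption:
      # a local SparkSession; the report itself used an interactive shell).
      from pyspark.sql import SparkSession
      from pyspark.sql.types import ArrayType, IntegerType

      spark = SparkSession.builder.master("local[1]").getOrCreate()

      # Schema is a bare ArrayType, so each data item is the array value itself
      # and the single column is named "value", as in the flat example above.
      df = spark.createDataFrame(
          [[[1, 2], [3, 4]]] * 100,
          ArrayType(ArrayType(IntegerType(), True)),
      )

      rows = df.collect()
      # On affected Python 3 builds, a handful of rows near the end come back
      # with every element replaced by None instead of [[1, 2], [3, 4]].
      corrupted = [i for i, r in enumerate(rows) if r.value != [[1, 2], [3, 4]]]
      print(corrupted)

      spark.stop()
      ```

      On a fixed build, `corrupted` is empty; on an affected one it lists the indices of the None-filled rows.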

              People

              • Assignee: Hyukjin Kwon (hyukjin.kwon)
              • Reporter: Bryan Cutler (bryanc)
              • Votes: 0
              • Watchers: 4
