Spark / SPARK-27612

Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark, SQL

    Description

      This seems to affect only Python 3.

      When creating a DataFrame with the type ArrayType(IntegerType(), True), some rows come back with arrays filled with None.

       

      In [1]: from pyspark.sql.types import ArrayType, IntegerType                                                                 
      
      In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), True))                                     
      
      In [3]: df.distinct().collect()                                                                                              
      Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
      

       

      In this example, the affected rows are consistently at indices 97 and 98 (0-based) of the 100 rows:

      In [5]: df.collect()[-5:]                                                                                                    
      Out[5]: 
      [Row(value=[1, 2, 3, 4]),
       Row(value=[1, 2, 3, 4]),
       Row(value=[None, None, None, None]),
       Row(value=[None, None, None, None]),
       Row(value=[1, 2, 3, 4])]
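
To locate the affected rows without scanning the output by eye, a small helper can report their indices. This is only a sketch: `find_null_filled_rows` is a hypothetical name, not part of PySpark, and it assumes the array values have already been pulled out of the collected rows into plain Python lists.

```python
def find_null_filled_rows(values):
    """Return the indices of non-empty arrays that contain only None.

    `values` is assumed to be a list of plain Python lists, for example
    [row[0] for row in df.collect()] from the session above.
    """
    return [
        i
        for i, value in enumerate(values)
        if len(value) > 0 and all(item is None for item in value)
    ]


# Stand-in for the collected output shown above: 100 arrays, two of them
# filled with None at positions 97 and 98.
sample = [[1, 2, 3, 4]] * 97 + [[None] * 4] * 2 + [[1, 2, 3, 4]]
affected = find_null_filled_rows(sample)
```

Against a live session, the same helper could be called as `find_null_filled_rows([row[0] for row in df.collect()])`.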
      

      This also happens with the nested type ArrayType(ArrayType(IntegerType(), True)).
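
A reproduction for the nested case might look like the sketch below. The names `is_all_none` and `reproduce_nested` are hypothetical, the input data `[[1, 2], [3, 4]]` is an assumption (the report does not show the nested input), and the exact shape of a corrupted nested row is not shown either, so the check accepts any non-empty nesting whose leaves are all None.

```python
def is_all_none(value):
    """True if value is None or a non-empty (possibly nested) list of only None."""
    if value is None:
        return True
    if isinstance(value, list):
        return len(value) > 0 and all(is_all_none(item) for item in value)
    return False


def reproduce_nested(spark, n=100):
    """Build n identical nested-array rows and return indices of corrupted ones.

    `spark` is assumed to be an existing SparkSession. pyspark is imported
    here so that is_all_none can be used without Spark installed.
    """
    from pyspark.sql.types import ArrayType, IntegerType

    schema = ArrayType(ArrayType(IntegerType(), True))
    # Assumed input data; the original report does not show the nested rows.
    df = spark.createDataFrame([[[1, 2], [3, 4]]] * n, schema)
    values = [row[0] for row in df.collect()]
    return [i for i, value in enumerate(values) if is_all_none(value)]
```

On an affected build, `reproduce_nested(spark)` would be expected to return a non-empty list; on a fixed build it should return `[]`.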

            People

              Assignee: Hyukjin Kwon (hyukjin.kwon)
              Reporter: Bryan Cutler (bryanc)