Spark / SPARK-11569

StringIndexer transform fails when column contains nulls

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.5.0, 1.6.0
    • Fix Version/s: 2.2.0
    • Component/s: ML, PySpark
    • Labels: None

      Description

      Transforming a column that contains null values with StringIndexer raises java.lang.NullPointerException:

      from pyspark.ml.feature import StringIndexer
      
      df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
      df.printSchema()
      ## root
      ##  |-- k: string (nullable = true)
      ##  |-- v: long (nullable = true)
      
      indexer = StringIndexer(inputCol="k", outputCol="kIdx")
      
      indexer.fit(df).transform(df)
      ## <repr(<pyspark.sql.dataframe.DataFrame at 0x7f4b0d8e7110>) failed: py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
      ## : java.lang.NullPointerException
      

      The problem disappears when we drop the rows that contain nulls:

      df1 = df.na.drop()
      indexer.fit(df1).transform(df1)
      

      Replacing the nulls with a placeholder value also avoids the error:

      from pyspark.sql.functions import col, when
      
      k = col("k")
      df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
      indexer.fit(df2).transform(df2)
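
To show what the placeholder workaround does to the resulting indices, here is a minimal pure-Python sketch of frequency-based label indexing. It is not Spark's implementation: `index_labels` and its parameters are hypothetical names, the sample values are made up for illustration, and the alphabetical tie-break is an assumption added only to make the result deterministic.

```python
from collections import Counter


def index_labels(values, placeholder="__NA__"):
    """Index strings by descending frequency, treating None as `placeholder`.

    Hypothetical helper for illustration only; not part of the PySpark API.
    """
    # Same idea as the when(k.isNull(), "__NA__") replacement above.
    filled = [placeholder if v is None else v for v in values]
    counts = Counter(filled)
    # Most frequent label first; ties broken alphabetically (an assumption).
    labels = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(labels)}
    return [mapping[v] for v in filled], labels


indices, labels = index_labels(["a", None, "a", "b", None, "a"])
print(labels)   # ['a', '__NA__', 'b']
print(indices)  # [0.0, 1.0, 0.0, 2.0, 1.0, 0.0]
```

Note that the placeholder becomes an ordinary label and competes in the frequency ordering like any other value, so it should be a string that cannot occur in the real data.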
      

      The problem cannot be reproduced using the Scala API:

      import org.apache.spark.ml.feature.StringIndexer
      
      val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
      df.printSchema
      // root
      //  |-- k: string (nullable = true)
      //  |-- v: integer (nullable = false)
      
      val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")
      
      indexer.fit(df).transform(df).count
      // 2
      

              People

              • Assignee: Menglong TAN (crackcell)
              • Reporter: Maciej Szymkiewicz (zero323)
              • Shepherd: Joseph K. Bradley
              • Votes: 3
              • Watchers: 13