Spark / SPARK-11569

StringIndexer transform fails when column contains nulls

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.5.0, 1.6.0
    • Fix Version/s: 2.2.0
    • Component/s: ML, PySpark
    • Labels: None

    Description

      Transforming a column that contains null values with StringIndexer raises a java.lang.NullPointerException:

      from pyspark.ml.feature import StringIndexer
      
      df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
      df.printSchema()
      ## root
      ##  |-- k: string (nullable = true)
      ##  |-- v: long (nullable = true)
      
      indexer = StringIndexer(inputCol="k", outputCol="kIdx")
      
      indexer.fit(df).transform(df)
      ## <repr(<pyspark.sql.dataframe.DataFrame at 0x7f4b0d8e7110>) failed: py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
      ## : java.lang.NullPointerException
      

      The problem disappears when we drop the rows that contain nulls

      df1 = df.na.drop()
      indexer.fit(df1).transform(df1)
      

      or replace the nulls with a placeholder label

      from pyspark.sql.functions import col, when
      
      k = col("k")
      df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
      indexer.fit(df2).transform(df2)
      

      and it cannot be reproduced using the Scala API

      import org.apache.spark.ml.feature.StringIndexer
      
      val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
      df.printSchema
      // root
      //  |-- k: string (nullable = true)
      //  |-- v: integer (nullable = false)
      
      val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")
      
      indexer.fit(df).transform(df).count
      // 2
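
      The two Python workarounds above differ in what happens to the null rows: dropping removes them from the output, while replacing turns the placeholder into a category of its own. The plain-Python sketch below illustrates that difference only. It is not Spark's implementation; `fit_labels`, `index_column`, `null_policy` and the sample column are hypothetical names and data, and the alphabetical tie-break is an assumption made to keep the result deterministic.

```python
from collections import Counter


def fit_labels(values):
    """Return distinct non-null labels, most frequent first.

    Ties are broken alphabetically; that tie-break is an assumption of
    this sketch, chosen only to make the ordering deterministic.
    """
    counts = Counter(v for v in values if v is not None)
    return sorted(counts, key=lambda label: (-counts[label], label))


def index_column(values, null_policy="error", placeholder="__NA__"):
    """Pair each value with a float index under a hypothetical null_policy.

    "error"   -> raise if any null is present (stands in for the reported failure)
    "drop"    -> discard null rows first, like df.na.drop()
    "replace" -> substitute a placeholder label, like the when/otherwise workaround
    """
    if null_policy == "drop":
        values = [v for v in values if v is not None]
    elif null_policy == "replace":
        values = [placeholder if v is None else v for v in values]
    elif null_policy != "error":
        raise ValueError("unknown null_policy: %r" % null_policy)

    if any(v is None for v in values):
        raise ValueError("column contains nulls")

    index = {label: float(i) for i, label in enumerate(fit_labels(values))}
    return [(v, index[v]) for v in values]


# Hypothetical sample column: three "a", two "b", one null.
column = ["a", "a", "a", "b", "b", None]
dropped = index_column(column, null_policy="drop")        # null row is gone
replaced = index_column(column, null_policy="replace")    # null row becomes "__NA__"
```

      With this sample, `dropped` has five rows and `replaced` has six, the last of which carries the placeholder label with its own index. Which behaviour is preferable depends on whether missing values should count as a category in downstream models.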
      

            People

              Assignee: Menglong TAN (crackcell)
              Reporter: Maciej Szymkiewicz (zero323)
              Shepherd: Joseph K. Bradley
              Votes: 3
              Watchers: 12
