Details
Description
Transforming a column that contains null values with StringIndexer results in a java.lang.NullPointerException:
from pyspark.ml.feature import StringIndexer

df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
df.printSchema()
## root
## |-- k: string (nullable = true)
## |-- v: long (nullable = true)

indexer = StringIndexer(inputCol="k", outputCol="kIdx")
indexer.fit(df).transform(df)
## <repr(<pyspark.sql.dataframe.DataFrame at 0x7f4b0d8e7110>) failed: py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
## : java.lang.NullPointerException
The problem disappears when we drop the rows containing nulls:
df1 = df.na.drop()
indexer.fit(df1).transform(df1)
or replace the nulls with a placeholder value:
from pyspark.sql.functions import col, when

k = col("k")
df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
indexer.fit(df2).transform(df2)
and it cannot be reproduced using the Scala API:
import org.apache.spark.ml.feature.StringIndexer

val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
df.printSchema
// root
// |-- k: string (nullable = true)
// |-- v: integer (nullable = false)

val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")
indexer.fit(df).transform(df).count
// 2
Issue Links
- is duplicated by: SPARK-12779 StringIndexer should handle null (Closed)
- is related to: SPARK-19852 StringIndexer.setHandleInvalid should have another option 'new': Python API and docs (Resolved)
- relates to: SPARK-17498 StringIndexer.setHandleInvalid should have another option 'new' (Resolved)
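For reference, newer Spark releases expose a handleInvalid parameter on StringIndexer, which the linked issues above track extending; a minimal sketch, assuming a version in which handleInvalid also covers nulls rather than only unseen labels:

from pyspark.ml.feature import StringIndexer

# Sketch only: handleInvalid accepts "error", "skip" and, in later releases,
# "keep". Whether nulls are covered in addition to unseen labels depends on
# the Spark version, as tracked by the issues linked above.
indexer = StringIndexer(inputCol="k", outputCol="kIdx", handleInvalid="skip")
indexed = indexer.fit(df).transform(df)  # with "skip", offending rows are filtered from the output

With "keep", offending values would instead be mapped to an extra index rather than dropped, which avoids silently losing rows.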