Spark / SPARK-11569

StringIndexer transform fails when column contains nulls

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.5.0, 1.6.0
    • Fix Version/s: 2.2.0
    • Component/s: ML, PySpark
    • Labels:
      None

      Description

      Transforming a column containing null values with StringIndexer results in a java.lang.NullPointerException:

      from pyspark.ml.feature import StringIndexer
      
      df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
      df.printSchema()
      ## root
      ##  |-- k: string (nullable = true)
      ##  |-- v: long (nullable = true)
      
      indexer = StringIndexer(inputCol="k", outputCol="kIdx")
      
      indexer.fit(df).transform(df)
      ## <repr(<pyspark.sql.dataframe.DataFrame at 0x7f4b0d8e7110>) failed: py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
      ## : java.lang.NullPointerException
      

      The problem disappears when we drop the rows containing nulls:

      df1 = df.na.drop()
      indexer.fit(df1).transform(df1)
      

      or replace the nulls with a placeholder:

      from pyspark.sql.functions import col, when
      
      k = col("k")
      df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
      indexer.fit(df2).transform(df2)
      

      and cannot be reproduced using the Scala API:

      import org.apache.spark.ml.feature.StringIndexer
      
      val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
      df.printSchema
      // root
      //  |-- k: string (nullable = true)
      //  |-- v: integer (nullable = false)
      
      val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")
      
      indexer.fit(df).transform(df).count
      // 2
      


          Activity

          jliwork Jia Li added a comment -

          I'm working on a PR to fix this.

          zero323 Maciej Szymkiewicz added a comment -

          It looks like this problem affects Scala after all:

          
          val df = sqlContext.createDataFrame(
            Seq(("asd2s","1e1e",1.1,0), ("asd2s","1e1e",0.1,0), 
                (null,"1e3e",1.2,0), ("bd34t","1e1e",5.1,1), 
                ("asd2s","1e3e",0.2,0), ("bd34t","1e2e",4.3,1))
          ).toDF("x0","x1","x2","x3")
          val indexer = new StringIndexer().setInputCol("x0").setOutputCol("x0idx")
          
          indexer.fit(df).transform(df).show
          // java.lang.NullPointerException
          //	at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$hash(Metadata.scala:208)
          //	at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$2.apply(Metadata.scala:196)
          //	at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$2.apply(Metadata.scala:196)
          //	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
          

          Source: http://stackoverflow.com/q/33574807/1560062

          jliwork Jia Li added a comment -

          Hi Joseph K. Bradley, Holden Karau,

          I'd like to hear your opinion on the expected behavior for this test case. I can think of these possibilities:

          1) the tuple with null gets the last index as shown below

          x0    | x1   | x2  | x3 | x0idx
          ------+------+-----+----+------
          asd2s | 1e1e | 1.1 | 0  | 0.0
          asd2s | 1e1e | 0.1 | 0  | 0.0
          null  | 1e3e | 1.2 | 0  | 2.0
          bd34t | 1e1e | 5.1 | 1  | 1.0
          asd2s | 1e3e | 0.2 | 0  | 0.0
          bd34t | 1e2e | 4.3 | 1  | 1.0

          2) the tuple with null gets index 0 before everything else
          3) eliminate the tuple from the result

          Which one do you prefer?

          Thanks,
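          For reference, option (1) can be sketched in plain Python. This is only a toy model of StringIndexer's frequency-descending ordering with nulls appended last, not Spark code; `assign_indices` and the alphabetical tie-break are choices invented here for determinism:

          ```python
          from collections import Counter

          def assign_indices(values):
              # Mimic StringIndexer's default ordering: labels sorted by
              # descending frequency (ties broken alphabetically here),
              # then map each value to its index. None values receive the
              # last index, i.e. option (1) above.
              counts = Counter(v for v in values if v is not None)
              labels = sorted(counts, key=lambda lbl: (-counts[lbl], lbl))
              index = {lbl: float(i) for i, lbl in enumerate(labels)}
              return [index[v] if v is not None else float(len(labels))
                      for v in values]

          x0 = ["asd2s", "asd2s", None, "bd34t", "asd2s", "bd34t"]
          print(assign_indices(x0))  # [0.0, 0.0, 2.0, 1.0, 0.0, 1.0]
          ```

          With the data from the table above this reproduces the proposed x0idx column: asd2s (3 occurrences) maps to 0.0, bd34t (2) to 1.0, and null gets the last index 2.0.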

          apachespark Apache Spark added a comment -

          User 'jliwork' has created a pull request for this issue:
          https://github.com/apache/spark/pull/9709

          josephkb Joseph K. Bradley added a comment -

          To choose the right API, my first comments are:

          • What do other libraries do when given null/bad values? (scikit-learn and R are the ones I tend to look at.)
          • I'd prefer to make the behavior adjustable using an option with a default. The default I'd vote for is throwing a nice error upon seeing null, though I could be convinced to go for another.
          • When we do index null, we should ideally maintain current indexing behavior, so it may make the most sense to put null at the end.
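          The adjustable-option idea above can be sketched in plain Python. This is a hypothetical illustration only; `handle_null` and its values are names invented here, not the API Spark eventually shipped:

          ```python
          from collections import Counter

          def string_index(values, handle_null="error"):
              # handle_null is a hypothetical knob illustrating the proposal:
              #   "error" -> raise on null (the suggested default)
              #   "skip"  -> drop null entries from the output
              #   "keep"  -> assign nulls the last index
              counts = Counter(v for v in values if v is not None)
              labels = sorted(counts, key=lambda lbl: (-counts[lbl], lbl))
              index = {lbl: float(i) for i, lbl in enumerate(labels)}
              out = []
              for v in values:
                  if v is None:
                      if handle_null == "error":
                          raise ValueError("null value seen during transform")
                      if handle_null == "skip":
                          continue
                      out.append(float(len(labels)))  # "keep": null at the end
                  else:
                      out.append(index[v])
              return out
          ```

          Putting null at the end means indices of the seen labels are identical to today's behavior regardless of the option chosen.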
          apachespark Apache Spark added a comment -

          User 'jliwork' has created a pull request for this issue:
          https://github.com/apache/spark/pull/9920

          timhunter Timothy Hunter added a comment -

          Also, I suggest looking at Pandas' indexers, which have the same issue to deal with.

          barrybecker4 Barry Becker added a comment -

          Null should somehow be treated as separate from the other known values.
          If the index cannot be maintained as null, then my second choice would be for it to be some sort of special value like -1.
          ML algorithms that operate on these index values should be able to differentiate null values from known values.
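          The sentinel idea can be illustrated with a small plain-Python sketch (`index_with_sentinel` is a name invented here, not Spark code):

          ```python
          def index_with_sentinel(values, ordered_labels):
              # ordered_labels: known labels in their index order;
              # nulls map to the sentinel -1.0 instead of a real index,
              # so downstream consumers can tell them apart.
              index = {lbl: float(i) for i, lbl in enumerate(ordered_labels)}
              return [index[v] if v is not None else -1.0 for v in values]

          print(index_with_sentinel(["a", None, "b"], ["a", "b"]))
          # [0.0, -1.0, 1.0]
          ```

          A downside of -1.0 is that it is not a valid categorical index, so any algorithm consuming the column must know to special-case it.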

          imatiach Ilya Matiach added a comment -

          @jliwork @srowen are you currently working on this in-progress JIRA 11569? If not, I would be interested in continuing the initial pull request that was closed. Please let me know, thank you!

          josephkb Joseph K. Bradley added a comment (edited) -

          Hi all, I'm sorry for not following up on this, but I would like us to do this at some point. However, I will insist that we do some research before adding an API based on just a few users' requirements. Have you looked at other libraries?

          • scikit-learn -> LabelIndexer does not seem to handle null values
          • various R libraries
          • other more specialized but popular ML libraries
          apachespark Apache Spark added a comment -

          User 'crackcell' has created a pull request for this issue:
          https://github.com/apache/spark/pull/17233

          josephkb Joseph K. Bradley added a comment -

          Issue resolved by pull request 17233
          https://github.com/apache/spark/pull/17233

          josephkb Joseph K. Bradley added a comment -

          Linking SPARK-19852, which can update the Python API.


            People

            • Assignee: crackcell Menglong TAN
            • Reporter: zero323 Maciej Szymkiewicz
            • Shepherd: Joseph K. Bradley
            • Votes: 3
            • Watchers: 13
