Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-13846

VectorIndexer output on unknown feature should be more descriptive

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.6.1
    • Fix Version/s: 2.3.0
    • Component/s: ML
    • Labels:
      None

      Description

      I got the exception and looks like it's related to unknown categorical variable value passed to indexing.

      java.util.NoSuchElementException: key not found: 20.0
      at scala.collection.MapLike$class.default(MapLike.scala:228)
      at scala.collection.AbstractMap.default(Map.scala:58)
      at scala.collection.MapLike$class.apply(MapLike.scala:141)
      at scala.collection.AbstractMap.apply(Map.scala:58)
      at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:316)
      at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:315)
      at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
      at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
      at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:315)
      at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:309)
      at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
      at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr2$(Unknown Source)
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
      at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
      at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
      at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
      at org.apache.spark.scheduler.Task.run(Task.scala:89)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

      VectorIndexer created like
      val featureIndexer = new VectorIndexer()
      .setInputCol(DataFrameColumns.FEATURES)
      .setOutputCol("indexedFeatures")
      .setMaxCategories(5)
      .fit(trainingDF)

      Output should be not just default java.util.NoSuchElementException, but something specific like UnknownCategoricalValue with information, that could help to find the source element of vector (element index in vector maybe).

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                WeichenXu123 Weichen Xu
                Reporter:
                spikhalskiy Dmitry Spikhalskiy
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: