Description
The documentation says:
null and NaN values will be removed from the numerical column before calculation. If the dataframe is empty or the column only contains null or NaN, an empty array is returned.
However, this small PySpark example, which filters the DataFrame down to zero rows,
sql_context.range(10).filter(col("id") == 42).approxQuantile("id", [0.99], 0.001)
throws
Py4JJavaError: An error occurred while calling o3493.approxQuantile.
: java.util.NoSuchElementException: next on empty iterator
    at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
    at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
    at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
    at scala.collection.IterableLike$class.head(IterableLike.scala:107)
    at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
    at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
    at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.last(TraversableLike.scala:431)
    at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186)
    at scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132)
    at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186)
    at org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207)
    at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(StatFunctions.scala:92)
    at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
    at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
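Until the behavior matches the documentation, callers on an affected version could guard the call themselves. The following is only a minimal sketch of such a workaround, not the fix applied in Spark: `safe_approx_quantile` is a hypothetical helper name, and it assumes a DataFrame that exposes the usual `na.drop`, `head` and `approxQuantile` methods.

```python
def safe_approx_quantile(df, column, probabilities, relative_error):
    """Hypothetical helper: return [] instead of raising on empty input."""
    # Drop rows where the column is null or NaN, mirroring what the
    # documentation says approxQuantile does before calculation.
    non_missing = df.na.drop(subset=[column])
    # head(1) returns an empty list when no rows remain, so the
    # documented "empty array" result is produced here explicitly.
    if not non_missing.head(1):
        return []
    return non_missing.approxQuantile(column, probabilities, relative_error)
```

With the example above, `safe_approx_quantile(sql_context.range(10).filter(col("id") == 42), "id", [0.99], 0.001)` would be expected to return an empty list rather than raise. The extra `head(1)` check costs one additional Spark job per call.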
Issue Links
- is duplicated by SPARK-19573 Make NaN/null handling consistent in approxQuantile (Resolved)