There seems to be a regression in Spark 3.0.0 in how NaN values are normalized in COUNT(DISTINCT ...). Here is an illustration:
Here are the results under Spark 3.0.0:
Note that the count for mithunr is 3: each distinct NaN representation is counted as a separate value.
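To make the effect concrete outside of Spark, here is a minimal sketch in plain Python of why un-normalized NaNs can inflate a distinct count. It is not Spark's implementation: the helper names (`double_from_bits`, `bits_from_double`, `normalize`), the three bit patterns, and the use of 64-bit doubles are all assumptions chosen for illustration. The idea is that IEEE 754 allows many bit patterns that are all NaN, so a distinct count keyed on raw bits sees several values unless they are first mapped to one canonical NaN.

```python
import math
import struct

# Hypothetical helpers for illustration only; not part of any Spark API.
def double_from_bits(bits: int) -> float:
    """Reinterpret a 64-bit integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def bits_from_double(value: float) -> int:
    """Return the raw 64-bit pattern of a double."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]

# Assumed canonical (quiet) NaN pattern used as the normalization target.
CANONICAL_NAN_BITS = 0x7FF8000000000000

def normalize(value: float) -> float:
    """Map every NaN to the single canonical NaN; leave other values alone."""
    return double_from_bits(CANONICAL_NAN_BITS) if math.isnan(value) else value

# Three different quiet-NaN bit patterns (illustrative values).
nans = [
    double_from_bits(0x7FF8000000000000),
    double_from_bits(0x7FF8000000000001),
    double_from_bits(0x7FFFFFFFFFFFFFFF),
]

# Distinct count keyed on raw bits: every NaN pattern is its own value.
raw_distinct = len({bits_from_double(v) for v in nans})

# Distinct count after normalization: all NaNs collapse to one value.
normalized_distinct = len({bits_from_double(normalize(v)) for v in nans})

print(raw_distinct, normalized_distinct)
```

On a typical IEEE 754 platform, the raw count comes out as 3 and the normalized count as 1, which mirrors the difference between the incorrect and the expected COUNT(DISTINCT ...) result described here.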
The right results are returned when another aggregation is added to the GROUP BY query:
Also, note that Spark 2.4.6 normalizes the DISTINCT expression correctly: