We already have pyspark.sql.functions.approx_count_distinct(), which can be applied to grouped data, but it seems odd that you can't get a plain approximate count for grouped data.
I imagine the API would mirror that of RDD.countApprox(), but I'm not sure:
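Something like the sketch below, perhaps. To be clear, everything here is hypothetical: neither a grouped countApprox nor the toy stand-in is a real PySpark API; the stand-in just illustrates the semantics I have in mind (scan until the timeout, then extrapolate the partial per-group counts):

```python
import time
from collections import Counter

# Imagined call site, mirroring RDD.countApprox(timeout, confidence=0.95):
#
#     df.groupBy("group").countApprox(timeout=1000, confidence=0.95)
#
# Toy pure-Python stand-in for the semantics (not a real API):
def grouped_count_approx(rows, key, timeout_ms, confidence=0.95):
    counts = Counter()
    deadline = time.monotonic() + timeout_ms / 1000.0
    seen = 0
    for row in rows:
        counts[key(row)] += 1
        seen += 1
        if time.monotonic() >= deadline:
            break
    # Extrapolate partial counts by the fraction of rows seen; a real
    # implementation would use `confidence` to bound the error, but this
    # toy ignores it.
    scale = len(rows) / seen if seen else 0.0
    return {k: v * scale for k, v in counts.items()}

result = grouped_count_approx(["a", "a", "b"], key=lambda r: r, timeout_ms=1000)
```

With a generous timeout the toy scans everything and the "approximation" is exact; the interesting behavior is when the deadline cuts the scan short.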
Or, if we want to mirror the approx_count_distinct() function instead, we can do that too. I'd want to understand, though, why that function doesn't take a timeout or confidence parameter. Also, what does rsd mean? It isn't documented.
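In that style, the call site might look like the sketch below. Again, count_approx is purely hypothetical (only approx_count_distinct and its rsd parameter exist today), and the toy implementation just counts exactly; it's only here to pin down the proposed signature:

```python
def count_approx(values, timeout=None, confidence=0.95):
    # Hypothetical aggregate mirroring approx_count_distinct()'s style;
    # a real version would stop at `timeout` and return an estimate
    # accurate to `confidence`. This toy just counts exactly.
    return sum(1 for _ in values)

# Imagined DataFrame usage (count_approx is not a real PySpark function):
#
#     from pyspark.sql import functions as F
#     df.groupBy("group").agg(count_approx("value", timeout=1000))
#     # vs. the existing:
#     df.groupBy("group").agg(F.approx_count_distinct("value", rsd=0.05))

n = count_approx(["x", "y", "y"])
```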