Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

Description

We already have pyspark.sql.functions.approx_count_distinct(), which can be applied to grouped data, but it seems odd that you can't just get a regular approximate count of rows per group.
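
For reference, the existing per-group approximate-distinct aggregation already works today. A minimal sketch (the data and column names here are made up for illustration):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy data; 'col1' and 'col2' are illustrative names.
    df = spark.createDataFrame(
        [('a', 1), ('a', 2), ('b', 1)],
        ['col1', 'col2'])

    # This counts distinct values of col2 per group, but there is no
    # analogous approximate count of *rows* per group.
    (df
        .groupBy('col1')
        .agg(F.approx_count_distinct('col2').alias('distinct_col2'))
        .show())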

I imagine the API would mirror that of RDD.countApprox(), but I'm not sure:

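    # Proposed (hypothetical) API -- no such countApprox() exists on grouped data today.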
    (df
        .groupBy('col1')
        .countApprox(timeout=300, confidence=0.95)
        .show())
      
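For comparison, here is the existing RDD method that the sketch above would mirror; its timeout is in milliseconds, and it works on the df from the earlier sketch:

    # countApprox() returns a potentially incomplete count if the job
    # does not finish within the timeout (milliseconds).
    approx_rows = df.rdd.countApprox(timeout=300, confidence=0.95)
    print(approx_rows)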

Or, if we want to mirror the approx_count_distinct() function instead, we could do that too. I'd want to understand why that function doesn't take a timeout or confidence parameter, though. Also, what does rsd mean? It's not documented.

People

    • Assignee: Unassigned
    • Reporter: Nicholas Chammas (nchammas)
    • Votes: 0
    • Watchers: 5
