SPARK-34165: Add countDistinct option to Dataset#summary

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels: None

    Description

      The Dataset#summary function supports statistics such as count, mean, min, and max. It's a great little function for lightweight exploratory data analysis.

      Counting the distinct values in each column is a common step in exploratory data analysis. Adding it should be easy (piggybacking off the existing countDistinct code) and entirely backwards compatible, and it would help a lot of users.
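
For context, here is a minimal sketch of the workflow the description refers to: calling Dataset#summary with the statistics it already supports, then computing a per-column distinct count by hand with the existing countDistinct function. The local SparkSession, the object name, and the sample data are assumptions made for illustration; none of them come from the ticket.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct}

// Hypothetical driver object, for illustration only.
object SummaryCountDistinctSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for a quick experiment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("summary-count-distinct-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with repeated values in both columns.
    val df = Seq(
      ("alice", 30),
      ("bob", 30),
      ("alice", 45)
    ).toDF("name", "age")

    // Existing behaviour: request specific statistics by name.
    df.summary("count", "min", "max").show()

    // Manual workaround: one countDistinct aggregate per column,
    // returned as a single row keyed by the original column names.
    val distinctCounts = df.select(
      df.columns.map(c => countDistinct(col(c)).as(c)): _*
    )
    distinctCounts.show()

    spark.stop()
  }
}
```

The second step is the extra code a summary option would save: the distinct counts would appear as another row in the summary output rather than in a separate query.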

          People

            Assignee: Matthew Powers (mrpowers)
            Reporter: Matthew Powers (mrpowers)
            Votes: 0
            Watchers: 3