Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.1, 1.4.0
    • Component/s: SQL
    • Sprint: Spark 1.5 doc/QA sprint

    Description

      DataFrame.describe should return a DataFrame with summary statistics.

      def describe(cols: String*): DataFrame
      

      If cols is empty, then run describe on all numeric columns.

      The returned DataFrame should have 5 rows (count, mean, stddev, min, max) and n + 1 columns, where n is the number of numeric columns being described. The first column holds the name of the aggregate function, and the remaining n columns hold the statistics for the numeric columns of interest in the input DataFrame.
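
      The shape described above can be illustrated with a minimal, Spark-free sketch in plain Python. It is not the Spark implementation: rows are modeled as a list of dicts, the header name "summary" for the first column is an assumption, and stddev is assumed to be the sample standard deviation (so at least two rows are required).

```python
import statistics
from numbers import Real


def is_numeric(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def describe(rows, *cols):
    """Summarise `rows` (a non-empty list of dicts) as a header plus 5 rows."""
    if not cols:
        # No columns requested: use every column whose values are all numeric.
        cols = [name for name in rows[0]
                if all(is_numeric(r[name]) for r in rows)]
    columns = {name: [r[name] for r in rows] for name in cols}
    stats = [
        ("count", len),
        ("mean", statistics.mean),
        ("stddev", statistics.stdev),  # assumption: sample standard deviation
        ("min", min),
        ("max", max),
    ]
    header = ["summary", *cols]  # "summary" is a hypothetical column name
    body = [[label, *(fn(columns[name]) for name in cols)]
            for label, fn in stats]
    return [header, *body]


rows = [
    {"name": "a", "age": 30, "height": 1.5},
    {"name": "b", "age": 40, "height": 2.0},
    {"name": "c", "age": 50, "height": 2.5},
]
table = describe(rows)          # "name" is skipped because it is not numeric
for line in table:
    print(line)
```

      With two numeric columns (n = 2), the result has 5 statistic rows and n + 1 = 3 columns; calling describe(rows, "age") restricts the output to a single numeric column.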

      This is similar to describe in pandas, but without the percentile rows, since accurate percentiles are too expensive to compute on large datasets:

      In [19]: df.describe()
      Out[19]: 
                    A         B         C         D
      count  6.000000  6.000000  6.000000  6.000000
      mean   0.073711 -0.431125 -0.687758 -0.233103
      std    0.843157  0.922818  0.779887  0.973118
      min   -0.861849 -2.104569 -1.509059 -1.135632
      max    1.212112  0.567020  0.276232  1.071804
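
      One way to see why these five statistics are cheap while exact percentiles are not: count, mean, stddev, min and max can each be maintained in constant memory in a single pass, and partial results from separate partitions can be merged, whereas an exact percentile generally needs the values sorted or retained. The sketch below is only an illustration of that idea (Welford's update plus a pairwise merge), not the code used in Spark; the Summary class and summarise helper are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Summary:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0               # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, x):
        # Welford's single-pass update: no need to keep earlier values.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)
        return self

    def merge(self, other):
        # Combine two partial summaries, e.g. from two partitions.
        if other.count == 0:
            return self
        if self.count == 0:
            return other
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return Summary(n, mean, m2,
                       min(self.minimum, other.minimum),
                       max(self.maximum, other.maximum))

    @property
    def stddev(self):
        # Sample standard deviation; undefined for fewer than two values.
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else math.nan


def summarise(values):
    s = Summary()
    for v in values:
        s.add(v)
    return s


left = summarise([1.0, 2.0, 3.0])     # first "partition"
right = summarise([4.0, 5.0, 6.0])    # second "partition"
total = left.merge(right)
print(total.count, total.mean, total.stddev, total.minimum, total.maximum)
```

      Each partial summary is a handful of numbers regardless of how many values it has seen, which is what makes this kind of aggregate practical on distributed data.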
      

          People

            azagrebin Andrey Zagrebin
            rxin Reynold Xin