Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.1, 1.4.0
    • Component/s: SQL
    • Labels:
    • Target Version/s:
    • Sprint: Spark 1.5 doc/QA sprint

      Description

      DataFrame.describe should return a DataFrame with summary statistics.

      def describe(cols: String*): DataFrame
      

      If cols is empty, then run describe on all numeric columns.

      The returned DataFrame should have 5 rows (count, mean, stddev, min, max) and n + 1 columns, where n is the number of numeric columns being described. The first column holds the name of the aggregate function, and the next n columns hold the statistics for the numeric columns of interest in the input DataFrame.

      This is similar to describe in pandas, but without percentiles, since accurate percentiles are too expensive to compute on big data. For comparison, pandas output with the percentile rows left out:

      In [19]: df.describe()
      Out[19]: 
                    A         B         C         D
      count  6.000000  6.000000  6.000000  6.000000
      mean   0.073711 -0.431125 -0.687758 -0.233103
      std    0.843157  0.922818  0.779887  0.973118
      min   -0.861849 -2.104569 -1.509059 -1.135632
      max    1.212112  0.567020  0.276232  1.071804
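
The proposed semantics can be sketched in plain Scala without any Spark dependency. This is only an illustration of the requested output shape, not Spark's implementation: `DescribeSketch`, `Table`, `Column`, `Summary`, and `describeSketch` are hypothetical names, the first column name `summary` is an assumption, and the sketch assumes the sample standard deviation (n - 1), as pandas reports, because the issue does not say which variant is intended.

```scala
// Plain-Scala sketch of the proposed describe semantics (no Spark involved).
// All names here are hypothetical and exist only for illustration.
object DescribeSketch {
  // A column is either numeric or non-numeric (modelled here as text).
  sealed trait Column
  final case class Numeric(values: Seq[Double]) extends Column
  final case class Text(values: Seq[String]) extends Column

  // Ordered (name, column) pairs stand in for a DataFrame.
  type Table = Seq[(String, Column)]

  // header has n + 1 names; rows has 5 entries of n + 1 cells each.
  final case class Summary(header: Seq[String], rows: Seq[Seq[String]])

  def describeSketch(table: Table, cols: String*): Summary = {
    val numeric: Seq[(String, Seq[Double])] =
      table.collect { case (name, Numeric(values)) => name -> values }

    // Empty `cols` means "all numeric columns". Requested names that are
    // missing or non-numeric are silently skipped in this sketch.
    val selected: Seq[(String, Seq[Double])] =
      if (cols.isEmpty) numeric
      else cols.flatMap(c => numeric.find(_._1 == c))

    // Assumes non-empty columns; min and max throw on empty input.
    def mean(xs: Seq[Double]): Double = xs.sum / xs.size

    // Assumption: sample standard deviation (divides by n - 1).
    def stddev(xs: Seq[Double]): Double = {
      val m = mean(xs)
      math.sqrt(xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1))
    }

    val aggregates: Seq[(String, Seq[Double] => Double)] = Seq(
      ("count", (xs: Seq[Double]) => xs.size.toDouble),
      ("mean", (xs: Seq[Double]) => mean(xs)),
      ("stddev", (xs: Seq[Double]) => stddev(xs)),
      ("min", (xs: Seq[Double]) => xs.min),
      ("max", (xs: Seq[Double]) => xs.max)
    )

    // The first cell of each row is the aggregate name; the rest are values.
    val header = "summary" +: selected.map(_._1)
    val rows = aggregates.map { case (name, f) =>
      name +: selected.map { case (_, values) => f(values).toString }
    }
    Summary(header, rows)
  }
}
```

For a table with a text column `name` and numeric columns `age` and `height`, `DescribeSketch.describeSketch(table)` would describe only `age` and `height`, while `DescribeSketch.describeSketch(table, "age")` would restrict the result to one numeric column plus the `summary` column.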
      

            People

            • Assignee: Andrey Zagrebin (azagrebin)
            • Reporter: Reynold Xin (rxin)
            • Votes: 0
            • Watchers: 7
