[SPARK-6117] describe function for summary statistics - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.3.1, 1.4.0
Component/s: SQL
Labels: starter
Target Version/s: 1.4.0
Sprint: Spark 1.5 doc/QA sprint

Description

DataFrame.describe should return a DataFrame with summary statistics.

def describe(cols: String*): DataFrame

If cols is empty, then run describe on all numeric columns.

The returned DataFrame should have 5 rows (count, mean, stddev, min, max) and n + 1 columns. The 1st column is the name of the aggregate function, and the next n columns are the numeric columns of interest in the input DataFrame.

This is similar to pandas' DataFrame.describe, but without percentiles, since exact percentiles are too expensive to compute on big data:

In [19]: df.describe()
Out[19]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
max    1.212112  0.567020  0.276232  1.071804
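
The layout described above (5 rows, n + 1 columns, numeric columns only when cols is empty) can be sketched in plain Python. This is only an illustration of the requested semantics, not Spark's implementation: the dict-of-lists table, the "summary" column name, and the choice of sample standard deviation (n - 1 denominator, as pandas uses) are assumptions that the issue does not specify.

```python
import math
from numbers import Number

# Row order requested in the issue description.
STATS = ("count", "mean", "stddev", "min", "max")


def is_numeric_column(values):
    """True if the column is non-empty and every value is a number (bools excluded)."""
    return bool(values) and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )


def describe(table, *cols):
    """Summarize columns of `table`, a dict mapping column name -> list of values.

    `table` is a hypothetical stand-in for a DataFrame. Columns passed
    explicitly are assumed to be non-empty and numeric.
    """
    if not cols:
        # No columns requested: fall back to all numeric columns.
        cols = [name for name, values in table.items() if is_numeric_column(values)]

    # First column names the aggregate; "summary" is an assumed label.
    result = {"summary": list(STATS)}
    for name in cols:
        values = table[name]
        n = len(values)
        mean = sum(values) / n
        # Sample standard deviation (assumption); undefined for a single value.
        if n > 1:
            stddev = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
        else:
            stddev = float("nan")
        result[name] = [float(n), mean, stddev, min(values), max(values)]
    return result


if __name__ == "__main__":
    people = {"name": ["a", "b", "c"], "age": [20, 30, 40], "height": [1.6, 1.7, 1.8]}
    summary = describe(people)  # "name" is skipped because it is not numeric
    for i, stat in enumerate(summary["summary"]):
        print(stat, [summary[col][i] for col in summary if col != "summary"])
```

All five aggregates here can be computed in a single pass over the data, whereas exact percentiles need a sort or a selection step, which is the cost the description wants to avoid.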

Issue Links

links to

[Github] Pull Request #5073 (azagrebin)

[Github] Pull Request #5201 (rxin)

People

Assignee: Andrey Zagrebin

Reporter: Reynold Xin

Votes: 0

Watchers: 7

Dates

Created: 02/Mar/15 21:22

Updated: 24/Apr/15 00:41

Resolved: 26/Mar/15 19:26