Spark / SPARK-31735

Include all columns in the summary report


    Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: Spark Core, SQL
    • Labels:
      None

      Description

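      df.summary() leaves out date columns, and other columns that are neither numeric nor string, so the report below contains no statistics at all for a DataFrame with a single date column: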

       

      from datetime import datetime, timedelta, timezone
      from pyspark.sql import types as T
      from pyspark.sql import Row
      from pyspark.sql import functions as F

      START = datetime(2014, 1, 1, tzinfo=timezone.utc)
      n_days = 22
      date_range = [Row(date=(START + timedelta(days=n))) for n in range(0, n_days)]
      schema = T.StructType([T.StructField(name="date", dataType=T.DateType(), nullable=False)])
      rdd = spark.sparkContext.parallelize(date_range)
      df = spark.createDataFrame(data=rdd, schema=schema)
      df.agg(F.max("date")).show()
      df.summary().show()
      -------
      |summary|
      -------
      | count |
      | mean  |
      | stddev|
      | min   |
      | 25%   |
      | 50%   |
      | 75%   |
      | max   |
      -------
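
For reference, the statistics one might expect the report to show for the date column can be sketched in plain Python, without Spark. This is only an illustration of the requested behavior, not Spark's implementation: the helper names (nearest_rank, summarize_dates) are hypothetical, and the nearest-rank percentile rule is an assumption chosen for simplicity (Spark's summary uses approximate percentiles). Mean and stddev are left out because their meaning for dates is less obvious.

```python
from datetime import date, timedelta
import math

# Same 22 consecutive days as in the reproduction above.
START = date(2014, 1, 1)
N_DAYS = 22
dates = [START + timedelta(days=n) for n in range(N_DAYS)]


def nearest_rank(sorted_values, fraction):
    """Pick the element at `fraction` using the nearest-rank rule (assumed here)."""
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_dates(values):
    """Order-based statistics that are well defined for dates."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "25%": nearest_rank(ordered, 0.25),
        "50%": nearest_rank(ordered, 0.50),
        "75%": nearest_rank(ordered, 0.75),
        "max": ordered[-1],
    }


date_summary = summarize_dates(dates)
for name, value in date_summary.items():
    print(f"{name:>7} | {value}")
```

Dates are totally ordered, so count, min, max and the percentiles need no conversion; only mean and stddev would require mapping dates to numbers first.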

            People

            • Assignee:
              Unassigned
            • Reporter:
              Fokko Driesprong (fokko)
            • Votes:
              0
            • Watchers:
              3
