Spark / SPARK-31735

Include all columns in the summary report


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: Spark Core, SQL

    Description

      Date columns (and other non-numeric, non-string columns) are excluded from the summary report:

       

      from datetime import datetime, timedelta, timezone

      from pyspark.sql import types as T
      from pyspark.sql import Row
      from pyspark.sql import functions as F

      START = datetime(2014, 1, 1, tzinfo=timezone.utc)
      n_days = 22
      date_range = [Row(date=(START + timedelta(days=n))) for n in range(0, n_days)]
      schema = T.StructType([T.StructField(name="date", dataType=T.DateType(), nullable=False)])

      rdd = spark.sparkContext.parallelize(date_range)
      df = spark.createDataFrame(data=rdd, schema=schema)

      df.agg(F.max("date")).show()
      df.summary().show()
      +-------+
      |summary|
      +-------+
      |  count|
      |   mean|
      | stddev|
      |    min|
      |    25%|
      |    50%|
      |    75%|
      |    max|
      +-------+
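
      This matches summary() (like describe()) only computing statistics for numeric and string columns. As a minimal workaround sketch, not part of the original report and assuming the df built in the snippet above: casting the DateType column to string makes it appear in the report, with count/min/max populated and mean/stddev/percentiles left null for the string column.

      from pyspark.sql import functions as F

      # Cast the DateType column to string so summary() picks it up.
      df.withColumn("date", F.col("date").cast("string")).summary().show()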


    People

      Assignee: Unassigned
      Reporter: Fokko Driesprong
      Votes: 0
      Watchers: 3
