Spark / SPARK-17963

Add examples (extend) in each function and improve documentation

Details

    • Type: Documentation
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

      Currently, function documentation is inconsistent, and most functions have no examples in their extended usage.

      For example, some functions have bad indentation, as below:

      spark-sql> DESCRIBE FUNCTION EXTENDED approx_count_distinct;
      Function: approx_count_distinct
      Class: org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus
      Usage: approx_count_distinct(expr) - Returns the estimated cardinality by HyperLogLog++.
          approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated cardinality by HyperLogLog++
            with relativeSD, the maximum estimation error allowed.
      
      Extended Usage:
      No example for approx_count_distinct.
      
      spark-sql> DESCRIBE FUNCTION EXTENDED count;
      Function: count
      Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
      Usage: count(*) - Returns the total number of retrieved rows, including rows containing NULL values.
          count(expr) - Returns the number of rows for which the supplied expression is non-NULL.
          count(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.
      Extended Usage:
      No example for count.
      

      whereas some are cleanly indented:

      spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
      Function: percentile_approx
      Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
      Usage:
            percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
            column `col` at the given percentage. The value of percentage must be between 0.0
            and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which
            controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
            better accuracy, `1.0/accuracy` is the relative error of the approximation.
      
            percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate
            percentile array of column `col` at the given percentage array. Each value of the
            percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is
             a positive integer literal which controls approximation accuracy at the cost of memory.
             Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of
             the approximation.
      
      Extended Usage:
      No example for percentile_approx.
      

      Also, spacing is inconsistent in several places, for example, FUNC(a,b) versus FUNC(a, b) (note the spacing between arguments).

      It would be nicer if most of them had a good example covering the possible argument types.

      The suggested format for multi-line usage is as below:
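
The argument-spacing inconsistency could be caught mechanically. The following is a minimal sketch in Python, not part of Spark: the helper names `has_inconsistent_arg_spacing` and `normalize_arg_spacing` are hypothetical, and the check assumes a usage string should always have a space after a comma between arguments.

```python
import re

# Hypothetical helpers, not part of Spark.
# Matches a comma that is directly followed by something other than
# whitespace or a closing bracket, e.g. the comma in FUNC(a,b).
# Caveat: a comma inside a string literal such as '1,2' is also flagged.
_COMMA_WITHOUT_SPACE = re.compile(r",(?=[^\s)\]])")


def has_inconsistent_arg_spacing(usage: str) -> bool:
    """Return True if any comma in `usage` lacks a following space."""
    return _COMMA_WITHOUT_SPACE.search(usage) is not None


def normalize_arg_spacing(usage: str) -> str:
    """Insert a single space after each comma that lacks one."""
    return _COMMA_WITHOUT_SPACE.sub(", ", usage)
```

Under these assumptions, FUNC(a,b) would be flagged and rewritten to FUNC(a, b), while FUNC(a, b) would pass unchanged.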

      spark-sql> DESCRIBE FUNCTION EXTENDED rand;
      Function: rand
      Class: org.apache.spark.sql.catalyst.expressions.Rand
      Usage:
            rand() - Returns a random column with i.i.d. uniformly distributed values in [0, 1].
              seed is given randomly.
      
            rand(seed) - Returns a random column with i.i.d. uniformly distributed values in [0, 1].
              seed should be an integer/long/NULL literal.
      
      Extended Usage:
      > SELECT rand();
       0.9629742951434543
      > SELECT rand(0);
       0.8446490682263027
      > SELECT rand(NULL);
       0.8446490682263027
      

      For single-line usage:

      spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
      Function: date_add
      Class: org.apache.spark.sql.catalyst.expressions.DateAdd
      Usage: date_add(start_date, num_days) - Returns the date that is num_days after start_date.
      Extended Usage:
      > SELECT date_add('2016-07-30', 1);
       '2016-07-31'
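
The two suggested layouts can be summarized as a small rendering routine. This is only a sketch of the layout rules as read from the examples above, written in Python for illustration: `render_describe_extended` and its parameters are hypothetical and do not correspond to any Spark API, and the indentation widths (six spaces for a usage entry, eight for its continuation lines) are assumptions taken from the rand example.

```python
from typing import List, Tuple


def render_describe_extended(
    name: str,
    class_name: str,
    usages: List[str],
    examples: List[Tuple[str, str]],
) -> str:
    """Render function documentation in the suggested layout.

    Hypothetical helper, not part of Spark. Each usage entry may span
    several lines separated by newlines; each example is a
    (query, result) pair.
    """
    lines = [f"Function: {name}", f"Class: {class_name}"]
    if len(usages) == 1 and "\n" not in usages[0]:
        # Single-line usage stays on the same line as the label.
        lines.append(f"Usage: {usages[0]}")
    else:
        # Multi-line usage: label on its own line, one indented entry
        # per signature, and a blank line after each entry.
        lines.append("Usage:")
        for usage in usages:
            first, *rest = usage.split("\n")
            lines.append("      " + first)
            lines.extend("        " + r.strip() for r in rest)
            lines.append("")
    lines.append("Extended Usage:")
    for query, result in examples:
        lines.append(f"> {query}")
        lines.append(f" {result}")
    return "\n".join(lines)
```

Calling it with a single one-line usage string should give the date_add layout, and calling it with two multi-line usage strings should give the rand layout.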
      

            People

              gurwls223 Hyukjin Kwon