Spark / SPARK-18940

Percentile and approximate percentile support for frequency distribution table


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.0.2
    • Fix Version/s: None
    • Component/s: SQL

    Description

      I have a frequency distribution table with the following entries:

      Age, No. of persons
      21, 10
      22, 15
      23, 18
      ..
      ..
      30, 14
      

      It is common to have data in frequency distribution format and to need statistics such as percentiles and the median computed from it. With the current implementation, finding a percentile over such data is difficult and complex.
      I therefore propose enhancing the current Percentile and Approx Percentile implementations to take a frequency column into account.
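
To illustrate the difficulty, here is a minimal plain-Scala sketch of one possible workaround: expand every (value, frequency) row into that many copies of the value and take the median of the result. The object and method names (`ExpandWorkaround`, `expand`, `median`) are hypothetical, the sample rows are only the three explicit leading rows of the table above, and none of this is Spark code.

```scala
object ExpandWorkaround {
  // Rows of a frequency table: (value, number of occurrences).
  // Only the first three rows of the table above; the elided rows are left out.
  val table: Seq[(Int, Long)] = Seq((21, 10L), (22, 15L), (23, 18L))

  // Workaround: materialise one element per counted occurrence.
  // Memory and time grow with the sum of the frequencies, not with the row count.
  def expand(rows: Seq[(Int, Long)]): Seq[Int] =
    rows.flatMap { case (value, freq) => Seq.fill(freq.toInt)(value) }

  // Exact median of raw values (mean of the two middle values for even sizes).
  def median(values: Seq[Int]): Double = {
    require(values.nonEmpty, "need at least one value")
    val sorted = values.sorted
    val mid = sorted.size / 2
    if (sorted.size % 2 == 1) sorted(mid).toDouble
    else (sorted(mid - 1) + sorted(mid)) / 2.0
  }
}
```

For the three sample rows, `expand` produces 43 elements and `median` returns 22.0. The expansion step is what becomes impractical when the frequencies are large.
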
      Current Percentile definition

      percentile(col, array(percentage1 [, percentage2]...))

      import org.apache.spark.sql.catalyst.expressions.Expression

      case class Percentile(
        child: Expression,
        percentageExpression: Expression,
        mutableAggBufferOffset: Int = 0,
        inputAggBufferOffset: Int = 0) {
        // Two-argument constructor: both buffer offsets start at 0.
        def this(child: Expression, percentageExpression: Expression) = {
          this(child, percentageExpression, 0, 0)
        }
      }
      

      Proposed changes

      percentile(col, [frequency,] array(percentage1 [, percentage2]...))

      import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}

      case class Percentile(
        child: Expression,
        frequency: Expression,
        percentageExpression: Expression,
        mutableAggBufferOffset: Int = 0,
        inputAggBufferOffset: Int = 0) {
        // Without a frequency column, every row counts exactly once.
        def this(child: Expression, percentageExpression: Expression) = {
          this(child, Literal(1L), percentageExpression, 0, 0)
        }
        // With a frequency column, each row counts `frequency` times.
        def this(child: Expression, frequency: Expression, percentageExpression: Expression) = {
          this(child, frequency, percentageExpression, 0, 0)
        }
      }
      

      Although this definition differs from the Hive implementation, it would be useful functionality for many Spark users.
      Moreover, the changes are local to the Percentile and ApproxPercentile implementations.
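
The intended semantics of a frequency-aware percentile can be sketched in plain Scala without expanding the rows. This is only an illustration under stated assumptions: it treats each value as if it occurred `frequency` times and uses linear interpolation between the closest ranks, which is one common percentile definition and is assumed here rather than taken from the issue. `WeightedPercentile` and its method are hypothetical names, not part of Spark.

```scala
object WeightedPercentile {
  /** Exact percentile over (value, frequency) pairs, computed as if each
    * value appeared `frequency` times. `p` is a fraction in [0, 1]. */
  def percentile(table: Seq[(Double, Long)], p: Double): Double = {
    require(p >= 0.0 && p <= 1.0, "percentage must be in [0, 1]")
    require(table.forall(_._2 >= 0L), "frequencies must be non-negative")
    val sorted = table.filter(_._2 > 0L).sortBy(_._1)
    require(sorted.nonEmpty, "need at least one row with positive frequency")

    // cumulative(i) = number of virtual rows with value <= sorted(i)._1
    val cumulative = sorted.map(_._2).scanLeft(0L)(_ + _).tail
    val n = cumulative.last

    // Value found at a zero-based rank of the virtual, sorted, expanded data.
    def valueAt(rank: Long): Double =
      sorted(cumulative.indexWhere(_ > rank))._1

    // Linear interpolation between the two ranks surrounding the position.
    val position = (n - 1) * p
    val lower = math.floor(position).toLong
    val upper = math.ceil(position).toLong
    val lo = valueAt(lower)
    val hi = valueAt(upper)
    lo + (hi - lo) * (position - lower)
  }
}
```

With the three explicit leading rows of the table in the description, `WeightedPercentile.percentile(Seq((21.0, 10L), (22.0, 15L), (23.0, 18L)), 0.5)` returns 22.0. The work is proportional to the number of distinct rows rather than to the total count.
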

          People

            Assignee: Unassigned
            Reporter: gagan taneja (tanejagagan)
            Shepherd: Herman van Hövell
            Votes: 0
            Watchers: 2