[SPARK-18940] Percentile and approximate percentile support for frequency distribution table - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.0.2
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

I have a frequency distribution table with the following entries:

Age, No. of persons
21, 10
22, 15
23, 18
..
..
30, 14

It is common to have data in frequency distribution format and to need the percentile or median computed from it. With the current implementation, finding the percentile of such data is difficult and complex.
I am therefore proposing an enhancement to the current Percentile and ApproxPercentile implementations so that they take a frequency column into consideration.
Current Percentile definition

percentile(col, array(percentage1 [, percentage2]...))

case class Percentile(
  child: Expression,
  percentageExpression: Expression,
  mutableAggBufferOffset: Int = 0,
  inputAggBufferOffset: Int = 0) {
  // Convenience constructor used when only the column and percentages are given.
  def this(child: Expression, percentageExpression: Expression) = {
    this(child, percentageExpression, 0, 0)
  }
}

Proposed changes

percentile(col [, frequency], array(percentage1 [, percentage2]...))

case class Percentile(
  child: Expression,
  frequency: Expression,
  percentageExpression: Expression,
  mutableAggBufferOffset: Int = 0,
  inputAggBufferOffset: Int = 0) {
  // Existing two-argument form: every row counts once (frequency = 1).
  def this(child: Expression, percentageExpression: Expression) = {
    this(child, Literal(1L), percentageExpression, 0, 0)
  }
  // New three-argument form: each row is weighted by its frequency value.
  def this(child: Expression, frequency: Expression, percentageExpression: Expression) = {
    this(child, frequency, percentageExpression, 0, 0)
  }
}

Although this definition differs from the Hive implementation, it will be useful functionality for many Spark users.
Moreover, the changes are local to the Percentile and ApproxPercentile implementations.
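
To make the requested semantics concrete, here is a minimal sketch in plain Scala, outside Spark, of an exact percentile computed directly from (value, frequency) pairs without expanding them into individual rows. The object name FrequencyPercentile and its percentile method are hypothetical and are not part of Spark. The sketch assumes one common percentile definition, linear interpolation between the closest ranks, which may not match every detail of Spark's Percentile.

```scala
// Hypothetical helper for illustration only; not Spark's implementation.
object FrequencyPercentile {
  // table: (value, frequency) pairs; p: percentage in [0, 1].
  def percentile(table: Seq[(Double, Long)], p: Double): Double = {
    require(p >= 0.0 && p <= 1.0, "percentage must be in [0, 1]")
    // Ignore rows that contribute no observations, then order by value.
    val sorted = table.filter(_._2 > 0).sortBy(_._1)
    require(sorted.nonEmpty, "table needs at least one positive frequency")

    // cumulative(i) = number of observations with value <= sorted(i)._1
    val cumulative = sorted.scanLeft(0L)(_ + _._2).tail
    val total = cumulative.last

    // Value found at 0-based rank idx if the table were expanded row by row.
    def valueAt(idx: Long): Double =
      sorted(cumulative.indexWhere(_ > idx))._1

    // Assumed definition: fractional rank, interpolated between neighbours.
    val position = (total - 1) * p
    val lower = math.floor(position).toLong
    val upper = math.ceil(position).toLong
    val lo = valueAt(lower)
    val hi = valueAt(upper)
    lo + (hi - lo) * (position - lower)
  }
}
```

With every frequency equal to 1 this reduces to the ordinary percentile of the values, which is the behaviour the two-argument constructor preserves by defaulting frequency to Literal(1L).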

People

Assignee: Unassigned

Reporter: gagan taneja

Shepherd: Herman van Hövell

Votes: 0

Watchers: 2

Dates

Created: 20/Dec/16 08:11

Updated: 21/May/19 04:17

Resolved: 21/May/19 04:17
