Description
We currently define statistics in UnaryNode:
override def statistics: Statistics = {
  // There should be some overhead in Row object, the size should not be zero when there is
  // no columns, this help to prevent divide-by-zero error.
  val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
  val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
  // Assume there will be the same number of rows as child has.
  var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / childRowSize
  if (sizeInBytes == 0) {
    // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
    // (product of children).
    sizeInBytes = 1
  }
  child.statistics.copy(sizeInBytes = sizeInBytes)
}
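To make the arithmetic concrete, here is a standalone sketch of the same heuristic as a pure function, with hypothetical column widths (an Int at 4 bytes, and an array column whose width is estimated as 100 elements times the element size, matching the default element count discussed below). The names and numbers are illustrative only, not Spark's actual API:

object SizeEstimateSketch {
  // Mirrors the shape of UnaryNode.statistics above: scale the child's size by
  // the ratio of output row width to child row width, never returning zero.
  def estimate(childSizeInBytes: BigInt, childRowSize: Long, outputRowSize: Long): BigInt = {
    val scaled = childSizeInBytes * outputRowSize / childRowSize
    if (scaled == 0) BigInt(1) else scaled
  }

  def main(args: Array[String]): Unit = {
    // Child row: one Int column plus one array-of-Int column. If the array is
    // assumed to hold 100 elements, it contributes 100 * 4 = 400 bytes.
    val childRowSize  = 4 + 100 * 4 + 8  // 412, including the 8-byte row overhead
    // The Project keeps only the Int column.
    val outputRowSize = 4 + 8            // 12
    // A 1 GB child is estimated at roughly 31 MB after the projection. If the
    // real arrays hold far fewer than 100 elements, the true output is much
    // larger than this, i.e. the estimate is a large underestimate.
    println(estimate(BigInt(1) << 30, childRowSize, outputRowSize))
  }
}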
This has a few issues:
1. It can aggressively underestimate the output size of a Project. We assume each array/map has 100 elements when computing defaultSize, which is an overestimate; because the child's real sizeInBytes is divided by this inflated child row width, the implied row count, and therefore the projected size, comes out far too small. If the user projects a single field out of a deeply nested structure, this leads to a huge underestimate (as in the sketch above). A safer default is probably 1 element.
2. Propagating statistics this way is not really a property of UnaryNode in general; it is specific to Project, the operator that actually changes the row shape, and should be defined there (a hypothetical sketch of that placement follows this list).
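One possible shape for the second point, as a purely hypothetical, self-contained model (not Spark's actual classes, and not necessarily the eventual fix): the generic unary node's statistics become a simple pass-through of the child's statistics, and only the projection-like operator applies the row-width scaling.

object StatsPlacementSketch {
  final case class Stats(sizeInBytes: BigInt)

  sealed trait Node {
    def rowWidth: Long   // estimated width of one output row, in bytes
    def stats: Stats
  }

  // A scan/leaf whose statistics come from the data source.
  final case class Leaf(rowWidth: Long, sizeInBytes: BigInt) extends Node {
    def stats: Stats = Stats(sizeInBytes)
  }

  // Generic unary operators: forward the child's statistics unchanged.
  trait UnaryLike extends Node {
    def child: Node
    def rowWidth: Long = child.rowWidth
    def stats: Stats = child.stats
  }

  // A filter-style operator does not change row shape, so the default applies.
  final case class FilterLike(child: Node) extends UnaryLike

  // Only the projection-style operator rescales size by the row-width ratio.
  final case class ProjectLike(child: Node, outputWidth: Long) extends UnaryLike {
    override def rowWidth: Long = outputWidth
    override def stats: Stats = {
      val scaled =
        child.stats.sizeInBytes * (outputWidth + 8) / (child.rowWidth + 8)
      Stats(scaled.max(1)) // never report 0, so parents don't multiply sizes to 0
    }
  }
}

Under this placement, operators that preserve row shape (filters, sorts, etc.) no longer shrink or inflate sizeInBytes, and the scaling heuristic lives next to the one operator whose semantics it encodes.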
Issue Links
- relates to SPARK-18676 "Spark 2.x query plan data size estimation can crash join queries versus 1.x" (Resolved)