Description
We currently define statistics in UnaryNode:
override def statistics: Statistics = {
  // There should be some overhead in Row object, the size should not be zero when there is
  // no columns, this help to prevent divide-by-zero error.
  val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
  val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
  // Assume there will be the same number of rows as child has.
  var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / childRowSize
  if (sizeInBytes == 0) {
    // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
    // (product of children).
    sizeInBytes = 1
  }
  child.statistics.copy(sizeInBytes = sizeInBytes)
}
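To make the arithmetic concrete, here is a standalone sketch of the same heuristic as a pure function, with hypothetical column widths (an Int at 4 bytes, and an array column whose width is estimated as 100 elements times the element size, matching the default element count discussed below). The names and numbers are illustrative only, not Spark's actual API:

object SizeEstimateSketch {
  // Mirrors the shape of UnaryNode.statistics above: scale the child's size by
  // the ratio of output row width to child row width, never returning zero.
  def estimate(childSizeInBytes: BigInt, childRowSize: Long, outputRowSize: Long): BigInt = {
    val scaled = childSizeInBytes * outputRowSize / childRowSize
    if (scaled == 0) BigInt(1) else scaled
  }

  def main(args: Array[String]): Unit = {
    // Child row: one Int column plus one array-of-Int column. If the array is
    // assumed to hold 100 elements, it contributes 100 * 4 = 400 bytes.
    val childRowSize  = 4 + 100 * 4 + 8  // 412, including the 8-byte row overhead
    // The Project keeps only the Int column.
    val outputRowSize = 4 + 8            // 12
    // A 1 GB child is estimated at roughly 31 MB after the projection. If the
    // real arrays hold far fewer than 100 elements, the true output is much
    // larger than this, i.e. the estimate is a large underestimate.
    println(estimate(BigInt(1) << 30, childRowSize, outputRowSize))
  }
}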
This has a few issues:
1. It can aggressively underestimate the output size of a Project. We assume each array/map has 100 elements when computing defaultSize, which is an overestimate; because the child's real sizeInBytes is divided by this inflated child row width, the implied row count, and therefore the projected size, comes out far too small. If the user projects a single field out of a deeply nested structure, this leads to a huge underestimate (as in the sketch above). A safer default is probably 1 element.
2. Propagating statistics this way is not really a property of UnaryNode in general; it is specific to Project, the operator that actually changes the row shape, and should be defined there (a hypothetical sketch of that placement follows this list).
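One possible shape for the second point, as a purely hypothetical, self-contained model (not Spark's actual classes, and not necessarily the eventual fix): the generic unary node's statistics become a simple pass-through of the child's statistics, and only the projection-like operator applies the row-width scaling.

object StatsPlacementSketch {
  final case class Stats(sizeInBytes: BigInt)

  sealed trait Node {
    def rowWidth: Long   // estimated width of one output row, in bytes
    def stats: Stats
  }

  // A scan/leaf whose statistics come from the data source.
  final case class Leaf(rowWidth: Long, sizeInBytes: BigInt) extends Node {
    def stats: Stats = Stats(sizeInBytes)
  }

  // Generic unary operators: forward the child's statistics unchanged.
  trait UnaryLike extends Node {
    def child: Node
    def rowWidth: Long = child.rowWidth
    def stats: Stats = child.stats
  }

  // A filter-style operator does not change row shape, so the default applies.
  final case class FilterLike(child: Node) extends UnaryLike

  // Only the projection-style operator rescales size by the row-width ratio.
  final case class ProjectLike(child: Node, outputWidth: Long) extends UnaryLike {
    override def rowWidth: Long = outputWidth
    override def stats: Stats = {
      val scaled =
        child.stats.sizeInBytes * (outputWidth + 8) / (child.rowWidth + 8)
      Stats(scaled.max(1)) // never report 0, so parents don't multiply sizes to 0
    }
  }
}

Under this placement, operators that preserve row shape (filters, sorts, etc.) no longer shrink or inflate sizeInBytes, and the scaling heuristic lives next to the one operator whose semantics it encodes.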
Issue Links
- relates to SPARK-18676 "Spark 2.x query plan data size estimation can crash join queries versus 1.x" (Resolved)