Spark / SPARK-18853

Project (UnaryNode) is way too aggressive in estimating statistics

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.3, 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

      We currently define statistics in UnaryNode:

        override def statistics: Statistics = {
          // A Row object carries some overhead, so the row size must not be zero even when
          // there are no columns. The extra 8 bytes also prevent a divide-by-zero error.
          val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
          val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
          // Assume there will be the same number of rows as child has.
          var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / childRowSize
          if (sizeInBytes == 0) {
            // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
            // (product of children).
            sizeInBytes = 1
          }
      
          child.statistics.copy(sizeInBytes = sizeInBytes)
        }
      

      This has a few issues:

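      1. This can aggressively underestimate the size for Project. The default size of an array/map type assumes 100 elements, which is usually an overestimate and inflates the estimated child row size. If the user projects a single field out of a deeply nested column, the ratio of output row size to child row size becomes tiny, which leads to a huge underestimation of the output size. A safer default is probably 1 element.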

      2. It is not a property of UnaryNode to propagate statistics this way. It should be a property of Project.
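
      To make the first issue concrete, here is a minimal sketch of the same arithmetic on a toy type model. The types LongT, ArrayT and StructT, the helper estimate, and the example schema are all hypothetical stand-ins, not Spark classes; only the formula and the 100-element assumption come from the description above.

```scala
object ProjectStatsSketch {
  // Toy stand-ins for data types (hypothetical, NOT Spark's DataType hierarchy).
  // `elems` is the number of elements assumed per array.
  sealed trait DataType { def defaultSize(elems: Int): Long }
  case object LongT extends DataType {
    def defaultSize(elems: Int): Long = 8L
  }
  final case class ArrayT(element: DataType) extends DataType {
    def defaultSize(elems: Int): Long = elems.toLong * element.defaultSize(elems)
  }
  final case class StructT(fields: Seq[DataType]) extends DataType {
    def defaultSize(elems: Int): Long = fields.map(_.defaultSize(elems)).sum
  }

  // Same arithmetic as the statistics method quoted in the description:
  // scale the child's size by outputRowSize / childRowSize, with a floor of 1.
  def estimate(childSize: BigInt, child: Seq[DataType],
               output: Seq[DataType], elems: Int): BigInt = {
    val childRowSize = child.map(_.defaultSize(elems)).sum + 8
    val outputRowSize = output.map(_.defaultSize(elems)).sum + 8
    val size = (childSize * outputRowSize) / childRowSize
    if (size == 0) BigInt(1) else size
  }

  // Assumed example: one column of type struct<id: long, nested: array<array<long>>>,
  // from which the query projects only the `id` field.
  val childColumns: Seq[DataType] = Seq(StructT(Seq(LongT, ArrayT(ArrayT(LongT)))))
  val projected: Seq[DataType] = Seq(LongT)
  val oneGiB: BigInt = BigInt(1) << 30

  def main(args: Array[String]): Unit = {
    // 100 elements per array: childRowSize = 80016, outputRowSize = 16.
    println(estimate(oneGiB, childColumns, projected, elems = 100)) // 214705
    // 1 element per array: childRowSize = 24, outputRowSize = 16.
    println(estimate(oneGiB, childColumns, projected, elems = 1))   // 715827882
  }
}
```

      Under these assumptions, a 1 GiB child is estimated at roughly 210 KB after the projection when each array is assumed to hold 100 elements, and at roughly 683 MiB when each array is assumed to hold 1 element. If the real arrays are short, the first figure is far too small.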

          People

            Assignee: Reynold Xin (rxin)
            Reporter: Reynold Xin (rxin)
            Votes: 0
            Watchers: 4
