Spark / SPARK-16355

Incorrect Statistics when Queries Containing LIMIT/TABLESAMPLE 0


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      When a query contains LIMIT 0 or TABLESAMPLE 0, the statistics could be zero. The results are correct, but the zero estimate could cause a huge performance regression. For example,

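            // Assumes a SparkSession named `spark`; spark-shell provides it and this import.
            import spark.implicits._
            Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF("k", "v")
              .createOrReplaceTempView("test")
            val df1 = spark.table("test")
            // LIMIT 0: the size estimate of df2 becomes zero
            val df2 = spark.table("test").limit(0)
            // Left outer join on the key column "k"
            val df = df1.join(df2, Seq("k"), "left")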
      

      The statistics of both `df` and `df2` are zero. Statistics values should never be zero; otherwise the `sizeInBytes` of a `BinaryNode` is also zero, because it is computed as the product of its children's sizes.
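The propagation described above can be illustrated with a small, self-contained model. This is only a sketch in plain Scala, not Spark's code: `Plan`, `Relation`, `Limit`, `Join`, the fixed `rowWidth`, and the `guard` flag are all hypothetical names introduced for illustration. It assumes a limit estimate of row count times row width and a binary-node estimate equal to the product of its children, and shows that clamping the limit estimate to at least 1 is one possible way to keep the product from collapsing to zero.

```scala
// Toy model of size estimation; hypothetical types, not Spark's classes.
sealed trait Plan { def sizeInBytes: BigInt }

// Leaf node with a known size estimate.
final case class Relation(sizeInBytes: BigInt) extends Plan

// Limit estimate: row count times an assumed fixed row width.
// With guard = true, a zero estimate is clamped to 1.
final case class Limit(n: Int, rowWidth: Int, guard: Boolean) extends Plan {
  def sizeInBytes: BigInt = {
    val raw = BigInt(n) * rowWidth
    if (guard && raw == BigInt(0)) BigInt(1) else raw
  }
}

// Binary node: estimate is the product of the children's estimates.
final case class Join(left: Plan, right: Plan) extends Plan {
  def sizeInBytes: BigInt = left.sizeInBytes * right.sizeInBytes
}

object StatsSketch {
  def main(args: Array[String]): Unit = {
    val table = Relation(BigInt(64)) // e.g. 4 rows of 16 bytes each
    val unguarded = Join(table, Limit(0, 16, guard = false))
    val guarded = Join(table, Limit(0, 16, guard = true))
    println(s"unguarded join estimate: ${unguarded.sizeInBytes}") // 0
    println(s"guarded join estimate: ${guarded.sizeInBytes}")     // 64
  }
}
```

In the unguarded case a single zero-sized child zeroes the estimate of the whole join, which mirrors the behavior reported for `df`. With the guard, the join keeps a non-zero estimate.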


          People

            Assignee: smilegator Xiao Li
            Reporter: smilegator Xiao Li