  1. Apache AsterixDB
  2. ASTERIXDB-3470

CBO estimates incorrect join cardinality with IMDb datasets


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: COMP - Compiler

    Description

      The CBO greatly overestimates join cardinalities on the IMDb datasets, and the error grows as the sample size increases.

      IMDb dataset: http://homepages.cwi.nl/~boncz/job/imdb.tgz  (Please change the corresponding CSV file name to "keywords.csv")

      The attachments contain two 10-dataset join queries and the relevant DDL statements. Both queries join 10 IMDb datasets, each with different selectivity predicates.

      1. The actual cardinality of the first join query is 1298, whereas the estimated ones are:
        • with "low" sample: 7.489 x 10^10
        • with "medium" sample: 5.619 x 10^12
        • with "high" sample: 6.022 x 10^12
      2. The actual cardinality of the second one is 1062, whereas the estimated ones are:
        • with "low" sample: 4.33 x 10^8
        • with "medium" sample: 1.479 x 10^11
        • with "high" sample: 1.93 x 10^11
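
      To put the numbers above in perspective, the following is a minimal sketch that computes the overestimation factor (estimated cardinality divided by actual cardinality) for each query and sample size. The names ACTUAL, ESTIMATED, and overestimation_factors are illustrative only and are not part of AsterixDB; the values are copied from the lists above.

```python
# Actual and estimated join cardinalities, as reported in this issue.
ACTUAL = {"query 1": 1298, "query 2": 1062}
ESTIMATED = {
    "query 1": {"low": 7.489e10, "medium": 5.619e12, "high": 6.022e12},
    "query 2": {"low": 4.33e8, "medium": 1.479e11, "high": 1.93e11},
}


def overestimation_factors(actual, estimated):
    """Return {query: {sample: estimated / actual}} (hypothetical helper)."""
    return {
        query: {sample: est / actual[query] for sample, est in by_sample.items()}
        for query, by_sample in estimated.items()
    }


FACTORS = overestimation_factors(ACTUAL, ESTIMATED)

# Print one line per (query, sample size) pair.
for query, by_sample in FACTORS.items():
    for sample, factor in by_sample.items():
        print(f"{query}, {sample} sample: {factor:.2e}x the actual cardinality")
```

      With these inputs the factors range from roughly 4 x 10^5 (second query, "low" sample) to roughly 4.6 x 10^9 (first query, "high" sample), and for both queries the factor increases from "low" to "medium" to "high".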

      Attachments

        1. schema.txt
          2 kB
          Mehnaz Tabassum Mahin
        2. load_and_analyze.txt
          2 kB
          Mehnaz Tabassum Mahin
        3. join_queries.txt
          2 kB
          Mehnaz Tabassum Mahin

          People

            Assignee: murali krishna (murali4104)
            Reporter: Mehnaz Tabassum Mahin (Mehnaz)
            Votes: 0
            Watchers: 2
