Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
Join cardinalities on the IMDb datasets are greatly overestimated, and the inaccuracy grows as the sample size increases.
IMDb dataset: http://homepages.cwi.nl/~boncz/job/imdb.tgz (Please change the corresponding CSV file name to "keywords.csv")
The attachment contains two 10-dataset join queries and the relevant DDL statements. Both queries join 10 IMDb datasets but use different selectivity predicates.
- The actual cardinality of the first join query is 1298, whereas the estimated cardinalities are:
  - with "low" sample: 7.489 x 10^10
  - with "medium" sample: 5.619 x 10^12
  - with "high" sample: 6.022 x 10^12
- The actual cardinality of the second join query is 1062, whereas the estimated cardinalities are:
  - with "low" sample: 4.33 x 10^8
  - with "medium" sample: 1.479 x 10^11
  - with "high" sample: 1.93 x 10^11
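To put the size of the error in perspective, the following is a minimal sketch that computes the overestimation factor (estimated cardinality divided by actual cardinality) for each query and sample size, using only the numbers reported above. The labels `query1` and `query2` and the function name are hypothetical, chosen for illustration; they do not come from the attachment.

```python
# Actual and estimated cardinalities as reported in this issue.
# "query1"/"query2" are illustrative labels, not names from the attachment.
ACTUAL = {"query1": 1298, "query2": 1062}
ESTIMATES = {
    "query1": {"low": 7.489e10, "medium": 5.619e12, "high": 6.022e12},
    "query2": {"low": 4.33e8, "medium": 1.479e11, "high": 1.93e11},
}


def overestimation_factors(actual, estimates):
    """Return estimated/actual for every query and sample size."""
    return {
        query: {sample: est / actual[query] for sample, est in by_sample.items()}
        for query, by_sample in estimates.items()
    }


factors = overestimation_factors(ACTUAL, ESTIMATES)
for query, by_sample in factors.items():
    for sample, factor in by_sample.items():
        print(f"{query} {sample:>6}: {factor:.3e}x")
```

For both queries the factor rises from "low" to "medium" to "high", which matches the observation that the inaccuracy increases with the sample size.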