[ASTERIXDB-3470] CBO estimates incorrect join cardinality with IMDb datasets - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: COMP - Compiler
Labels:
- triaged

Description

The CBO greatly overestimates join cardinalities on the IMDb datasets, and the error grows as the sample size increases.

IMDb dataset: http://homepages.cwi.nl/~boncz/job/imdb.tgz (Please change the corresponding CSV file name to "keywords.csv")

The attachments contain two join queries and the relevant DDL statements. Each query joins 10 IMDb datasets, and the two queries use different selection predicates.

The actual cardinality of the first join query is 1298, whereas the estimated cardinalities are:
- with "low" sample: 7.489 x 10¹⁰
- with "medium" sample: 5.619 x 10¹²
- with "high" sample: 6.022 x 10¹²
The actual cardinality of the second join query is 1062, whereas the estimated cardinalities are:
- with "low" sample: 4.33 x 10⁸
- with "medium" sample: 1.479 x 10¹¹
- with "high" sample: 1.93 x 10¹¹
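
To put the figures above on a common scale, the following minimal sketch computes the q-error, max(estimated/actual, actual/estimated), for each sample size. The q-error metric and the names `ACTUAL`, `ESTIMATED`, `q_error` and `q_errors_by_sample` are illustrative assumptions and are not part of the report or of AsterixDB; only the cardinalities are taken from the description.

```python
# Cardinalities copied from the description above. The q-error metric is an
# illustrative choice for this sketch, not something the report itself uses.
ACTUAL = {"query1": 1298, "query2": 1062}
ESTIMATED = {
    "query1": {"low": 7.489e10, "medium": 5.619e12, "high": 6.022e12},
    "query2": {"low": 4.33e8, "medium": 1.479e11, "high": 1.93e11},
}


def q_error(estimated: float, actual: float) -> float:
    """Return max(est/act, act/est); 1.0 means a perfect estimate."""
    if estimated <= 0 or actual <= 0:
        raise ValueError("cardinalities must be positive")
    return max(estimated / actual, actual / estimated)


def q_errors_by_sample(query: str) -> dict[str, float]:
    """Map each sample size ("low", "medium", "high") to its q-error."""
    return {
        sample: q_error(est, ACTUAL[query])
        for sample, est in ESTIMATED[query].items()
    }


if __name__ == "__main__":
    for query in ACTUAL:
        for sample, err in q_errors_by_sample(query).items():
            print(f"{query} {sample:>6}: q-error = {err:.3e}")
```

Under this metric every estimate is off by a factor of more than 10⁵, and for both queries the factor rises from the "low" to the "medium" to the "high" sample.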

Attachments

- schema.txt (2 kB), added 28/Jul/24 00:30 by Mehnaz Tabassum Mahin
- load_and_analyze.txt (2 kB), added 28/Jul/24 00:31 by Mehnaz Tabassum Mahin
- join_queries.txt (2 kB), added 28/Jul/24 00:31 by Mehnaz Tabassum Mahin

People

Assignee: murali krishna

Reporter: Mehnaz Tabassum Mahin

Votes: 0

Watchers: 2

Dates

Created: 28/Jul/24 00:43

Updated: 16/Aug/24 21:25