[OAK-7300] Lucene Index: per-column selectivity to improve cost estimation - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: lucene, query
Labels: None
Epic Link: indexer resilience

Description

In ~~OAK-6735~~ we improved cost estimation for Lucene indexes. However, the following case still does not work as expected: a very common property is indexed (many nodes have that property), and each value of that property is more or less unique. In this case, the estimated cost is currently the total number of documents that contain that property. For the condition "property is not null" this is correct, but for the common case "property = x" the estimated cost is far too high.

A known workaround is to set "costPerEntry" for the given index to a low value, for example 0.2. However, this isn't a good solution, as it affects all properties and all queries that use the index.

It would be good to be able to set the selectivity per property, for example by specifying the number of distinct values or, better yet, the average number of entries per key (1 for unique values; 2 meaning that each distinct value matches two documents on average).
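To illustrate how such a per-property setting might feed into the estimate, here is a minimal sketch. It is not Oak's actual cost code: the class name, method name, and parameters are hypothetical, and it assumes the index already knows how many documents contain the property.

```java
// Hypothetical sketch, not Oak's implementation: all names are invented for illustration.
final class SelectivityCostSketch {

    /**
     * Estimates how many index entries a condition on one property returns.
     *
     * @param docsWithProperty  number of documents that contain the property
     * @param distinctValues    configured or estimated number of distinct values (0 = unknown)
     * @param equalityCondition true for "property = x", false for "property is not null"
     */
    static double estimateEntries(long docsWithProperty, long distinctValues,
                                  boolean equalityCondition) {
        // "is not null" matches every document that has the property,
        // and without selectivity information we fall back to the same number.
        if (!equalityCondition || distinctValues <= 0) {
            return docsWithProperty;
        }
        // Average number of entries per key: 1 for unique values, 2 if each
        // distinct value occurs in two documents on average, and so on.
        double averagePerKey = (double) docsWithProperty / distinctValues;
        // Never estimate fewer than one entry.
        return Math.max(1.0, averagePerKey);
    }
}
```

With one million documents and one million distinct values, the equality estimate drops from 1,000,000 to 1, while the "is not null" estimate is unchanged. Unlike lowering "costPerEntry", this affects only conditions on the one property.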

That value could be set manually (cost override), or set automatically, e.g. when building the index, or updated from time to time during index updates using a cardinality estimation algorithm. The estimate doesn't have to be accurate; we could use a rough approximation such as HyperBitBit.
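As a rough illustration of what an approximate distinct-value counter could look like, the sketch below uses linear counting, a simpler technique than HyperBitBit, chosen here only because it is short. The class is hypothetical and not part of Oak; the bitmap size and the hash mixer (the MurmurHash3 64-bit finalizer) are assumptions made for the example.

```java
import java.util.BitSet;

// Hypothetical sketch: linear counting, not HyperBitBit and not Oak code.
final class LinearCountingSketch {
    private final int bits;
    private final BitSet bitmap;

    LinearCountingSketch(int bits) {
        this.bits = bits;
        this.bitmap = new BitSet(bits);
    }

    /** Records one property value; duplicates set the same bit again. */
    void add(String value) {
        long h = mix(value.hashCode());
        bitmap.set((int) Long.remainderUnsigned(h, bits));
    }

    /** Estimates the number of distinct values as -m * ln(fraction of zero bits). */
    long estimateDistinct() {
        // If the bitmap is saturated, the result is only a lower bound.
        int zeros = Math.max(bits - bitmap.cardinality(), 1);
        return Math.round(-bits * Math.log((double) zeros / bits));
    }

    /** MurmurHash3 64-bit finalizer, used to spread String.hashCode() values. */
    private static long mix(long h) {
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }
}
```

Dividing the number of documents that contain the property by `estimateDistinct()` gives the average number of entries per key. Linear counting needs a bitmap roughly proportional to the number of distinct values, so for very large indexes a logarithmic-space estimator such as HyperBitBit or HyperLogLog would be the better fit.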

Issue Links

is related to

OAK-7379 Lucene Index: per-column selectivity, assume 5 unique entries (Closed)

relates to

OAK-6735 Lucene Index: improved cost estimation by using document count per field (Closed)

People

Assignee: Thomas Mueller

Reporter: Thomas Mueller

Votes: 0

Watchers: 2

Dates

Created: 01/Mar/18 14:27

Updated: 19/Nov/19 11:36