[SOLR-15008] Avoid building OrdinalMap for each facet - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 8.7
Fix Version/s: None
Component/s: Facet Module
Labels:
- performance

Description

I'm running into the following scenario:

- [JSON] faceting on a high-cardinality field
- few matching documents => few unique values

Yet the query almost always takes a long time. Here's an example taking almost 4s for ~300 documents and unique values (edited a bit):

    "QTime":3869,
    "params":{
      "json":"{\"query\": \"*:*\",
      \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", \"unique_id:49866\"],
      \"facet\": {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
      "rows":"0"}},
  "response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
  },
  "facets":{
    "count":333,
    "keywords":{
      "buckets":[{
          "val":"value1",
          "count":124},
  ...

I did some profiling with our Sematext Monitoring and it points to OrdinalMap building (see attached screenshot). If I read the code right, an OrdinalMap is built for every facet request. That is expensive because there are many unique values in the shard (previously there were more, smaller shards, which made latency better, but that approach doesn't scale for this particular use case).
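To illustrate why the cost follows the number of unique values in the shard rather than the number of matching documents, here is a minimal sketch of what a segment-to-global ordinal mapping has to do. This is not Lucene's actual OrdinalMap implementation; the class and method names are hypothetical, and each segment's term dictionary is assumed to be a sorted array of unique strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

public class GlobalOrdinalsSketch {

    /**
     * Hypothetical helper: for each segment, returns an array that maps a
     * segment-local ordinal to a shard-wide (global) ordinal.
     * Assumes every segment's terms are sorted and unique.
     */
    static int[][] buildGlobalOrds(List<String[]> segmentTerms) {
        // Every unique term of every segment is visited, regardless of
        // how many documents a later query will match.
        TreeSet<String> allTerms = new TreeSet<>();
        for (String[] terms : segmentTerms) {
            allTerms.addAll(Arrays.asList(terms));
        }
        List<String> global = new ArrayList<>(allTerms);

        int[][] segmentToGlobal = new int[segmentTerms.size()][];
        for (int seg = 0; seg < segmentTerms.size(); seg++) {
            String[] terms = segmentTerms.get(seg);
            segmentToGlobal[seg] = new int[terms.length];
            for (int ord = 0; ord < terms.length; ord++) {
                // Position of the term in the sorted global dictionary.
                segmentToGlobal[seg][ord] = Collections.binarySearch(global, terms[ord]);
            }
        }
        return segmentToGlobal;
    }

    public static void main(String[] args) {
        int[][] map = buildGlobalOrds(List.of(
                new String[] {"a", "c"},
                new String[] {"b", "c", "d"}));
        System.out.println(Arrays.deepToString(map));
    }
}
```

In this sketch the work is proportional to the total number of terms across segments, so a query matching ~300 documents still pays for the whole dictionary if the mapping is rebuilt each time.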

If I'm right up to this point, I see a couple of potential improvements, inspired by Elasticsearch:

- Keep the OrdinalMap cached until the next softCommit, so that only the first query pays the penalty
- Allow faceting on actual values (a Map) rather than ordinals, for situations like the one above where we have few matching documents. We could potentially auto-detect this scenario (e.g. by configuring a threshold) and use a Map when there are few documents
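A rough sketch of both ideas, to make the proposal concrete. Nothing here is Solr's or Lucene's API: `PerSearcherCache`, `searcherGeneration`, `useValueMap` and `countByValue` are hypothetical names, and documents are modelled as plain lists of string values.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class FacetIdeasSketch {

    /** Idea 1 (hypothetical): keep an expensive structure until a new searcher opens. */
    static final class PerSearcherCache<T> {
        private final Map<Long, T> cache = new ConcurrentHashMap<>();
        final AtomicInteger builds = new AtomicInteger();

        T get(long searcherGeneration, Supplier<T> builder) {
            // Only the first request per searcher generation pays the build cost.
            return cache.computeIfAbsent(searcherGeneration, g -> {
                builds.incrementAndGet();
                return builder.get();
            });
        }

        /** Would be called after a (soft) commit opens a new searcher. */
        void onNewSearcher(long newGeneration) {
            cache.keySet().removeIf(g -> g != newGeneration);
        }
    }

    /** Idea 2 (hypothetical): decide by a configurable threshold. */
    static boolean useValueMap(int matchingDocs, int threshold) {
        return matchingDocs <= threshold;
    }

    /** Counts facet values directly in a Map, without any ordinals. */
    static Map<String, Integer> countByValue(List<List<String>> matchingDocValues) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> values : matchingDocValues) {
            for (String value : values) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(List.of("value1", "value2"), List.of("value1"));
        if (useValueMap(docs.size(), 1000)) {
            System.out.println(countByValue(docs));
        }
    }
}
```

With the Map-based path the work depends only on the values of the matching documents, which is the case described above; the cache covers the case where ordinals are still the better choice but should not be rebuilt per request.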

I'm curious what you think:

- Would a PR/patch be welcome for either of the two ideas above?
- Do you see better options? Am I missing something?

Attachments

- Screenshot 2020-11-19 at 12.01.55.png (19/Nov/20 10:50, 263 kB, Radu Gheorghe)
- writes_commits.png (20/Nov/20 07:33, 347 kB, Radu Gheorghe)

People

Assignee: Unassigned

Reporter: Radu Gheorghe

Votes: 0

Watchers: 2

Dates

Created: 19/Nov/20 11:04

Updated: 13/Feb/21 00:09

Resolved: 23/Nov/20 16:42