[LUCENE-4619] Create a specialized path for facets counting - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: modules/facet
Labels: None
Lucene Fields: New, Patch Available

Description

Mike and I have been discussing this on several issues (LUCENE-4600, LUCENE-4602) and on GTalk. It looks like the current API abstractions may be responsible for some of the performance loss that we see compared to specialized code.

During our discussion, we decided to target one specific use case, facet counting, and work on it top to bottom, reusing as much code as possible. Specifically, we'd like to implement a FacetsCollector/Accumulator which does only counting (i.e. respects only CountFacetRequest), with no sampling, partitions or complements. The API allows us to do so very cleanly, and in the context of this issue, we'd like to do the following:

Implement a FacetsField which takes a TaxonomyWriter, FacetIndexingParams and CategoryPath (List, Iterable, whatever) and adds the needed information to both the taxonomy index and the search index.
- That API is similar in nature to CategoryDocumentBuilder, only easier to consume: it's just another field that you add to the Document.
- We'll have two extensions for it, PayloadFacetsField and DocValuesFacetsField, so that we can benchmark the two approaches. We believe one of them will eventually be eliminated, leaving just one (hopefully the DV one).

Implement a FacetsAccumulator or Collector which takes a bunch of CountFacetRequests and returns the top counts.
- Aggregation is done during collection, rather than afterwards. Note that LUCENE-4600 is open for exploring that; we can finish the exploration either here or there. Just FYI that the issue exists.
- Reuses the CategoryListIterator, IntDecoder and Aggregator code. I'll open a separate issue to explore making that API operate in bulk, and then we can decide whether this specialized Collector should use those abstractions or be fully optimized for the facet counting case.
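
To make the "aggregate during collection" idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene facet API: `CountingCollectorSketch`, `collect`, `count` and `topK` are hypothetical names, and the per-document ordinals are passed in as an `int[]` rather than decoded from a payload or doc values. It only illustrates bumping a counts array once per matching document and then picking the top counts among a set of candidate ordinals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not a Lucene class: counts category ordinals as
// documents are collected, instead of in a separate pass after the search.
final class CountingCollectorSketch {
    private final int[] counts; // one slot per taxonomy ordinal

    CountingCollectorSketch(int taxonomySize) {
        counts = new int[taxonomySize];
    }

    // Called once per matching document with that document's category ordinals.
    void collect(int[] docOrdinals) {
        for (int ord : docOrdinals) {
            counts[ord]++;
        }
    }

    int count(int ord) {
        return counts[ord];
    }

    // Returns up to k candidate ordinals with a non-zero count, highest count
    // first; ties are broken in favour of the smaller ordinal.
    List<Integer> topK(int[] candidateOrdinals, int k) {
        // Min-heap: the head is the weakest entry (lowest count, then larger ordinal).
        PriorityQueue<Integer> heap = new PriorityQueue<>((a, b) -> {
            int byCount = Integer.compare(counts[a], counts[b]);
            return byCount != 0 ? byCount : Integer.compare(b, a);
        });
        for (int ord : candidateOrdinals) {
            if (counts[ord] == 0) {
                continue;
            }
            heap.add(ord);
            if (heap.size() > k) {
                heap.poll(); // drop the weakest entry
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(0, heap.poll()); // weakest comes out first, so prepend
        }
        return result;
    }
}
```

In this sketch the candidate ordinals would be the children of the category named in a count request; how those are obtained from the taxonomy is left out.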

At the moment, this path will assume that a document holds multiple dimensions, but only one value from each (e.g. a document cannot have both Author/Shai and Author/Mike), and will therefore use OrdPolicy.NO_PARENTS.
- Later, we'd like to explore how to have this specialized path handle the ALL_PARENTS case too, as it shouldn't be hard to do.
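
As I understand it, when only the category's own ordinal is recorded per document (the NO_PARENTS case), counts for ancestor categories have to be derived by summing child counts up the taxonomy tree. The sketch below shows that roll-up in plain Java. `ParentRollupSketch` and `rollUp` are hypothetical names, and the `parents` array layout (root at ordinal 0 with parent -1, every parent ordinal smaller than its children's) is an assumption made for the example. The one-value-per-dimension restriction matters here: a document carrying both Author/Shai and Author/Mike would contribute 2 to Author after the roll-up, although it is a single document.

```java
// Hypothetical sketch, not a Lucene class: derives ancestor counts from
// counts that were recorded only for the categories attached to documents.
final class ParentRollupSketch {

    // parents[ord] is the parent ordinal of ord; the root (ordinal 0) has -1.
    // Assumes every parent has a smaller ordinal than its children.
    static int[] rollUp(int[] recordedCounts, int[] parents) {
        int[] counts = recordedCounts.clone();
        // Walk from the highest ordinal down, so each category has already
        // received its children's counts before it is added to its parent.
        for (int ord = counts.length - 1; ord > 0; ord--) {
            int parent = parents[ord];
            if (parent >= 0) {
                counts[parent] += counts[ord];
            }
        }
        return counts;
    }
}
```

For example, with ordinals 0 = root, 1 = Author, 2 = Author/Shai, 3 = Author/Mike, 4 = Date, 5 = Date/2012, recorded counts of 2, 1 and 3 for ordinals 2, 3 and 5 roll up to 3 for Author and 3 for Date.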

Attachments

LUCENE-4619.patch
12/Dec/12 13:54
10 kB
Michael McCandless

Issue Links

Is contained by

LUCENE-4600 Explore facets aggregation during documents collection

Closed

People

Assignee: Unassigned

Reporter: Shai Erera

Votes: 0

Watchers: 2

Dates

Created: 12/Dec/12 13:06

Updated: 28/Aug/22 13:34

Resolved: 21/Jan/13 07:01