Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Mike experimented with encoding just the exact categories ordinals on LUCENE-4602, and I added OrdinalPolicy.NO_PARENTS, with a comment saying that this requires a special FacetsAccumulator.

      The idea is to write the exact categories only for each document, and then at search time count up the parents chain to compute requested facets (I say count, but it can be any weight).

      One limitation of such accumulator is that it cannot be used when e.g. a document is associated with two categories who share the same parent, because that may result in incorrect weights computed (e.g. a document might have several Authors, and so counting the Author facet may yield wrong counts). So it can be used only when the app knows it doesn't add such facets, or that it always asks to aggregate a 'root' that in its path this criteria doesn't hold (no categories share the same parent).

        Activity

        Hide
        Shai Erera added a comment -

        Borrowing a comment I made on LUCENE-4602 - we should pull the parents[] from TaxoReader, rather than call getParent for every ordinal. Though I didn't benchmark it, when I called tr.getParent() in TestDirTaxoWriter.testConcurrency, which generates a large taxonomy, I saw nearly 3x speedups after moving to pulling the array. 100 iterations dropped from 150s to 60s.

        Show
        Shai Erera added a comment - Borrowing a comment I made on LUCENE-4602 - we should pull the parents[] from TaxoReader, rather than call getParent for every ordinal. Though I didn't benchmark it, when I called tr.getParent() in TestDirTaxoWriter.testConcurrency, which generates a large taxonomy, I saw nearly 3x speedups after moving to pulling the array. 100 iterations dropped from 150s to 60s.
        Hide
        Shai Erera added a comment -

        Talked to Gilad about it, he reminded me of the Aggregator class which is responsible for aggregating a document's ordinals. I.e., this is where you would tap in to e.g. compute functions on ordinals, aggregate by doc score etc.

        So what we should do is write a NoParentsAggregator, which for every ordinal, counts the parents too. Such aggregator can be implemented to produce 'correct' counts, because in the context of a single document, it can know that it already counted a parent. If we do that, users will only need to set the right Aggregator on the FacetRequest, and we will still be able to use StandardFacetsAccumulator, which handles partitions as well (we don't want to duplicate much code).

        I'll look into it, and if it makes sense, rename the issue too.

        Show
        Shai Erera added a comment - Talked to Gilad about it, he reminded me of the Aggregator class which is responsible for aggregating a document's ordinals. I.e., this is where you would tap in to e.g. compute functions on ordinals, aggregate by doc score etc. So what we should do is write a NoParentsAggregator, which for every ordinal, counts the parents too. Such aggregator can be implemented to produce 'correct' counts, because in the context of a single document, it can know that it already counted a parent. If we do that, users will only need to set the right Aggregator on the FacetRequest, and we will still be able to use StandardFacetsAccumulator, which handles partitions as well (we don't want to duplicate much code). I'll look into it, and if it makes sense, rename the issue too.
        Hide
        Michael McCandless added a comment -

        I think counting the parents ordinals on the fly is going to be much more costly than aggregating up only in the end?

        I suspect that was a big part of the gains I saw, since it means we only count 1 int not 3 in my test (but we should separately test it). I realize that means the NoParentsAggregator would not be "general purpose", because you couldn't use it on multi-valued fields, but I suspect in the common case many facet dimensions are single-valued.

        Also, for the multi-valued case, having NoParentsAccumulator that must dedup on-the-fly is likely to be expensive? Ie I think it's likely for the multi-valued case that you will want to dedup at indexing time and store the full-path ords in the index (ie what we do today by default)?

        Show
        Michael McCandless added a comment - I think counting the parents ordinals on the fly is going to be much more costly than aggregating up only in the end? I suspect that was a big part of the gains I saw, since it means we only count 1 int not 3 in my test (but we should separately test it). I realize that means the NoParentsAggregator would not be "general purpose", because you couldn't use it on multi-valued fields, but I suspect in the common case many facet dimensions are single-valued. Also, for the multi-valued case, having NoParentsAccumulator that must dedup on-the-fly is likely to be expensive? Ie I think it's likely for the multi-valued case that you will want to dedup at indexing time and store the full-path ords in the index (ie what we do today by default)?
        Hide
        Shai Erera added a comment -

        I think counting the parents ordinals on the fly is going to be much more costly than aggregating up only in the end?

        That's why we have benchmark, no?

        I understand what you say, so let's say that we can have NoParentsAggregator, which dedups and works properly, counting on the fly. And if performance is not as good as NoParentsAccumulator, then we do both, documenting the accumulator's limitations?

        Show
        Shai Erera added a comment - I think counting the parents ordinals on the fly is going to be much more costly than aggregating up only in the end? That's why we have benchmark, no? I understand what you say, so let's say that we can have NoParentsAggregator, which dedups and works properly, counting on the fly. And if performance is not as good as NoParentsAccumulator, then we do both, documenting the accumulator's limitations?
        Hide
        Michael McCandless added a comment -

        I understand what you say, so let's say that we can have NoParentsAggregator, which dedups and works properly, counting on the fly. And if performance is not as good as NoParentsAccumulator, then we do both, documenting the accumulator's limitations?

        +1 for sure

        I feel like we are gonna need something, somewhere, which looks at FSP (maybe with some "additions", to include things like "these N dimensions are single-valued"), and then picks the fastest accumulator/aggregator/collector, that will give correct counts, for your situation.

        Show
        Michael McCandless added a comment - I understand what you say, so let's say that we can have NoParentsAggregator, which dedups and works properly, counting on the fly. And if performance is not as good as NoParentsAccumulator, then we do both, documenting the accumulator's limitations? +1 for sure I feel like we are gonna need something, somewhere, which looks at FSP (maybe with some "additions", to include things like "these N dimensions are single-valued"), and then picks the fastest accumulator/aggregator/collector, that will give correct counts, for your situation.
        Hide
        Gilad Barkai added a comment -

        Since taxonomies do not tend to change in depth (i.e, if FileType has only depth 1, it is not likely it will suddenly grow to depth 3) - than perhaps the definition of "these N dimensions are single-valued" should be set in a CategoryListParam ?

        Shai mentioned yesterday that adding the parents during aggregation is problematic when partitions are in place. Parents might not be on the same partition as their children, nor the aggregator is aware of the partition it is currently aggregating upon. For partitions it seems adding the parents in a post process - as was first suggested - is the right approach.

        Show
        Gilad Barkai added a comment - Since taxonomies do not tend to change in depth (i.e, if FileType has only depth 1, it is not likely it will suddenly grow to depth 3) - than perhaps the definition of "these N dimensions are single-valued" should be set in a CategoryListParam ? Shai mentioned yesterday that adding the parents during aggregation is problematic when partitions are in place. Parents might not be on the same partition as their children, nor the aggregator is aware of the partition it is currently aggregating upon. For partitions it seems adding the parents in a post process - as was first suggested - is the right approach.
        Hide
        Shai Erera added a comment -

        Patch implements NoParentsAggregator and a matching test. Note the nocommits - the aggregator still does not dedup parents as it counts them. I wonder what's the best way to do it, i.e. a FixedBitSet, an IntHashSet or something else. If you have any insights, please share !

        Show
        Shai Erera added a comment - Patch implements NoParentsAggregator and a matching test. Note the nocommits - the aggregator still does not dedup parents as it counts them. I wonder what's the best way to do it, i.e. a FixedBitSet, an IntHashSet or something else. If you have any insights, please share !
        Hide
        Shai Erera added a comment -

        LUCENE-4600 added CountingFacetsCollector which supports OrdinalPolicy.NO_PARENTS (i.e. index only leaf nodes). Benchmark tests show that it performs better than when all parents are indexed (OrdinalPolicy.ALL_PARENTS), probably because less data is read from disk as well as less computation done during search.

        We might therefore make NO_PARENTS the default OrdinalPolicy, and therefore need to address it for StandardFacetsAccumulator, perhaps in the form of a NoParentsAccumulator.

        Show
        Shai Erera added a comment - LUCENE-4600 added CountingFacetsCollector which supports OrdinalPolicy.NO_PARENTS (i.e. index only leaf nodes). Benchmark tests show that it performs better than when all parents are indexed (OrdinalPolicy.ALL_PARENTS), probably because less data is read from disk as well as less computation done during search. We might therefore make NO_PARENTS the default OrdinalPolicy, and therefore need to address it for StandardFacetsAccumulator, perhaps in the form of a NoParentsAccumulator.
        Hide
        Shai Erera added a comment -

        Based on the results from LUCENE-4600, it doesn't look like NO_PARENTS will be made the default. And even regardless of the results, it's not a good default because the user can easily make a mistake and index multiple CPs that share a parent in one document. NO_PARENTS is an optimization mostly, while ALL_PARENTS cannot make a mistake (i.e. over count parents). Therefore it should be treated as such.

        Also, I've been thinking about what would it take to support NO_PARENTS by StandardFacetsCollector and it's not so trivial. Is it an Aggregator, an Accumulator? How does it work in conjunction with partitions, complements, sampling? And how does it fit with LUCENE-4700?

        There are a lot of open questions. I think that for now the safest thing would be to document that only CountingFacetsCollector works with it (given its limited support), and do a hard check in StandardFacetsCollector (or FacetsCollector.create) and fail fast.

        Show
        Shai Erera added a comment - Based on the results from LUCENE-4600 , it doesn't look like NO_PARENTS will be made the default. And even regardless of the results, it's not a good default because the user can easily make a mistake and index multiple CPs that share a parent in one document. NO_PARENTS is an optimization mostly, while ALL_PARENTS cannot make a mistake (i.e. over count parents). Therefore it should be treated as such. Also, I've been thinking about what would it take to support NO_PARENTS by StandardFacetsCollector and it's not so trivial. Is it an Aggregator, an Accumulator? How does it work in conjunction with partitions, complements, sampling? And how does it fit with LUCENE-4700 ? There are a lot of open questions. I think that for now the safest thing would be to document that only CountingFacetsCollector works with it (given its limited support), and do a hard check in StandardFacetsCollector (or FacetsCollector.create) and fail fast.
        Hide
        Shai Erera added a comment -

        We have OrdinalPolicy.NO_PARENTS with a warning not to use it when a document will share few categories with common ancestors. I don't think that at this point we should do anything more.

        Show
        Shai Erera added a comment - We have OrdinalPolicy.NO_PARENTS with a warning not to use it when a document will share few categories with common ancestors. I don't think that at this point we should do anything more.

          People

          • Assignee:
            Unassigned
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development