Lucene - Core
  1. Lucene - Core
  2. LUCENE-4885

FacetsAccumulator set incorrect value for FacetResult.numValidDescendants

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3, 6.0
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This is cheap to compute, since the TopKFRH already must visit all non-zero-count ords under the FacetRequest.categoryPath.

      This can be useful to a front end, eg to know whether to present a "More..." under that dimension or not, whether to use a suggester like LinkedIn's facet UI, etc.

      1. LUCENE-4885.patch
        18 kB
        Shai Erera
      2. LUCENE-4885.patch
        17 kB
        Michael McCandless
      3. LUCENE-4885.patch
        14 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Initial patch.

        Show
        Michael McCandless added a comment - Initial patch.
        Hide
        Shai Erera added a comment -

        I think that this statistic needs to be on FacetResultNode, and not FacetResult, because if someone asks for "topK in each node (depth > 1)", we should compute this statistic on each node at each level. That today is supported only by StandardFA, but still. Initially, when I thought about this issue, I thought that FacetResult.numValidDescendants is the data you needed, but then realized it's not (it's the sum of topK at all levels).

        On FRN, it should be like the other fields, public and not set in the ctor.

        Show
        Shai Erera added a comment - I think that this statistic needs to be on FacetResultNode, and not FacetResult, because if someone asks for "topK in each node (depth > 1)", we should compute this statistic on each node at each level. That today is supported only by StandardFA, but still. Initially, when I thought about this issue, I thought that FacetResult.numValidDescendants is the data you needed, but then realized it's not (it's the sum of topK at all levels). On FRN, it should be like the other fields, public and not set in the ctor.
        Hide
        Michael McCandless added a comment -

        OK I'll move it to FRN instead ... that's a good idea.

        Show
        Michael McCandless added a comment - OK I'll move it to FRN instead ... that's a good idea.
        Hide
        Michael McCandless added a comment -

        New patch, moving the count to FacetResultNode.

        I'm sure it's possible to get this count for the TopKInEach case but I don't understand the code well enough so it's -1 for now.

        Also ... I'm not sure this count really belongs in the FRN: it seems quite wasteful in the common case of single root and lots of children to have this unused int on each child node. Or I guess the issue is that we don't have a separate class for the "leaf" vs for intermediate nodes. I guess the wasted int doesn't usually matter, since you usually don't request too many children, and these classes are transient ...

        Show
        Michael McCandless added a comment - New patch, moving the count to FacetResultNode. I'm sure it's possible to get this count for the TopKInEach case but I don't understand the code well enough so it's -1 for now. Also ... I'm not sure this count really belongs in the FRN: it seems quite wasteful in the common case of single root and lots of children to have this unused int on each child node. Or I guess the issue is that we don't have a separate class for the "leaf" vs for intermediate nodes. I guess the wasted int doesn't usually matter, since you usually don't request too many children, and these classes are transient ...
        Hide
        Shai Erera added a comment -

        I hear you. That's why I first, too, thought it should be on FacetResult, and that we can just use numValidDescendants. I'm torn here too, not sure what to do about TopKInEachNode. I guess that we could resolve this by separating FRN and TreeFRN, while FRN recording ordinal, value and label and TreeFRN recording subResults and uniqueChildrenCount? The problem is how to iterate on the results...

        Hmm, maybe the cut needs to be higher up at FacetResult level, with FacetResult (flat) and HierarchicalFacetResult. FacetResult will give you an Iterator<FRN> and if it's Hierarchical or TreeFacetResult), you get in practice an Iterator<TreeFRN> ... also, FR will record uniqueChildrenCount. For TreeFR this can mean the total unique count, or just at the root level...

        Today, for the common case, we spend a List pointer at each FRN, which is null. Perhaps, as you say, it's usually a non issue since you ask for a low K ...

        One other solution would be to hide the uniqueChildrenCount and subResults behind methods, so that FRN returns -1 and null, while TreeFRN implements them accordingly?

        In general, I feel that FacetResult and FRN are too verbose. I think that the root node's statistics could be available from the FacetResult level, and that it could give you an iterator directly, rather than giving you a node which you always must call root.subResults... but it's for a separate issue.

        Regarding TopKInEachNode, I wonder if we should eliminate that completely from FacetRequest, and create an appropriate FacetResultHander which you configure with the levels that you want, including (potentially, in the future) different K values for different levels?

        Show
        Shai Erera added a comment - I hear you. That's why I first, too, thought it should be on FacetResult, and that we can just use numValidDescendants. I'm torn here too, not sure what to do about TopKInEachNode. I guess that we could resolve this by separating FRN and TreeFRN, while FRN recording ordinal, value and label and TreeFRN recording subResults and uniqueChildrenCount? The problem is how to iterate on the results... Hmm, maybe the cut needs to be higher up at FacetResult level, with FacetResult (flat) and HierarchicalFacetResult. FacetResult will give you an Iterator<FRN> and if it's Hierarchical or TreeFacetResult), you get in practice an Iterator<TreeFRN> ... also, FR will record uniqueChildrenCount. For TreeFR this can mean the total unique count, or just at the root level... Today, for the common case, we spend a List pointer at each FRN, which is null. Perhaps, as you say, it's usually a non issue since you ask for a low K ... One other solution would be to hide the uniqueChildrenCount and subResults behind methods, so that FRN returns -1 and null, while TreeFRN implements them accordingly? In general, I feel that FacetResult and FRN are too verbose. I think that the root node's statistics could be available from the FacetResult level, and that it could give you an iterator directly, rather than giving you a node which you always must call root.subResults... but it's for a separate issue. Regarding TopKInEachNode, I wonder if we should eliminate that completely from FacetRequest, and create an appropriate FacetResultHander which you configure with the levels that you want, including (potentially, in the future) different K values for different levels?
        Hide
        Michael McCandless added a comment -

        In general, I feel that FacetResult and FRN are too verbose. I think that the root node's statistics could be available from the FacetResult level, and that it could give you an iterator directly, rather than giving you a node which you always must call root.subResults... but it's for a separate issue.

        I agree!

        Also, sometimes you do need to pull a large number of facet results (the "See all..." case) ... and I wonder if we should just do parallel arrays, ie String[] labels and int[]/double[] values, for flat results, or only expose an iterator.

        Regarding TopKInEachNode, I wonder if we should eliminate that completely from FacetRequest, and create an appropriate FacetResultHander which you configure with the levels that you want, including (potentially, in the future) different K values for different levels?

        +1

        Show
        Michael McCandless added a comment - In general, I feel that FacetResult and FRN are too verbose. I think that the root node's statistics could be available from the FacetResult level, and that it could give you an iterator directly, rather than giving you a node which you always must call root.subResults... but it's for a separate issue. I agree! Also, sometimes you do need to pull a large number of facet results (the "See all..." case) ... and I wonder if we should just do parallel arrays, ie String[] labels and int[]/double[] values, for flat results, or only expose an iterator. Regarding TopKInEachNode, I wonder if we should eliminate that completely from FacetRequest, and create an appropriate FacetResultHander which you configure with the levels that you want, including (potentially, in the future) different K values for different levels? +1
        Hide
        Shai Erera added a comment -

        Let's worry about the rest in separate issues. For now, I'm fine with adding numValidChildren to FRN. Is the latest patch up to date, or you have updates? I'll try to add that stat to the StandardFA code once you upload the latest patch.

        Show
        Shai Erera added a comment - Let's worry about the rest in separate issues. For now, I'm fine with adding numValidChildren to FRN. Is the latest patch up to date, or you have updates? I'll try to add that stat to the StandardFA code once you upload the latest patch.
        Hide
        Michael McCandless added a comment -

        Latest patch is up-to-date. Thanks Shai.

        Show
        Michael McCandless added a comment - Latest patch is up-to-date. Thanks Shai.
        Hide
        Shai Erera added a comment -

        While looking at the top-k-in-each-node code, I realized that FacetResult.numValidDescendants is what we needed to use all along, and that new FacetsAccumulator code just incorrectly set it to K, rather than total num children considered. I added a test to TestFacetsCollector and fixed the code. So actually this issue turned from a new feature to a bug .

        Mike, thanks for doing the work to add it to FRN, it made the fix very easy - I just pulled it to FacetResult instead. All the tests that you added helped validate the fix.

        I think it's ready.

        Show
        Shai Erera added a comment - While looking at the top-k-in-each-node code, I realized that FacetResult.numValidDescendants is what we needed to use all along, and that new FacetsAccumulator code just incorrectly set it to K, rather than total num children considered. I added a test to TestFacetsCollector and fixed the code. So actually this issue turned from a new feature to a bug . Mike, thanks for doing the work to add it to FRN, it made the fix very easy - I just pulled it to FacetResult instead. All the tests that you added helped validate the fix. I think it's ready.
        Hide
        Michael McCandless added a comment -

        +1, thanks Shai!

        Show
        Michael McCandless added a comment - +1, thanks Shai!
        Hide
        Shai Erera added a comment -

        Committed to trunk and 4x. Thanks Mike!

        Show
        Shai Erera added a comment - Committed to trunk and 4x. Thanks Mike!
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development