Solr
  1. Solr
  2. SOLR-2762

FSTLookup returns one less suggestion than it should when onlyMorePopular=true

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 3.3, 3.4
    • Fix Version/s: 3.5
    • Component/s: spellchecker
    • Labels:
      None

      Description

      I'm using the Suggester. When I switched from TSTLookup to FSTLookup, I noticed that it returned one fewer suggestion than what I asked for. I have spellcheck.onlyMorePopular=true; when I set it to false, I see the correct count. Another aspect of the bug is that this off-by-one bug only seems to occur when my suggestion has an exact match.

      1. SOLR-2762_FSTLookup_off_by_one.patch
        2 kB
        David Smiley
      2. SOLR-2762_FSTLookup_off_by_one.patch
        2 kB
        David Smiley
      3. SOLR-2762.patch
        4 kB
        Dawid Weiss
      4. SOLR-2762.patch
        6 kB
        Dawid Weiss
      5. SOLR-2762.patch
        6 kB
        Dawid Weiss

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Bulk close after 3.5 is released

          Show
          Uwe Schindler added a comment - Bulk close after 3.5 is released
          Hide
          Dawid Weiss added a comment -

          Ugh... David, you're a machine (5:20am). No, I wasn't teasing, I was really done for the day yesterday. I will commit today, thanks for your help again!

          Show
          Dawid Weiss added a comment - Ugh... David, you're a machine (5:20am). No, I wasn't teasing, I was really done for the day yesterday. I will commit today, thanks for your help again!
          Hide
          David Smiley added a comment -

          The patch looks good to me, Dawid. You tease me with asking me to commit it; I can't :-/

          Show
          David Smiley added a comment - The patch looks good to me, Dawid. You tease me with asking me to commit it; I can't :-/
          Hide
          Dawid Weiss added a comment -

          Hope I didn't screw up anything. Commit please, I'm done for the day

          Show
          Dawid Weiss added a comment - Hope I didn't screw up anything. Commit please, I'm done for the day
          Hide
          Dawid Weiss added a comment -

          That TED 4am video was entertaining but weird. Normally I think of TED video as educational / inspiring but that was a bit random.

          It's "Technology Entertainment and Design", there's actually plenty of pure entertainers there (musicians, comedians). That 4am was quite funny to me, so I couldn't resist.

          I will update to remove before insert, thanks. I didn't think of that. All your other remarks hold, but they're super-minor improvements compared to iterating over the fsa or sorting scores (which we will need to do to make the bucketing proportional), so I'd sacrifice them for code clarity this time.

          Show
          Dawid Weiss added a comment - That TED 4am video was entertaining but weird. Normally I think of TED video as educational / inspiring but that was a bit random. It's "Technology Entertainment and Design", there's actually plenty of pure entertainers there (musicians, comedians). That 4am was quite funny to me, so I couldn't resist. I will update to remove before insert, thanks. I didn't think of that. All your other remarks hold, but they're super-minor improvements compared to iterating over the fsa or sorting scores (which we will need to do to make the bucketing proportional), so I'd sacrifice them for code clarity this time.
          Hide
          David Smiley added a comment -

          That TED 4am video was entertaining but weird. Normally I think of TED video as educational / inspiring but that was a bit random.

          I like your rename of "greedy" to "collectAll".

          The code is definitely easier to read with a remove and then an add, even if it internally does more work than my loop did. I actually considered that approach but I know how performance-contious the developers here are so I didn't do that. I hoped ArrayList had some sort of shift method but it doesn't.

          By the way, you forgot to copy the code improvement I put in my patch to your existing logic that I told you about. At your comment: "// Insert as the first result and truncate at num.", you did an add then a remove, when you should do a remove then an add in order to avoid potentially needlessly expanding ArrayList's internal array.

          Show
          David Smiley added a comment - That TED 4am video was entertaining but weird. Normally I think of TED video as educational / inspiring but that was a bit random. I like your rename of "greedy" to "collectAll". The code is definitely easier to read with a remove and then an add, even if it internally does more work than my loop did. I actually considered that approach but I know how performance-contious the developers here are so I didn't do that. I hoped ArrayList had some sort of shift method but it doesn't. By the way, you forgot to copy the code improvement I put in my patch to your existing logic that I told you about. At your comment: "// Insert as the first result and truncate at num.", you did an add then a remove, when you should do a remove then an add in order to avoid potentially needlessly expanding ArrayList's internal array.
          Hide
          Dawid Weiss added a comment -

          Corrected swap() (preserve score order if pushing exact match to the front).

          Show
          Dawid Weiss added a comment - Corrected swap() (preserve score order if pushing exact match to the front).
          Hide
          Dawid Weiss added a comment -

          4am, you say? You should watch this one: http://www.ted.com/talks/rives_on_4_a_m.html

          Anyway, I didn't commit mostly because I'm in and out of office today too. You're right about the ordering – it will change with the swap() that I used. I'm thinking it'll be easier code-wise to shift by removing and then re-adding the matching element. I'll prepare a patch, won't commit.

          Show
          Dawid Weiss added a comment - 4am, you say? You should watch this one: http://www.ted.com/talks/rives_on_4_a_m.html Anyway, I didn't commit mostly because I'm in and out of office today too. You're right about the ordering – it will change with the swap() that I used. I'm thinking it'll be easier code-wise to shift by removing and then re-adding the matching element. I'll prepare a patch, won't commit.
          Hide
          David Smiley added a comment -

          Please don't commit in only a few hours. That is too soon for something being put together now and it's 4am in my time zone – I haven't gone to bed yet.

          I looked at your patch. Your code will only swap the existing match with the first entry, but doesn't that hurt any existing ordering? My code shifted them down to preserve ordering. Hey by the way, you did your own swap code when the JDK already has Collections.swap().

          Show
          David Smiley added a comment - Please don't commit in only a few hours. That is too soon for something being put together now and it's 4am in my time zone – I haven't gone to bed yet. I looked at your patch. Your code will only swap the existing match with the first entry, but doesn't that hurt any existing ordering? My code shifted them down to preserve ordering. Hey by the way, you did your own swap code when the JDK already has Collections.swap().
          Hide
          Dawid Weiss added a comment -

          This is my take at fixing this issue. Took David's suggestion about complexity into account and moved the checks into a separate method. A previously failing test case added and passing. David, feel free to check now, should be fine. I plan to commit in a few hours.

          Show
          Dawid Weiss added a comment - This is my take at fixing this issue. Took David's suggestion about complexity into account and moved the checks into a separate method. A previously failing test case added and passing. David, feel free to check now, should be fine. I plan to commit in a few hours.
          Hide
          Dawid Weiss added a comment -

          I realize when this could happen, indeed – trivial. Thanks David.

          Show
          Dawid Weiss added a comment - I realize when this could happen, indeed – trivial. Thanks David.
          Hide
          Dawid Weiss added a comment -

          The patch is fine and I think I now know the circumstances in which the off by one could happen (or more specifically: when the exact match could be on the list at the point of getting to that snippet). I'll try to write a failing test case and then apply your patch.

          Show
          Dawid Weiss added a comment - The patch is fine and I think I now know the circumstances in which the off by one could happen (or more specifically: when the exact match could be on the list at the point of getting to that snippet). I'll try to write a failing test case and then apply your patch.
          Hide
          David Smiley added a comment -

          Here is a trivial update to my patch. I forgot to add a "break". Without the break the code worked but needlessly checked for alreadyFoundIt after it was already true.

          Show
          David Smiley added a comment - Here is a trivial update to my patch. I forgot to add a "break". Without the break the code worked but needlessly checked for alreadyFoundIt after it was already true.
          Hide
          Dawid Weiss added a comment -

          I will review it later today. The code is based on several assumptions to speed things up (for example it assumes the fst yields sorted output). I will try to write a failing test case first.

          @cyclomatic complexity: the conditions inside loops are meant to early-interrupt the search. That's where the gain is and it'll be hard to get rid of in a clean way.

          @min(10, num): I typically try to avoid overallocating buffers; if somebody requests > 10 matches (unlikely?) then we will gracefully expand the buffers as we go. Also: the point of doing this is so that somebody could request all matches by passing Integer.MAX_VALUE which would otherwise cause an OOM.

          Show
          Dawid Weiss added a comment - I will review it later today. The code is based on several assumptions to speed things up (for example it assumes the fst yields sorted output). I will try to write a failing test case first. @cyclomatic complexity: the conditions inside loops are meant to early-interrupt the search. That's where the gain is and it'll be hard to get rid of in a clean way. @min(10, num): I typically try to avoid overallocating buffers; if somebody requests > 10 matches (unlikely?) then we will gracefully expand the buffers as we go. Also: the point of doing this is so that somebody could request all matches by passing Integer.MAX_VALUE which would otherwise cause an OOM.
          Hide
          David Smiley added a comment -

          The bug is that the "if (exactMatchFirst) {" condition fails to consider the fact that an exact match is already in the results. As I was fooling around with my data, it was usually already there.

          I added code to check to see if the key is already present and if so to shift it to the top. I also made a trivial change to the existing code you wrote that was in this condition that changed the order of shrinking the result array before inserting a new LookupResult into the first position. This will avoid ArrayList needlessly growing its array. As an aside, it's not clear to me why you did a min(10,num) when initializing the ArrayList capacity.

          I have to admit that this code seems overall tricky to test due to all the branch conditions. And this method probably has a high cyclomatic complexity. It would be nice to move the "if (exactMatchFirst) {" branch outside the loops. At a glance it seems doable but then I noticed getExactMatchStartingFromRootArc(i,key) took the loop index 'i' and I'm not sure what the implications are there.

          Show
          David Smiley added a comment - The bug is that the "if (exactMatchFirst) {" condition fails to consider the fact that an exact match is already in the results. As I was fooling around with my data, it was usually already there. I added code to check to see if the key is already present and if so to shift it to the top. I also made a trivial change to the existing code you wrote that was in this condition that changed the order of shrinking the result array before inserting a new LookupResult into the first position. This will avoid ArrayList needlessly growing its array. As an aside, it's not clear to me why you did a min(10,num) when initializing the ArrayList capacity. I have to admit that this code seems overall tricky to test due to all the branch conditions. And this method probably has a high cyclomatic complexity. It would be nice to move the "if (exactMatchFirst) {" branch outside the loops. At a glance it seems doable but then I noticed getExactMatchStartingFromRootArc(i,key) took the loop index 'i' and I'm not sure what the implications are there.
          Hide
          Dawid Weiss added a comment -

          Thanks a lot.

          Show
          Dawid Weiss added a comment - Thanks a lot.
          Hide
          David Smiley added a comment -

          I will try and debug this tonight or within a few days against my data and point out what the problem is.

          Show
          David Smiley added a comment - I will try and debug this tonight or within a few days against my data and point out what the problem is.
          Hide
          Dawid Weiss added a comment -

          The relevant snippet for the scenario you described (exactMatchFirst, onlyMorePopular, input match is exactly represented in the suggestions list) is here:

          
                  if (collect(res, num, weight, output, arc) && greedy) {
                    // We have enough suggestions to return immediately. Keep on looking for an
                    // exact match, if requested.
                    if (exactMatchFirst) {
                      Float exactMatchWeight = getExactMatchStartingFromRootArc(i, key);
                      if (exactMatchWeight != null) {
                        res.add(0, new LookupResult(key, exactMatchWeight));
                        while (res.size() > num) {
                          res.remove(res.size() - 1);
                        }
                      }
                    }
                    break;
                  }
          

          the only place where elements are removed from the suggestion list is in the while loop... but the condition is correct so I've no idea.

          Show
          Dawid Weiss added a comment - The relevant snippet for the scenario you described (exactMatchFirst, onlyMorePopular, input match is exactly represented in the suggestions list) is here: if (collect(res, num, weight, output, arc) && greedy) { // We have enough suggestions to return immediately. Keep on looking for an // exact match, if requested. if (exactMatchFirst) { Float exactMatchWeight = getExactMatchStartingFromRootArc(i, key); if (exactMatchWeight != null ) { res.add(0, new LookupResult(key, exactMatchWeight)); while (res.size() > num) { res.remove(res.size() - 1); } } } break ; } the only place where elements are removed from the suggestion list is in the while loop... but the condition is correct so I've no idea.
          Hide
          Dawid Weiss added a comment -

          I've been looking at this code for a longer while but I don't see how it can be off-by-one... Can you send me the snippet of code/ data you used while testing this?

          Show
          Dawid Weiss added a comment - I've been looking at this code for a longer while but I don't see how it can be off-by-one... Can you send me the snippet of code/ data you used while testing this?
          Hide
          Dawid Weiss added a comment -

          Ok, I'll try to reproduce on my own, thanks.

          Show
          Dawid Weiss added a comment - Ok, I'll try to reproduce on my own, thanks.
          Hide
          Michael McCandless added a comment -

          I thought it best to at least report the problem instead of do nothing.

          +1

          Show
          Michael McCandless added a comment - I thought it best to at least report the problem instead of do nothing. +1
          Hide
          David Smiley added a comment -

          Maybe some day; sorry. I'm trying to push out the 2nd edition of my Solr book and I ran into this when kicking the tires on the Suggester. I thought it best to at least report the problem instead of do nothing.

          Show
          David Smiley added a comment - Maybe some day; sorry. I'm trying to push out the 2nd edition of my Solr book and I ran into this when kicking the tires on the Suggester. I thought it best to at least report the problem instead of do nothing.
          Hide
          Dawid Weiss added a comment -

          David, can you add a test case for this? I'll be happy to fix it, probably something trivial, but a failing test case would be a great start.

          Show
          Dawid Weiss added a comment - David, can you add a test case for this? I'll be happy to fix it, probably something trivial, but a failing test case would be a great start.

            People

            • Assignee:
              Dawid Weiss
              Reporter:
              David Smiley
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development