Lucene - Core
LUCENE-2939

Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: modules/highlighter
    • Labels:
      None

      Description

      Huge documents can be drastically slower to highlight than they need to be because the entire field is added to the MemoryIndex.
      This cost can be greatly reduced in many cases if we try to respect maxDocCharsToAnalyze.

      Things can be improved even further by respecting this setting with CachingTokenFilter as well.
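
      Roughly, the idea is a TokenFilter that stops returning tokens once the accumulated character offsets pass the limit, so neither MemoryIndex nor the caching filter consumes more than the analyzed prefix. A minimal sketch of that idea (an illustration, not the attached patch):

        import java.io.IOException;

        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

        // Sketch of an offset-limiting filter: stop once the consumed offsets pass the limit.
        final class OffsetLimitFilterSketch extends TokenFilter {
          private final OffsetAttribute offsetAttrib = addAttribute(OffsetAttribute.class);
          private final int offsetLimit;
          private int offsetCount;

          OffsetLimitFilterSketch(TokenStream input, int offsetLimit) {
            super(input);
            this.offsetLimit = offsetLimit;
          }

          @Override
          public boolean incrementToken() throws IOException {
            if (offsetCount < offsetLimit && input.incrementToken()) {
              // read the attribute only after input.incrementToken(), never before,
              // so a reused stream cannot leak offsets from the previous document
              offsetCount += offsetAttrib.endOffset() - offsetAttrib.startOffset();
              return true;
            }
            return false;
          }

          @Override
          public void reset() throws IOException {
            super.reset();
            offsetCount = 0; // clear accumulated state so the filter is safe to reuse
          }
        }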

      Attachments

      1. LUCENE-2939.patch
        11 kB
        Mark Miller
      2. LUCENE-2939.patch
        10 kB
        Mark Miller
      3. LUCENE-2939.patch
        9 kB
        Mark Miller
      4. LUCENE-2939.patch
        7 kB
        Mark Miller

          Activity

          Robert Muir added a comment -

          Bulk closing for 3.2

          Mark Miller added a comment -

          Okay - I'm going to commit to trunk shortly.

          Grant Ingersoll added a comment -

          Mark,

          Seems like we can move forward with this now that the release is out. Do you have time or do you want me to take it?

          Mark Miller added a comment -

          Here is a more up-to-date version of the patch for trunk - good for testing the performance difference of this issue.

          Mark Miller added a comment -

          The patch needs more than a simple back port. The patch needs to be fixed.

          And that is simple too if you follow the above comments.

          You should pop the offset calculation into the if statement -

          I'm not convinced it's a problem in this situation (especially for someone wanting to try a patch), because this works one document at a time.

          It's also simple not to break the API, as I mention above.

          I have done all of these things in my own work earlier (and added a test for the new filter) - took about 2 minutes.

          Eventually I will post another trunk patch.

          Doing a solid review and back port of this patch would not take long - it's fairly simple. I won't likely get to it for 3.X for a while though.

          Robert Muir added a comment -

          The patch needs more than a simple back port.

          The patch needs to be fixed. It has tokenstream reuse bugs that cause offsets from last token of the previous document to be applied to the calculations of the next document, because it reads dirty attributes.

          It's not just the release manager being an asshole here; there are technical problems that need to be fixed.

          Mark Miller added a comment -

          Sorry if I missed it in this thread, which branch was this patch made against? It doesn't apply cleanly against branch_3x.

          This patch is against trunk - still needs a fairly simple back port to 3x.

          Trey Hyde added a comment -

          Sorry if I missed it in this thread, which branch was this patch made against? It doesn't apply cleanly against branch_3x.

          Mark Miller added a comment -

          I will also note, there was never any strong argument to include this issue.

          There was never any danger of this needing to be strong-armed out of 3.1.

          I've already said I wouldn't do it - and Grant had volunteered, but never argued for it either.

          Mark Miller added a comment -

          Moving
          an issue out, stating that an issue won't make the RC,
          is fair game. These are the normal "tools" of the RM...

          Not in my experience.

          Regardless, I'm not sure this is the same as changing a JIRA issue right after someone else changes it with no discussion and apparent lack of understanding of the issue. That's a statement if you ask me. If you look at how the culture of Lucene has worked, this is unusual - and I'll push to make it remain so.

          Putting pressure is precisely what Robert has done here (and on SOLR-2390)?

          SOLR-2390 was this issue - this patch spans Lucene and Solr and covers both of them. This issue was still marked 3.1 and we were discussing it when this happened with SOLR-2390 - this is how I know little thought went into shoving it out - SOLR-2390 should be in lock step with this issue. It's not a new or another issue. It's just where I am tracking this from a Solr user bug perspective - it's easier to have one patch.

          My point is, we all must expect/allow/not-get-upset-about this
          "pressure" from the RM – it comes with the territory, and it's the
          RM's right to be very aggressive in order to get the release done.

          Depends - I've seen a lot of RMs before - and I have been one. Personally I've never seen things done this way. Nor do I think it was necessary. We would have come to the same conclusion in either case. The history and culture of Lucene has been to not be forceful in JIRA - that's something I'll argue to maintain.

          We gotta remove the barriers to doing releases around here, not add to
          them.

          Release at all costs is just not an excuse IMO. We release as often as someone is willing to put in the somewhat massive effort really.

          In fact, we should scrutinize our scary ReleaseTodo and pare it back to the bare minimum... it's gotta become a push button process.

          I think we all agree with that. I'm still not on board with the scary RM theory.

          Michael McCandless added a comment -

          Putting pressure is fine, but just because being an RM is hard is not a King for a day pass IMO

          Putting pressure is precisely what Robert has done here (and on
          SOLR-2390)?

          He's acting just like an RM should act, as far as I can tell. Moving
          an issue out, stating that an issue won't make the RC,
          is fair game. These are the normal "tools" of the RM...

          My point is, we all must expect/allow/not-get-upset-about this
          "pressure" from the RM – it comes with the territory, and it's the
          RM's right to be very aggressive in order to get the release done.

          Else releases will not get done, and we'll all keep one-more-thing'ing
          the release, and that's bad for all of us.

          We gotta remove the barriers to doing releases around here, not add to
          them. In fact, we should scrutinize our scary ReleaseTodo and pare it
          back to the bare minimum... it's gotta become a push button process.

          Mark Miller added a comment -

          Worse, this whole situation (people getting angry at the RM for doing
          precisely what the RM is supposed to do) is a disincentive for
          future RMs to volunteer doing releases, thus causing even less
          frequent releases. It's already hard enough for us to get a release
          out as it is.

          I'm not sure I agree. Declaring that this will not get in is beyond the scope of an RM in my opinion. Putting pressure is fine, but just because being an RM is hard is not a King for a day pass IMO. It's up to the RM to build the release candidate from whatever issues he wants - does that mean he needs to manhandle JIRA?

          The RM is supposed to be an asshole (not that Robert has acted like
          one, here, imho).

          I don't think he is? Too strong a word. I didn't even fully use it to refer to the two actions I was commenting on. First, I said acting like a bit of an asshole, and I separated it a distance as theoretical. It was weak-sauce commentary on the two actions I've pointed out:

          1. Just reverting a change another respected committer made immediately and without discussion - without too much investigation - because that issue was subsumed by another issue that had already been moved for 3.1 consideration and we were still discussing.

          2. Declaring that this will not make it in over ongoing discussion.

          S/he has full authority to draw the line, crack the
          whip, do whatever it takes to get the release out.

          Do whatever it takes? Come on...

          We all cannot question that, unless we want to step up and be the RM because it is NOT an easy job.

          I do question it. I have stepped up to be the RM, I know it's not an easy job, and I'll volunteer to do it again sometime.

          Robert is great at it - in general he has all my support in the world. I certainly understand the difficulties he's facing trying to get this release out and he has my sympathies - and I wish he had more of my help. But that doesn't change my reaction to his actions. I feel the same way and I have the same responses to it. Meanwhile, I still like and respect Robert.

          Michael McCandless added a comment -

          In order to release more often we have to stop this cycle of shoving things in at the last minute

          +1

          Dev is a constant thing around here, and if we keep holding back a release
          for the one-more-issue to get in, we will never release.

          Our lack-of-release reflects badly on Lucene/Solr – the outside world
          uses this as the proxy for our health and we know we get bad marks.

          Worse, this whole situation (people getting angry at the RM for doing
          precisely what the RM is supposed to do) is a disincentive for
          future RMs to volunteer doing releases, thus causing even less
          frequent releases. It's already hard enough for us to get a release
          out as it is.

          The RM is supposed to be an asshole (not that Robert has acted like
          one, here, imho). S/he has full authority to draw the line, crack the
          whip, do whatever it takes to get the release out. We all cannot
          question that, unless we want to step up and be the RM because it is
          NOT an easy job.

          I think this issue should wait for 3.2.

          Mark Miller added a comment -

          we should not have shoved this in at the last minute,

          We didn't? Marking something as 3.1 is the best way to get it considered for last minute inclusion, blocker or not. It certainly doesn't mean it's not going to be pushed back out after discussion.

          In any case, if you are not for it, that decides it - I'm not willing to do the work right now.

          So, here are my questions:

          1. I don't remember - 2.9 probably.
          2. Because it's a completely different approach.

          It's been around for a while. I saw one guy that stayed on Solr 1.3 over 1.4 because of it. Most people will try fast vector and say oh nice, it's fast - but it doesn't highlight wildcard queries or these queries, etc. They either accept one bug over the other, or stick with an older version. Honestly, if that continues for another release, it's no skin off my nose. But neither are most bugs.

          Robert Muir added a comment -

          I think 3.2 is a good tradeoff, unless we introduced this slowdown in 3.1 (my earlier question).

          If we are introducing this slowdown in the 3.1 release, then I think its much more serious, and I would instead suggest we set the issue to blocker.

          Regardless, I think there are some technical steps that can be taken to ease my mind about the patch, for example the TokenFilter here can be tested independently with BaseTokenStreamTestCase (this is good at catching reuse bugs like the one I hinted at).

          Grant Ingersoll added a comment -

          I think Robert's right, we should not have shoved this in at the last minute, even though it is a pretty big issue for those doing highlighting of larger documents. I'd say we just mark it as 3.1.1 or 3.2.

          Robert Muir added a comment -

          I mind your attitude. Changing the issue target 2 seconds after Grant with no discussion. Declaring on your own that it won't get in. Not trying to get to a real conversation about the issue (which you clearly don't fully understand if you think storing term vectors will help). These things are my issue, not any so called push back.

          It's not an attitude, and it's not personal. It's trying to stop last-minute stuff from being shoved into the release right before the RC, especially if it's not fully-formed patches ready to be committed.

          Well man, you need us on your team too. A performance bug is a technically valid reason for a -1 on a release. I'm not threatening that - but I'm pointing out that everyone needs to be on board - not just the RM. Taking the time for fair discussion is not a waste of time.

          I totally agree with you here. But some people might say, if the bug has been around since say 2.4 or 2.9, that it's not critical that it be fixed in 3.1 at the last minute, and still +1 the release.

          As i stated earlier on this issue, I'm sympathetic to performance bugs: performance bugs are bugs too. But we need to evaluate risk-reward here.

          Just don't forget that there are other performance problems with large documents in lucene (some have been around a while) and we aren't trying to shove any last minute fixes for those in.

          So, here are my questions:

          1. What version of Lucene was this performance bug introduced in? Is it something we introduced in version 3.1? If this is the case it's more serious than if it's something that's been around since 2.9.
          2. Why is fast-vector highlighter with TVs "ok", but highlighter with TVs slow?
          Mark Miller added a comment -

          What you don't seem to get is that I don't mind if you push back. I don't mind your position.

          I mind your attitude. Changing the issue target 2 seconds after Grant with no discussion. Declaring on your own that it won't get in. Not trying to get to a real conversation about the issue (which you clearly don't fully understand if you think storing term vectors will help). These things are my issue, not any so called push back.

          Well man, you need us on your team too. A performance bug is a technically valid reason for a -1 on a release. I'm not threatening that - but I'm pointing out that everyone needs to be on board - not just the RM. Taking the time for fair discussion is not a waste of time.

          Robert Muir added a comment -

          Not ramrodded by someone being a bit of an asshole.

          These things can often be discussed by more than a single person.

          Yes, anyone can produce a release candidate of Lucene. But if it's going to be me doing it, I've already set aside time (and coordinated with others) to make RC builds. So I'm going to push back on shoving in last minute changes.

          Well you can call it that, or someone trying to be a release manager that will actually get out a release in the next year.

          Bottom line: if you feel this change is really important, I respect your decision on that. But you should set the issue to blocker and be aware that the tradeoff likely means delaying the RC for a few weeks (unless someone else steps up to volunteer to produce an RC, which is fine!)

          Mark Miller added a comment -

          In order to release more often we have to stop this cycle of shoving things in at the last minute.

          As always in Lucene land, these things should be taken case by case depending on the facts - the severity of the bug and its effect on the release. These things can often be discussed by more than a single person.

          Not ramrodded by someone being a bit of an asshole.

          Robert Muir added a comment -

          I'll leave it up to you guys...err, I mean robert, to decide what to do here.

          Sorry you feel this way... everyone says they want faster releases but doesn't want to take the appropriate steps to move towards a model that supports that.

          In order to release more often we have to stop this cycle of shoving things in at the last minute.

          Mark Miller added a comment -

          Anyhow, all I have time for on this today.

          I'll leave it up to you guys...err, I mean robert, to decide what to do here.

          Mark Miller added a comment - - edited

          Right, but you can use term vectors with this highlighter too right?
          This issue only seems to refer to the case where you have no term vectors and analyze the text at runtime...

          Nope, that's not true. If you turn on term vectors, that does NOT solve this bug.

          Mark Miller added a comment -

          Grant: in terms of the back compat issue - I'm not really worried about it myself since this is contrib and we have changed these interfaces before with no complaint -

          but another temporary option is to special-case and do an instanceof check on the Scorer - and if it's our QueryScorer, cast and set the max chars to analyze.

          It's not as pretty, but it avoids the method sig change.
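
          For illustration only, that special case could look like the sketch below; the setter name (setMaxDocCharsToAnalyze on QueryScorer) is an assumption, not necessarily the committed API.

            import org.apache.lucene.search.highlight.QueryScorer;
            import org.apache.lucene.search.highlight.Scorer;

            final class ScorerConfigSketch {
              static void configure(Scorer scorer, int maxCharsToAnalyze) {
                // only our QueryScorer learns about the limit; other Scorer impls keep the old signature
                if (scorer instanceof QueryScorer) {
                  ((QueryScorer) scorer).setMaxDocCharsToAnalyze(maxCharsToAnalyze);
                }
              }
            }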

          Robert Muir added a comment -

          Depends on what you mean by options - FastVectorHighlighter cannot highlight half our queries (multi-term last I knew, or Span) - trade one bug for another.

          Right, but you can use term vectors with this highlighter too right?
          This issue only seems to refer to the case where you have no term vectors and analyze the text at runtime...

          I don't think it's too much to say 'index your content according to what you are going to need'

          Grant Ingersoll added a comment -

          I'm OK either way, but it does seem like a pretty big performance bug.

          Mark Miller added a comment -

          Given the back compat breaks in the API, are we sure we should try to shove this into 3.1?

          I won't do the work, so whatever from my perspective...

          the user is hardly left without options.

          Depends on what you mean by options - FastVectorHighlighter cannot highlight half our queries (multi-term last I knew, or Span) - trade one bug for another.

          Robert Muir added a comment -

          Given the back compat breaks in the API, are we sure we should try to shove this into 3.1?

          I am sympathetic to performance bugs, BUT it seems that one could use TermVectors and FastVectorHighlighter for these large documents; the user is hardly left without options.

          As a safer alternative we can document the issue in CHANGES.txt, recommend that users take that approach for large documents, and take our time and fix it for 3.2.

          Robert Muir added a comment -

          I see what you mean now - though I still don't understand your previous comment.
          I assume that it's just defaulting to 0 - 0 now?

          Only the first time.

          But imagine you try to reuse this tokenstream (maybe its not being reused now, but in the future)... the values for the last token of the previous doc are say 10 - 5... the consumer calls reset(Reader) with new document and reset(), which clears your accumulator, but this attribute is still 10 - 5 until input.incrementToken()... only then does the tokenizer update the values.
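
          As an illustration of the ordering Robert describes (hypothetical names, and the 3.x-era Tokenizer.reset(Reader) API he mentions), a sketch of the consumer-side reuse sequence:

            import java.io.IOException;
            import java.io.Reader;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.Tokenizer;

            final class ReuseOrderSketch {
              static void consume(Tokenizer tokenizer, TokenStream limitFilter, Reader nextDoc) throws IOException {
                tokenizer.reset(nextDoc); // point the shared tokenizer at the next document
                limitFilter.reset();      // clears the filter's own counter...
                // ...but the OffsetAttribute still holds the last token of the PREVIOUS document
                // until input.incrementToken() runs, so reading it before that call charges
                // stale offsets against the new document's budget.
                while (limitFilter.incrementToken()) {
                  // consume tokens as usual
                }
                limitFilter.end();
              }
            }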

          Mark Miller added a comment -

          I can backport if you want.

          +1

          I don't think this is good practice to work with the uninitialized values.

          I see what you mean now - though I still don't understand your previous comment.
          I assume that it's just defaulting to 0 - 0 now?

          Yeah, that could be changed.

          Mark Miller added a comment -

          This includes the change to the test to make it compile.

          Still no Changes entry.

          The compile change to the test is a back compat break. The Scorer needs to know the maxCharsToAnalyze setting.

          Have not had time to consider further yet.

          Robert Muir added a comment -

          Exactly, so what are the attribute's values before calling input.incrementToken()?

          I don't think this is good practice to work with the uninitialized values.

          Mark Miller added a comment -

          I think the offsetLength calculation needs to be inside the incrementToken?

          I do not follow ... incrementToken is:

            + @Override
            + public boolean incrementToken() throws IOException {
            +   int offsetLength = offsetAttrib.endOffset() - offsetAttrib.startOffset();
            +   if (offsetCount < offsetLimit && input.incrementToken()) {
            +     offsetCount += offsetLength;
            +     return true;
            +   }
            +   return false;
            + }

          Grant Ingersoll added a comment -

          I can backport if you want.

          Robert Muir added a comment -

          I think the offsetLength calculation needs to be inside the incrementToken?

          Honestly, if I was not so busy, I'd say we should really get this in for 3.1.

          yeah, performance bugs are bugs too.

          Mark Miller added a comment -

          P.S. One that is really a bad bug in my mind - we switched this to be the default and the old Highlighter did not suffer like this in these situations.

          Looking back over the email archives, it bit more than a few people. I'm pretty sure this bug was the impetus of the Fast Vector Highlighter (which is still valuable if you really do want to highlight over every token in your 3 billion word PDF file ).

          You pay this huge perf penalty for no gain and no reason. If you are talking Wikipedia-size docs, it won't affect you - but for long documents, doing 10 snippets can be prohibitive, with no workaround. That is not a friendly neighborhood highlighter.

          Mark Miller added a comment -

          Honestly, if I was not so busy, I'd say we should really get this in for 3.1.

          If you are doing something like desktop search, this can be a really cruel highlighter perf problem.

          Mark Miller added a comment -

          My last patch is missing a couple of required test compile changes - I excluded that class because I had some test code in it.

          I'll put up a new patch as soon as I get a chance with the test class changes (Scorer init method gets a new param and there are a couple anonymous impls in test)

          Mark Miller added a comment -

          The other problem was that CachingTokenFilter was exhausting the entire stream eagerly - which could be a spin through a very large TokenStream - uselessly if a user has set the maxDocCharOffset setting.

          This and adding the whole stream to the MemoryIndex was a very large performance bug in the span highlighter for some time now.

          In my test case, using Solr's DEFAULT_MAX_CHARS_TO_ANALYZE = 50*1024, highlighting 10 very large PDF docs I have dropped from 20 some seconds to 300ms.

          New patch with some fixes and cleanup. I don't see the above error with a more correct TokenFilter impl.
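
          For context, a user-side sketch of capping analysis cost with the existing Highlighter.setMaxDocCharsToAnalyze setter; 50 * 1024 mirrors the Solr default Mark mentions, and 10 snippets match his test.

            import org.apache.lucene.analysis.Analyzer;
            import org.apache.lucene.search.Query;
            import org.apache.lucene.search.highlight.Highlighter;
            import org.apache.lucene.search.highlight.QueryScorer;
            import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

            final class HighlightUsageSketch {
              static String[] snippets(Query query, Analyzer analyzer, String field, String text) throws Exception {
                Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query));
                highlighter.setMaxDocCharsToAnalyze(50 * 1024); // analyze at most the first 50 KB of the field
                return highlighter.getBestFragments(analyzer, field, text, 10); // 10 snippets
              }
            }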

          Mark Miller added a comment -

          Fair enough (this setting is well before my time AFAIK) - but not my intent with this issue - which is just to fix this little perf bug.

          Robert Muir added a comment -

          Maybe the setting is already there, but I think we should remove it: I don't think its the best measure.

          We can instead replace it with a max # tokens setting, which is more intuitive, easier to implement,
          and consistent with how other things are limited (e.g. the old IW setting and the new
          LimitTokenCountFilter).

          Mark Miller added a comment -

          Because this is already a setting on the Highlighter that appears to work by offset?

          Robert Muir added a comment -

          i don't know why you get this null pointer exception (maybe you triggered a bug), but...

          just a quick glance:

          1. why use offsets for this calculation? This seems a bit dangerous versus other approaches.
          2. either way, the reset() method should clear any state such as counters in the tokenstream.

          As far as what i meant above... the whole maxDocCharsToAnalyze seems like the wrong measure.
          Why not specify this just as max tokens, and use LimitTokenCountAnalyzer, which is already implemented.

          Using arbitrary chars and offsets is going to create fake tokens (e.g. truncated words) and other problems.
          Besides, it's not Unicode-safe since a codepoint might span multiple chars.
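
          For comparison, Robert's token-count alternative would look roughly like the sketch below (the package of LimitTokenCountAnalyzer has moved between Lucene versions; shown here as in trunk).

            import org.apache.lucene.analysis.Analyzer;
            import org.apache.lucene.analysis.miscellaneous.LimitTokenCountAnalyzer;

            final class TokenLimitSketch {
              // Wrap the field's normal analyzer so analysis stops after maxTokens tokens,
              // the same mechanism that replaced IndexWriter's old maxFieldLength setting.
              static Analyzer limit(Analyzer baseAnalyzer, int maxTokens) {
                return new LimitTokenCountAnalyzer(baseAnalyzer, maxTokens);
              }
            }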

          Mark Miller added a comment -

          only seems to happen when maxDocCharsToAnalyze is absurdly low - like 5

          Mark Miller added a comment -

          hmmm...didn't quite get it right yet I think...

          q=java man news th*

          java.lang.NullPointerException
          	at org.apache.lucene.util.CharacterUtils$Java5CharacterUtils.fill(CharacterUtils.java:181)
          	at org.apache.lucene.analysis.CharTokenizer.incrementToken(CharTokenizer.java:150)
          	at org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter.incrementToken(WordDelimiterFilter.java:224)
          	at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
          	at org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter.incrementToken(ASCIIFoldingFilter.java:71)
          	at com.ACME.analysis.ACMEPluralStemFilter.incrementToken(ACMEPluralStemFilter.java:56)
          	at org.apache.solr.highlight.TokenOrderingFilter.incrementToken(DefaultSolrHighlighter.java:575)
          	at org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:78)
          	at org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:50)
          	at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:220)
          
          
          Mark Miller added a comment -

          I'm a little rusty on the new TokenStream API, but here is a little test patch I popped out real quick.


            People

            • Assignee:
              Mark Miller
            • Reporter:
              Mark Miller
            • Votes:
              0
            • Watchers:
              3
