Lucene - Core: LUCENE-4317

Field.java does not reuse its inlined Keyword-TokenStream

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-BETA
    • Fix Version/s: 4.0, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Field.java contains an inlined Keyword-TokenStream. Unfortunately, this stream is recreated every time, even when the same Field instance is reused. Field.java reuses its NumericTokenStream, but not the Keyword one.

      We should apply the same logic: lazily initialize the TokenStream, with a setter for the String value plus reset(). This would look identical to setNumeric(xx).
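The reuse pattern being proposed can be sketched in plain Java. This is a dependency-free illustration of the idea, not Lucene's actual API; all class and method names here (FieldSketch, ReusableKeywordStream, setValue, nextToken) are hypothetical:

```java
// Illustrative sketch of the lazy-init/reuse pattern (not Lucene's real API):
// the stream object is created once per field and refilled via a setter.
final class ReusableKeywordStream {
    private String value;
    private boolean used;

    // Called by the owning field instead of constructing a new stream.
    void setValue(String value) {
        this.value = value;
    }

    void reset() {
        used = false;
    }

    // Emits the whole value as a single token, like a keyword stream.
    String nextToken() {
        if (used || value == null) return null;
        used = true;
        return value;
    }
}

final class FieldSketch {
    private ReusableKeywordStream stream; // lazily initialized, then reused

    ReusableKeywordStream tokenStream(String stringValue) {
        if (stream == null) {
            stream = new ReusableKeywordStream(); // heavy setup happens only once
        }
        stream.setValue(stringValue);
        stream.reset();
        return stream;
    }
}
```

Repeated calls to tokenStream() hand back the same object, which is exactly what the issue asks for: the expensive construction happens once, and per-document work is just a setter plus reset().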

      1. LUCENE-4317.patch
        7 kB
        Uwe Schindler
      2. LUCENE-4317-2.patch
        5 kB
        Uwe Schindler
      3. LUCENE-4317-2.patch
        3 kB
        Uwe Schindler


          Activity

          Uwe Schindler added a comment -

          Patch:

          • Streamlines the handling of NumericTokenStream and a new internal StringTokenStream. The previous code was not easy to read.
          • Reuses both (not across Field instances, of course). This improves the performance of reused Field instances enormously, as creating a new TokenStream for each small String is heavy (2 LinkedHashMaps, addition of attributes, ...).

          We should maybe think about a way to "cache" the instances across several Field instances (like IndexWriter did in 3.x). I was thinking about a singleton Analyzer...

          For now this patch helps a lot.

          Uwe Schindler added a comment -

          This patch also fixes StringTokenStream to correctly clearAttributes()! It also adds end().

          Chris Male added a comment -

          Thanks for tackling this, Uwe.

          We should maybe think about a solution how to "cache" the instances across several instances (like IndexWriter did in 3.x)

          Do you mean also for NumericTokenStream? or just StringTokenStream?

          Uwe Schindler added a comment -

          Do you mean also for NumericTokenStream? or just StringTokenStream?

          Both, of course. A plain Analyzer will not help, as it expects a Reader, and creating a StringReader for every String is also a bad idea.

          I was thinking about some Analyzer-like CloseableThreadLocal, but the problem is we can never close it. IndexWriter previously managed to close the ThreadLocal (for its StringTokenStream-like internal AS), but that's no longer possible here.
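The thread-local caching idea can be sketched in plain Java (StreamCache is a hypothetical name; as the comment notes, the open problem is that nothing ever gets to close or clear such a ThreadLocal):

```java
import java.util.function.Supplier;

// Sketch of per-thread stream caching (hypothetical; plain Java).
// Each thread gets its own reusable instance, so no synchronization is needed,
// but note the lifecycle problem discussed above: there is no good place to
// clear this ThreadLocal, so cached instances live as long as their thread.
final class StreamCache<T> {
    private final ThreadLocal<T> cache = new ThreadLocal<>();
    private final Supplier<T> factory;

    StreamCache(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        T stream = cache.get();
        if (stream == null) {
            stream = factory.get(); // expensive creation happens once per thread
            cache.set(stream);
        }
        return stream;
    }
}
```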

          Robert Muir added a comment -

          We should maybe think about a solution how to "cache" the instances across several instances (like IndexWriter did in 3.x). I was thinking about a singleton Analyzer...

          This is really simple: remove StringField, use KeywordAnalyzer if you want that.

          Nobody gets confused by the StringField^H^H^H^NOT_ANALYZED trap anymore (since passing the same Analyzer used at indexing time to QueryParser nets the same results), there are no longer two different "incomplete" declarations of the analysis chain (1. Field, 2. Analyzer) to manage, just the Analyzer, and you get reuse/closing/etc.

          Chris Male added a comment -

          Yeah, I agree with Robert: using KeywordAnalyzer instead of having StringField would be more consistent, reduce a lot of confusion, and give us reuse.

          Uwe Schindler added a comment -

          Do we agree to fix the current problem and think about the StringField removal later? I disagree here, but that's a separate problem. In addition, the "Robert solution" still does not solve the numeric problem of recreating NumericTokenStreams over and over.

          Is it ok to commit the attached patch and discuss afterwards?

          Robert Muir added a comment -

          In addition, the "Robert solution" still does not solve the numeric problem of recreating NumericTokenStreams over an over.

          It could: the problem here is the same, that it uses a special TokenStream instead of being a normal one integrated into the Analyzer.
          If it was fixed in the same way, it could also have better QP support out of the box (e.g. numeric queries working by default, or whatever).

          Uwe Schindler added a comment -

          The problem with numerics is that they are not strings, so it's different. That is why NumericTokenStream is a plain TokenStream and not a Tokenizer.

          But that does not answer my question in my last comment.

          Chris Male added a comment -

          I'm +1 to committing this as a first-step improvement, but we should definitely pursue these other suggestions as well.

          Robert Muir added a comment -

          The problem with numerics is that they are not strings, so it's different. That is why NumericTokenStream is a plain TokenStream and not a Tokenizer.

          But there is one at query time. It could also be a Tokenizer, with its setInt() etc. methods for easy use at index time... just ideas.

          Uwe Schindler added a comment -

          To come back to Robert: we could of course remove the whole FieldTypes stuff altogether and only do this in the Analyzer (which then somehow becomes a schema replacement). But then Analyzer.tokenStream() should be able to take other data types than Reader. If that were the case I would +1 all your suggestions; it would also solve the stupid wrapping of Strings with StringReader (which the removal of StringField would require).

          Btw.: As Solr works solely on Strings, Solr can include the numerics in its analyzers (and they do), but this adds "locale specific" String-to-numeric translation to the Tokenizer (which Solr adds). I don't want this, as the data sources for numeric fields are in most cases not strings.

          Uwe Schindler added a comment -

          But there is one at query time. It could also be a Tokenizer, with its setInt() etc. methods for easy use at index time... just ideas.

          I don't care about QueryParser, because when you have indexed a numeric field you should create the Query on your own and not use this query parser, which corrupts everything because of its syntax and whitespace handling.

          Robert Muir added a comment -

          To come back to Robert: we could of course remove the whole FieldTypes stuff altogether and only do this in the Analyzer (which then somehow becomes a schema replacement). But then Analyzer.tokenStream() should be able to take other data types than Reader. If that were the case I would +1 all your suggestions; it would also solve the stupid wrapping of Strings with StringReader (which the removal of StringField would require).

          Well, it's sorta always been this way: we tell people to check and make sure they use the same analysis at index and query time. But in the current situation, if they use things like StringField with QueryParser, they follow our best practices and get bogus or no results. That's why I get frustrated (this NOT_ANALYZED confusion has come up on the ML many times).

          By the way, as far as taking other data types than Reader, I have no idea. I certainly feel like if we want to take String we could just apply StringReader ourselves by default (someone could override). The problem with taking something other than Reader would be the existence of CharFilters in the chain (as these work only on Readers).

          I don't care about QueryParser, because when you have indexed a numeric field you should create the Query on your own and not use this query parser, which corrupts everything because of its syntax and whitespace handling.

          But you added support for this to the flexible QP last summer as part of GSoC, right (LUCENE-1768)? So it's possible with Lucene's syntax? I'm confused, I guess.

          Robert Muir added a comment -

          By the way, of course I am +1 to committing the patch on this issue. If someone is going through the trouble of reusing Fields, we should not be making mounds of TokenStreams.

          Michael McCandless added a comment -

          +1 to the current patch and separately debating how to handle the long-standing NOT_ANALYZED trap.

          I do think StringField serves an important purpose though (it sets DOCS_ONLY and turns off norms), in addition to not analyzing. But I don't like that it has its own little analyzer/tokenstream inside: it would be much better to simply use KeywordAnalyzer.

          Or (crazy idea): maybe we could simply call on the analyzer (like we do for normal tokenized fields), but then insist that what was returned is in fact from KeywordAnalyzer? This would force users of StringField to use PFAW with this field mapping to KeywordAnalyzer. It's rather... anal though. And it will be confusing to users who "forget" to use PFAW (but then this is a service to them: it points out that their query-time analysis is wrong). Advanced users are free to use Field directly if this checking somehow becomes a problem...

          Uwe Schindler added a comment -

          Mike: Or (crazy idea): maybe we could simply call on the analyzer (like we do for normal tokenized fields), but then insist what was returned is in fact from KeywordAnalyzer? This would force users of StringField to use PFAW w/ this field mapping to KeywordAnalyzer. It's rather... anal though. And will be confusing to users who "forget" to use PFAW (but then this is a service to them: it points out that at query-time their analysis is wrong). Advanced users are free to use Field directly if somehow this checking becomes a problem ...

          I disagree with that; it is too much magic, and again it does not help with numerics.

          Robert: I certainly feel like if we want to take String we could just apply StringReader ourselves by default (someone could override).

          +1. Just another idea: the Analyzer can internally also reuse a ReusableStringReader like IndexWriter did in 3.x.

          Let's please discuss all this in a separate issue, maybe for 5.0. It is now too late for that (in my opinion).

          Uwe Schindler added a comment -

          I wanted to add: the StringTokenStream in Field.java has one good property: it does not copy too often. The String is used directly to populate the CharTermAttribute. With KeywordAnalyzer you do crazy per-char loops for nothing. In general we should maybe also add new KeywordTokenizer(String) and setString(String) for easy use.
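The copy-cost point can be illustrated in plain Java: populating a term buffer directly from the String is one bulk copy, while the Reader-based path drains every char through an extra loop (the demo class and method names here are hypothetical):

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative comparison of two ways a term buffer can be filled.
final class TermBufferDemo {
    // Direct path (what Field's internal StringTokenStream can do):
    // one bulk copy from the String into the buffer.
    static char[] fillDirect(String value) {
        char[] buf = new char[value.length()];
        value.getChars(0, value.length(), buf, 0); // single arraycopy under the hood
        return buf;
    }

    // Reader-based path (KeywordAnalyzer style): the String is first wrapped
    // in a Reader, then drained char by char into the buffer.
    static char[] fillViaReader(String value) {
        try {
            Reader r = new StringReader(value); // extra object + synchronization
            char[] buf = new char[value.length()];
            int n = 0, c;
            while ((c = r.read()) != -1) {
                buf[n++] = (char) c; // the per-char loop the comment complains about
            }
            return buf;
        } catch (IOException e) {
            throw new UncheckedIOException(e); // StringReader only throws when closed
        }
    }
}
```

Both paths produce the same characters; the direct one just avoids the intermediate Reader object and the per-char traffic.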

          Uwe Schindler added a comment -

          Committed trunk revision: 1375507
          Committed 4.x revision: 1375508

          If you have ideas how to solve the "internal TokenStream" issue, open a new issue and discuss it there. I don't have much interest, as I am happy with the current design. This issue was just meant to improve the current situation, not change it completely.

          Uwe Schindler added a comment -

          I am reopening this one to fix a second reuse problem:

          • It does not reuse the StringReader like Lucene 3.x did. If you look at the source code of StringReader in the JDK, you will be scared by all the synchronization and initialization cost. In Lucene 3.x we had ReusableStringReader; I will revive that in Field.java.
          • Also, the AttributeSource keeps a hard reference to the String. It should free the String on close(). The ReusableStringReader above does that.
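For illustration, here is a self-contained sketch of what such a reusable reader could look like (this is not Lucene's actual ReusableStringReader, and the class name is hypothetical): no synchronization, a setter instead of a constructor, and the String reference dropped on close():

```java
import java.io.Reader;

// Sketch of a reusable, unsynchronized String reader (illustrative only;
// Lucene's actual ReusableStringReader differs in detail).
final class ReusableStringReaderSketch extends Reader {
    private String s;
    private int pos, size;

    // Refill instead of allocating a new Reader per field value.
    void setValue(String s) {
        this.s = s;
        this.size = s.length();
        this.pos = 0;
    }

    @Override
    public int read() { // single-char read without a new char[1] per call
        if (pos < size) return s.charAt(pos++);
        return -1;
    }

    @Override
    public int read(char[] c, int off, int len) {
        if (pos >= size) return -1;
        len = Math.min(len, size - pos);
        s.getChars(pos, pos + len, c, off); // bulk copy, no per-char loop
        pos += len;
        return len;
    }

    @Override
    public void close() {
        pos = size; // mark exhausted
        s = null;   // drop the hard reference so the String can be collected
    }
}
```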
          Uwe Schindler added a comment -

          Patch. It refactors ReusableStringReader a bit and also adds support for read(), because that's heavy otherwise (new char[1] on every call). It also removes the useless read(char[]) override, as that is just delegation in Reader.java.
          I will create a small test to be 100% sure that it's correct, and then commit.

          Robert Muir added a comment -

          Looks good, but I think we only need to set s to null in close(), since Tokenizer.java closes the reader anyway.

          Uwe Schindler added a comment -

          Patch with the test added. Setting to null before or after close does not matter, as it will never NPE.

          I will commit soon!

          Uwe Schindler added a comment -

          Committed trunk revision 1376261, 4.x revision 1376265

          Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
             • Votes: 0
             • Watchers: 3
