Lucene - Core

LUCENE-6874: WhitespaceTokenizer should tokenize on NBSP

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      WhitespaceTokenizer uses Character.isWhitespace to decide what is whitespace. Here's a pertinent excerpt from its Javadoc:

      It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')

      Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?

      I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to work around but why leave this trap in by default?
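      For illustration, a throwaway snippet showing the behavior in question (NBSP is U+00A0):

        // Character.isWhitespace() deliberately excludes the non-breaking spaces,
        // so WhitespaceTokenizer keeps "foo\u00A0bar" as a single token.
        System.out.println(Character.isWhitespace(' '));       // true
        System.out.println(Character.isWhitespace('\u00A0'));  // false (NBSP)
        System.out.println(Character.isWhitespace('\u2007'));  // false (figure space)
        System.out.println(Character.isWhitespace('\u202F'));  // false (narrow NBSP)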

      1. icu-datasucker.patch
        2 kB
        Uwe Schindler
      2. LUCENE_6874_jflex.patch
        49 kB
        David Smiley
      3. LUCENE-6874.patch
        8 kB
        Uwe Schindler
      4. LUCENE-6874-chartokenizer.patch
        27 kB
        Uwe Schindler
      5. LUCENE-6874-chartokenizer.patch
        27 kB
        Uwe Schindler
      6. LUCENE-6874-chartokenizer.patch
        23 kB
        Uwe Schindler
      7. LUCENE-6874-jflex.patch
        56 kB
        Steve Rowe
      8. unicode-ws-tokenizer.patch
        13 kB
        Uwe Schindler
      9. unicode-ws-tokenizer.patch
        13 kB
        Uwe Schindler
      10. unicode-ws-tokenizer.patch
        10 kB
        Uwe Schindler

        Issue Links

          Activity

          Dawid Weiss added a comment -

          Depends what you consider a trap.

          A non-breakable whitespace could be a legitimate way to prevent two tokens from being separated if they need to be tokenized together. An example that comes to my mind is the special "zero-width" space or the hyphenation marker... which even on its own poses a problem [1]...

           Ultimately it should probably be a question of whether we want to tokenize on "whitespace as in formatted text" or "whitespace as in logical codepoint units", and it doesn't apply only to WhitespaceTokenizer but to any tokenizer in general.

          I think WhitespaceTokenizer should tokenize on this.

           Seems like the majority of people would want it to be tokenized, I agree. But if you change this then there is no way to go back to the previous behavior. Currently it's relatively easy to wrap your input in a reader that replaces those problematic codepoints on the fly before they're fed to the tokenizer.

          [1] https://www.cs.tut.fi/~jkorpela/shy.html
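           (A minimal, purely illustrative sketch of that reader-wrapping approach; the helper name and the "input" Reader are assumptions, not part of any patch here:)

             import java.io.FilterReader;
             import java.io.IOException;
             import java.io.Reader;

             /** Wraps a Reader so the non-breaking space variants come out as plain spaces. */
             static Reader replaceNbsp(Reader input) {
               return new FilterReader(input) {
                 @Override public int read() throws IOException {
                   int c = super.read();
                   return (c == '\u00A0' || c == '\u2007' || c == '\u202F') ? ' ' : c;
                 }
                 @Override public int read(char[] cbuf, int off, int len) throws IOException {
                   int n = super.read(cbuf, off, len);
                   for (int i = off; i < off + n; i++) {
                     if (cbuf[i] == '\u00A0' || cbuf[i] == '\u2007' || cbuf[i] == '\u202F') {
                       cbuf[i] = ' ';
                     }
                   }
                   return n;
                 }
               };
             }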

          Adrien Grand added a comment - edited

          So maybe we should solve this problem by adding some documentation?

          Dawid Weiss added a comment -

           Any improvement to the docs that clarifies what the software does would be great.

          Uwe Schindler added a comment -

          My personal opinion on this:

          • The thing is called WhitespaceTokenizer, so it should do what the name says (split on isWhitespace).
          • If we want something else, maybe provide a separate CharTokenizer implementation that also splits on NBSP

           In general, the whitespace tokenizer is not used for "classical" fulltext. For this type of text one would better use StandardTokenizer, ICU's tokenizers, or the language-specific ones for Chinese or Japanese. People using WhitespaceTokenizer are more those who have very special types of fields, like a list of whitespace-separated tokens used for faceting, or stuff like a list of product numbers. These types of tokens were always easy to handle with WhitespaceTokenizer. If you wanted to keep your facet tokens together, you were able to use NBSP! So a change here would be a break for those apps.

          So I would just update documentation to explain what this thing does (splitting on whitespace and not on spaces in general).

          Jack Krupansky added a comment - edited

          +1 for using the Unicode definition of white space rather than the (odd) Java definition. From a Solr user perspective, the fact that Java is used for implementation under the hood should be irrelevant. That said, the Javadoc for WhitespaceTokenizer#isTokenChar does explicitly refer to isWhitespace already.

          The term "non-breaking white space" explicitly refers to line breaking and has no mention of tokens in either Unicode or traditional casual usage.

          From a Solr user perspective, there is like zero value to having NBSP from HTML web pages being treated as if it were not traditional white space.

          From a Solr user perspective, the primary use of whitespace tokenizer is to avoid the fact that standard tokenizer breaks on various special characters such as occur in product numbers.

          One of the ongoing problems in the Solr community is the sheer amount of time spent explaining nuances and gotchas, even if they do happen to be documented somewhere in the fine print - no sane user reads the fine print anyway. No Solr user actually uses WhitespaceTokenizer directly - they reference WhitespaceTokenizerFactory, and then having to drop down to Lucene and Java for doc is way too much to ask a typical Solr user. Our collective goal should be to minimize nuances and gotchas (IMHO.)

          In short, the benefits to Solr users for NBSP being tokenized as white space seem to outweigh any minor use cases for treating it as non-white space. A compatibility mode can be provided if those minor use cases are considered truly worthwhile.

          Ugh... there are plenty of other places in doc for other tokenizers and filters that refer to "whitespace" and need to address this same issue, either to treat NBSP as white space or doc the nuance/gotcha much more thoroughly and effectively.

          OTOH... an alternative view... having so many un/poorly-documented nuances and gotchas is money in the pockets of consultants and a great argument in favor of Solr users maximizing the employment of Solr consultants.

          Robert Muir added a comment -

           I don't think we should make yet another definition of whitespace for Java; there are already effectively 3 (Java isWhitespace(), Java isSpaceChar(), and Unicode whitespace). I think it would be better to just expose "Unicode whitespace" for situations like this.

           Java isSpaceChar is impractical, let's not even go there: it does not include controls such as tabs, so isSpaceChar('\t') == false and so on.

          unicode whitespace is probably more useful and already well-defined:

          http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html#isUWhiteSpace%28int%29
          http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:White_Space=Yes:]

          Uwe Schindler added a comment -

          In short, the benefits to Solr users for NBSP being tokenized as white space seem to outweigh any minor use cases for treating it as non-white space. A compatibility mode can be provided if those minor use cases are considered truly worthwhile.

           As said before: if we want to change this we need a new Tokenizer with a new name and a new factory. Please don't add new matchVersion constants for that because this is a huge break. The Tokenizer does what it should and what is documented: this is not a bug.

           And still this holds: users should prefer StandardTokenizer. The wide usage of WhitespaceTokenizer is caused by tons of example configs from earlier Solr days that use WhitespaceTokenizer together with the broken WordDestroyerFilter. This is indeed only useful for product numbers, not fulltext.

          David Smiley added a comment -

          So maybe we should solve this problem by adding some documentation?

           If the vast majority (like 90%+) of users that currently use WhitespaceTokenizer would want to tokenize on it, then I don't think documentation is sufficient at all. Documenting something most people would want to change is very, very easy to overlook. That's what I call a trap; not that there might be some uses for the current behavior. Lucene should do what most users want it to do by default. As Jack said, the users of the search platform don't care what Java's definition of Character.isWhitespace is.

          I propose WhitespaceTokenizerFactory have a flag for this, and that it default to consider NBSP a space based on Lucene's Version.

           I get Uwe's point that there are other Tokenizers. But I disagree that WhitespaceTokenizer shouldn't be used for "classical full text". For example, StandardTokenizer tokenizes on hyphens and thus foils some of the benefit of WordDelimiterFilter. Maybe ICUTokenizer is an answer; I haven't checked its interaction with WDF. But why can't we just have a tokenizer that simply tokenizes on all whitespace?

          I'll have to see the links Rob just posted; I haven't read them yet.

          Uwe Schindler added a comment -

          unicode whitespace is probably more useful and already well-defined

           I would solve this issue by adding an ICUWhitespaceTokenizer using this definition to the ICU module. Problem solved (it is in fact a small patch with a Tokenizer extending CharTokenizer, plus the factory).

          Robert Muir added a comment -

          You can add a CharTokenizer to ICU analysis module that just looks like this:

            protected boolean isTokenChar(int c) {
              return !UCharacter.isUWhiteSpace(c);
            }
          

          If you are not happy with it needing ICU library, the definition of this property in ICU is "Space characters+TAB+CR+LF-ZWSP-ZWNBSP" (http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UProperty.html#WHITE_SPACE) so it shouldn't be hard to implement with just the jdk, but I do not know about the efficiency of that.

          Either way, I think it should just be a different tokenizer.
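           (For reference, a fuller sketch of such a tokenizer, assuming Lucene 5.x's CharTokenizer base class in oal.analysis.util and ICU4J's UCharacter; the class name follows the one discussed in this thread:)

             import org.apache.lucene.analysis.util.CharTokenizer;
             import com.ibm.icu.lang.UCharacter;

             /** Whitespace tokenizer that splits on the Unicode White_Space property via ICU4J. */
             public final class ICUWhitespaceTokenizer extends CharTokenizer {
               /** A token character is anything that is not Unicode whitespace. */
               @Override
               protected boolean isTokenChar(int c) {
                 return !UCharacter.isUWhiteSpace(c);
               }
             }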

          Uwe Schindler added a comment -

          Robert Muir: I am already preparing a patch

          Uwe Schindler added a comment - edited

          Patch that adds the ICU variant of the tokenizer (it's just copypaste...). This is the only way to do this right: It conforms to Unicode and also uses Unicode's fast reference implementation.

          I also added a test.

          David Smiley added a comment -

          Uwe,
           I beg to differ on WDF, but I think we can put that behind us. It's great to see a solution come together on using ICU's rules.

           For something as trivial as detecting if the character is in a set, do we really need to depend on ICU? It would be so nice to not need it, even if its internal implementation seems fast. Then we could consider deprecating WhitespaceTokenizer since, after all, why would one use it when ICUWhitespaceTokenizer exists? Anyone wanting atypical tokenization could easily subclass CharTokenizer or consider a MappingCharFilter.

          Robert Muir added a comment -

           I don't think we need to deprecate WhitespaceTokenizer; I think it's intuitive for a Java developer, and Lucene is a Java API.

           If you really must avoid ICU: as I already mentioned, you can probably implement it yourself, but I just don't think it will be as fast. You will probably have to precompute and cache for Latin-1 and all kinds of stuff to make it competitive, and it will be messier and so on.

          Uwe Schindler added a comment -

          Then we could consider deprecating WhitespaceTokenizer since, after all, why would one use it when ICUWhitespaceTokenizer exists?

           Because the non-breaking space is useful for stuff (as explained above) where you want to keep tokens together (the Unicode standard speaks about line wrapping, but in any case, like soft hyphen vs. hyphen, it's just a matter of what you want to do: the NBSP just tells the tokenizer, or line-breaker, or whatever you call it, to keep tokens together). The problem is people who misuse &nbsp; in their HTML (e.g. tables). But for stuff I have implemented very often, I used WhitespaceTokenizer to split tokens and placed a non-breaking space to keep tokens together.

           So there is no need to deprecate WhitespaceTokenizer. It does what it should do. ICUWhitespaceTokenizer uses the same naming and does the same, just with different rules.

          ... or consider a MappingCharFilter

           This thing is slow as hell. If you want it faster, use e.g. PatternTokenizer.

          It would be so nice to not need it, even if it's internal implementation seems fast

           The problem is that you need additional 4-way branching: you have to check in isTokenChar() that the character is not isWhitespace() and also exclude all those 3 chars we listed in the description: '\u00A0', '\u2007', '\u202F'.
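           (A minimal illustrative sketch of that branching in plain JDK code, assuming a CharTokenizer subclass; note it would still miss U+0085, as pointed out later in this thread:)

             @Override
             protected boolean isTokenChar(int c) {
               // not Java whitespace, and not one of the non-breaking spaces that Character.isWhitespace() excludes
               return !(Character.isWhitespace(c)
                   || c == '\u00A0' || c == '\u2007' || c == '\u202F');
             }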

           I agree with Robert: we should not change the default WhitespaceTokenizer and also not deprecate it. We should add a new one, which I did in the supplied patch. If we want it in core, let's call it something different and implement isTokenChar in a fast way without the 3 additional branches.

          I beg to differ on WDF

           This is coming from the fact that Solr is often misused because users just give up on thinking about tokenization. WDF only makes sense in product catalogues, but it is definitely broken for fulltext. The product catalogues are of course some of our customers, but before I suggest to them that they should use WhitespaceTokenizer with WordDestroyerFilter, I would analyze their root problem (why is their tokenization broken). This is why I am against the broken example configs in Solr we had in the past. Because WST and WDF should really only be used as a last resort.

          Steve Rowe added a comment -

          A JFlex version would be fast and simple and not require ICU to keep up with Unicode changes. Not sure about needing to cache Latin-1 and other stuff to be competitive. I'll give it a go later today.

          Uwe Schindler added a comment -

          For those people that tend to think that StandardTokenizer is too aggressive, I generally suggest to my users to use ClassicTokenizer if they know that they only have European languages in their index. This is one reason why I was always against removing/deprecating ClassicTokenizer! It does a really good job for plain european text and also keeps most product numbers together (because it has quite good heuristics). It only breaks product numbers with slashes (e.g., "PS-10/20").

          Uwe Schindler added a comment - edited

           One thing to make it fully flexible in Lucene trunk (Java 8 only). I know this would not help Solr users that want to define the Tokenizer in a config file, but for real Lucene users the Java 8 way would be the following static method on CharTokenizer:

          public static CharTokenizer fromPredicate(java.util.function.IntPredicate predicate)
          

          This would allow to define a new CharTokenizer with a single line statement using any predicate:

           // long variant with lambda:
           Tokenizer tok = CharTokenizer.fromPredicate(c -> !UCharacter.isUWhiteSpace(c));

           // method reference (cast to IntPredicate so negate() can be applied):
           Tokenizer tok = CharTokenizer.fromPredicate(((IntPredicate) UCharacter::isUWhiteSpace).negate());

           // method reference to a custom function:
           private boolean myTestFunction(int c) {
             return (crazy condition);
           }

           Tokenizer tok = CharTokenizer.fromPredicate(this::myTestFunction);
          

           I think we should do this in a separate issue in Lucene trunk for Java 8. This is really what Java 8 lambdas are made for. And it's fast as hell, because it's compiled to bytecode so there is no call overhead.

          Uwe Schindler added a comment -

          I opened LUCENE-6879 for the idea (which is related to this issue as it provides a simple and general way suitable for Java 8).

          Steve Rowe added a comment -

           Patch adding a JFlex-based UnicodeWhitespaceTokenizer(Factory), along with some performance testing bits, some of which aren't committable (hard-coded paths). Also includes the SPI entry missing from Uwe's patch for ICUWhitespaceTokenizerFactory in lucene/icu/src/resources/META-INF/services/o.a.l.analysis.util.TokenizerFactory, as well as a couple of bugfixes for lucene/benchmark (which I'll commit under a separate JIRA). The patch also includes all of Uwe's patch.

          I did three performance comparisons on my Macbook Pro with Oracle Java 1.8.0_20 of the Character.isWhitespace()-based WhitespaceTokenizer, Uwe's ICUWhitespaceTokenizer, and the JFlex UnicodeWhitespaceTokenizer:

          1. Using the wstok.alg in the patch, I ran lucene/benchmark over 20k (English news) Reuters docs: dropping the lowest throughput of 5 rounds and averaging the other 4:

           Tokenizer                    Avg tok/sec   Throughput compared to WhitespaceTokenizer
           WhitespaceTokenizer          1.515M        N/A
           ICUWhitespaceTokenizer       1.447M        -5.5%
           UnicodeWhitespaceTokenizer   1.514M        -0.1%

           2. I concatenated all ~20k Reuters docs into one file, loaded it into memory and then ran each tokenizer over it 11 times, discarding info from the first and averaging the other 10 (this is testReuters() in the Test* files in the patch):

           Tokenizer                    Avg tok/sec   Throughput compared to WhitespaceTokenizer
           WhitespaceTokenizer          14.47M        N/A
           ICUWhitespaceTokenizer       9.26M         -36%
           UnicodeWhitespaceTokenizer   11.60M        -20%

          3. I used a fixed random seed and generated 10k random Unicode strings of at most 10k chars using TestUtil.randomUnicodeString(). Note that this is non-realistic data for tokenization, not least because the average whitespace density is very low compared to natural language. In running this test I noticed that WhitespaceTokenizer was returning many more tokens than the other two, and I tracked it down to differences in the definition of whitespace:

           • Character.isWhitespace() returns true for the following while Unicode 6.3.0 (Lucene's current Unicode version) does not: U+001C, U+001D, U+001E, U+001F, U+180E. (U+180E was removed from Unicode's whitespace definition in Unicode 6.3.0. Java 8 uses Unicode 6.2.0.)
          • Unicode 6.3.0 says the following are whitespace while Character.isWhitespace() does not: U+0085, U+00A0, U+2007, U+202F. The last 3 are documented, but U+0085 NEXT LINE (NEL) isn't documented anywhere I can see; it was added to Unicode's whitespace definition in Unicode 3.0 (released 2001).

          So in order to be able to directly compare the performance of three tokenizers over this data, I replaced all non-consensus whitespace characters with a space before running the test.

           Tokenizer                    Avg tok/sec   Throughput compared to WhitespaceTokenizer
           WhitespaceTokenizer          897k          N/A
           ICUWhitespaceTokenizer       880k          -2%
           UnicodeWhitespaceTokenizer   1,605k        +79%

          One other thing I noticed for this test when I compared ICUWhitespaceTokenizer's output with that of UnicodeWhitespaceTokenizer's: they don't always find the same break points. This is because although both forcibly break at the max token length (255 chars, fixed for CharTokenizer and the default for Lucene's JFlex scanners), CharTokenizer allows tokens to exceed its max token char length of 255 by one char when a surrogate pair would otherwise be broken, while Lucene's JFlex scanners break at 254 chars in this case.


          Conclusion: for throughput over realistic ASCII data, the original WhitespaceTokenizer performs best, followed by the JFlex-based tokenizer in this patch (UnicodeWhitespaceTokenizer), followed by the ICU-based ICUWhitespaceTokenizer in Uwe's patch.

          Uwe Schindler added a comment -

           Thanks Steve! I just noticed that your patch contains hardcoded filenames of your local system; I think those are just leftovers from your testing with Reuters. Otherwise I am not happy about the size of the generated files, but that's how JFlex works...

           Sorry for forgetting to add the analyzer factory, I was just too fast in copy-pasting code yesterday. Thanks for adding it.

          Steve Rowe added a comment -

          I just noticed that your patch contains hardcoded filenames of your local system, I think those are just leftovers from your testing with reuters.

          Yup, patch needs cleanup before it can be committed. I figured the decision about what to do hasn't been made yet, so I'll wait on doing that work until then.

          Otherwise I am not happy about the size of the generated files, but thats how jflex works...

          I don't think the generated Java source makes much difference - the thing people will deal with is the JFlex source, and it's fairly compact. I looked at the .class file sizes on my system, and I see 13k for the JFlex version and 2k for the ICU version.

          David Smiley added a comment -

          Nice thorough job Steve!

          I propose that we consolidate the TokenizerFactories here into one – the existing WhitespaceTokenizerFactory. I think this is more user friendly. An attribute could select which whitespace definition the user wants: "java" or "unicode". What do you think?

          Steve Rowe added a comment -

          I propose that we consolidate the TokenizerFactories here into one – the existing WhitespaceTokenizerFactory. I think this is more user friendly. An attribute could select which whitespace definition the user wants: "java" or "unicode". What do you think?

          Implicitly then, you're nixing ICUWhitespaceTokenizer, since it can't be in analyzers-common.

           I'm okay with adding a param to WhitespaceTokenizerFactory (not sure what to name it though: "authority"/"style"/"definition"?). Since the default wouldn't change ("java" would be the default I assume), I don't think we need to introduce luceneMatchVersion.

          Jack Krupansky added a comment -

          Certainly Solr can update its example schemas to use whatever alternative tokenizer or option is decided on so that Solr users, many of whom are not Java developers, will no longer fall into this NBSP trap, but... that still feels like a less than desirable resolution.

          Uwe Schindler, could you elaborate more specifically on the existing use case that you are trying to preserve? I mean, like in terms of a real-world example. Where do some of your NBSPs actually live in the wild?

          It seems to me that the vast majority of normal users would not be negatively impacted by having "white space" be defined using the Unicode model. I never objected to using the Java model, but that's because I had overlooked this nuance of NBSP. My concern for Solr users is that NBSP occurs somewhat commonly in HTML web pages - as a formatting technique more than an attempt at influencing tokenization.

          Steve Rowe added a comment - edited

          My concern for Solr users is that NBSP occurs somewhat commonly in HTML web pages - as a formatting technique more than an attempt at influencing tokenization.

           FYI, &nbsp; is converted to U+0020 by HTMLStripCharFilter.

          Jack Krupansky added a comment -

           Tika is the other (main?) approach to ingesting text from HTML web pages. I haven't checked exactly what it does on &nbsp;.

          Maybe David Smiley could elaborate on which use case he was encountering that inspired this Jira issue.

          Jack Krupansky added a comment -

          Because WST and WDF should really only be used as a last resort.

           Absolutely agreed. From a Solr user perspective we really need a much simpler model for semi-standard tokens out of the box, without the user having to scratch their head and resort to WST in the first (last) place. LOL - maybe if we could eliminate this need to resort to WST, we wouldn't have to fret as much about WST.

          I generally suggest to my users to use ClassicTokenizer

           Personally, I've always refrained from recommending CT since I thought ST was supposed to replace it and that the email and URL support was considered an excess not worth keeping. I've considered CT as if it were deprecated (which it is not.) And, I never see anybody else recommending it on the user list. And, the fact that it can't handle slashes in product numbers is a deal killer. I'm not sure that I would argue in favor of resurrecting CT as a first-class recommendation, especially since it can't handle non-European languages, but...

          That said, I do think it is worth separately (from this Jira) considering a fresh, new tokenizer that starts with the goodness of ST and adds in an approximation of the reasons that people resort to WST. Whether that can be an option on ST or has to be a separate tokenizer would need to be debated. I'd prefer an option on ST, either to simply allow embedded special characters or to specify a list or regex of special character to be allowed or excluded.

          People would still need to combine NewT with WDF, but at least the tokenization would be more explicit.

           Personally I would prefer to see an option for whether to retain or strip external punctuation vs. embedded special characters. Trailing periods and commas and colons and enclosing parentheses are just the kinds of things we had to resort to WDF for when using WST to retain embedded special characters.

          And if people really want to be ambitious, a totally new tokenizer that subsumed the good parts of WDF would make a lot of lives of Solr users much easier.

          David Smiley added a comment -

          Jack: My use-case since you asked: I've got a document store of content in XML that provides various markup around mostly text. These documents occasionally have an NBSP. I process it outside of Solr to produce the text I want indexed/stored – it's not XML any more. An NBSP entity, if found, is converted to the NBSP character naturally as part of Java's XML libraries (no explicit decision on my part).

          Implicitly then, you're nixing ICUWhitespaceTokenizer, since it can't be in analyzers-common.

          Right; ah well.

          RE what to name the attribute: I suggest "definition" or even better: "rule" (or "ruleset")

           I do think the first sentence of the javadocs for these whitespace tokenizers should point out which definition of whitespace is chosen. And that they reference each other, so that anyone stumbling on one will know of the other.

           RE WDF: I prefer WhitespaceTokenizer with WDF for not just product-id data but also full-text. Full-text might contain product-ids, or have things like "wi-fi" and many other words, like say "thread-safe" or "co-worker", that are sometimes hyphenated, sometimes not; some of these might be space-separated; etc. WDF is very flexible, but if you use a Tokenizer like Standard* or Classic* then the hyphen will be pre-tokenized before WDF can do its thing, neutering part of its benefit. I wish WDF kept payloads and other attributes; but it's not the only offender here, and likewise for the bigger issue of positionLength. Otherwise I'm a WDF fan. Nonetheless I like some of Jack's ideas on a better tokenizer that subsumes WDF.

           BTW, FWIW, if I had to write a WhitespaceTokenizer from scratch, I'd implement it as a bitset for characters < 65k (this is 8KB of memory). For the remainder I'd use an array that is scanned; but it appears there are none beyond 65k as I look at a table of these chars from a quick Google search. Then a configurable definition loader could fill named whitespace rules, and it might be configurable to add or remove certain codes. But no need to bother; Steve's impl is fine.
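           (A rough, purely illustrative sketch of that idea: precompute the Unicode White_Space code points for the BMP into a java.util.BitSet; the class name is hypothetical and the seed list below is the White_Space set as of Unicode 6.3, which would normally be generated from the Unicode data files:)

             import java.util.BitSet;

             final class WhitespaceBitSet {
               private static final BitSet WS = new BitSet(0x10000);
               static {
                 // Unicode 6.3 White_Space code points; all of them are in the BMP
                 for (int c : new int[] {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0x85, 0xA0, 0x1680,
                     0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,
                     0x2009, 0x200A, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000}) {
                   WS.set(c);
                 }
               }
               static boolean isWhitespace(int c) {
                 return c < 0x10000 && WS.get(c);  // nothing above the BMP is whitespace
               }
             }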

          Yonik Seeley added a comment -

          I'd implement it as a bitset for characters < 65k

           A single word used as a bitmask can be useful for quickly ruling out whitespace and checking for common whitespace. Example:
          https://github.com/yonik/noggit/blob/master/src/main/java/org/noggit/JSONParser.java#L241

          David Smiley added a comment -

          Here's an updated patch working from Steve's:

          • The existing WhitespaceTokenizerFactory can be configured via the "rule" parameter to use the "unicode" rule or the "java" (default) rule. The maxTokenLength parameter is now here too. If you use the "java" rule then maxTokenLength, if specified, is only permitted to be 255. The UnicodeWhitespaceTokenizerFactory was removed since it's now combined.
            • added a simple testFactory test for this factory
          • Tweaked the javadocs as I mentioned.
          • Removed some of the test methods Steve added that were actually not tests but performance measurements (that also wrote to stderr).
          • Resolved various pre-commit issues (ASL header, svn props)

          If I hear no more feedback then I plan to commit Tuesday night (EST)
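           (A hypothetical usage sketch of the proposed "rule" parameter described in the list above, assuming the factory and argument names from this patch:)

             import java.util.HashMap;
             import java.util.Map;
             import org.apache.lucene.analysis.Tokenizer;
             import org.apache.lucene.analysis.core.WhitespaceTokenizerFactory;

             Map<String, String> args = new HashMap<>();
             args.put("rule", "unicode");  // or "java" (the default) for Character.isWhitespace() behavior
             Tokenizer tok = new WhitespaceTokenizerFactory(args).create();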

          Robert Muir added a comment -

          The shared factory is confusing: this is supposed to be a simple thing. Now we have some parameters that depend on other parameters and so on. Please, make two factories. We have to be able to maintain this stuff.

          Why is the ICUWhitespace being added?

          David Smiley added a comment -

          Sorry, I really disagree with you on this. I don't think this WhitespaceTokenizerFactory is hard to maintain at all. It's true that it's harder only because it was a trivial factory before but so what? Most importantly, I think it's a better user experience – nobody should care what the specific Java Tokenizer implementation class will be coming out of the factory – it's a tokenizer on whitespace using whatever definition/rule of whitespace they configured. That could hypothetically be implemented using one Java Tokenizer implementing class or multiple but that's an implementation detail.

          Why is the ICUWhitespace being added?

          I'll remove that in a new patch; I wasn't sure what to do but it's redundant so no need for it.

          Uwe Schindler added a comment -

          Yeah remove it! LUCENE-6879 is enough to quickly define your own WhitespaceTokenizer with ICU, if you want.

          Adrien Grand added a comment -

          I tend to like Uwe's idea. I have often wondered what the actual use-cases of WhitespaceTokenizer were, but did not suggest removing it since the cost of maintenance was very low given its simplicity. However, now that some controversy is arising, and given how simple it is to create character-based tokenizers in trunk (Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(Character::isWhitespace);), maybe we should just remove this tokenizer and let users define it themselves with the more flexible CharTokenizer.fromSeparatorCharPredicate?
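
          Spelled out a bit (a minimal sketch of the trunk API referenced above; the sample input is only illustrative):

             import java.io.StringReader;
             import org.apache.lucene.analysis.Tokenizer;
             import org.apache.lucene.analysis.util.CharTokenizer;

             // Same behavior as today's WhitespaceTokenizer: U+00A0 is NOT a separator here,
             // because Character.isWhitespace() excludes non-breaking spaces.
             Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(Character::isWhitespace);
             tok.setReader(new StringReader("foo\u00A0bar baz"));  // tokens: "foo\u00A0bar", "baz"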

          David Smiley added a comment -

          Just for clarification, Adrien, are you suggesting that WhitespaceTokenizerFactory stays, but WhitespaceTokenizer gets removed (because it's easy to define one)? I'm +1 to that.

          Uwe Schindler added a comment -

          I would be fine to remove WhitespaceTokenizer in Lucene trunk. In that case I would also like to move the abstract CharTokenizer class out of oal.analysis.util to oal.analysis.core. This is not a big deal, but the util package is not the right place for a first-class citizen.

          I also have another idea about this issue: I would prefer not to have the large Java code generated by JFlex involved. Wouldn't it be possible to save the isWhitespace data of Unicode in a compressible Lucene bitset and save it to disk as a resource file? We could then load the bitset (like deleted documents) from a resource file and wrap a simple CharTokenizer.fromSeparatorCharPredicate(c -> compressedBitset.get(c)) on top? The bitset could be generated from Unicode data on "ant regenerate".
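
          A minimal sketch of that idea, assuming a hypothetical loadWhitespaceBitSet() helper that deserializes the generated data from a resource file (none of these names come from an actual patch):

             // Hypothetical: reads the pre-generated Unicode whitespace data from a resource file.
             java.util.BitSet whitespace = loadWhitespaceBitSet();
             // BitSet.get(int) matches the IntPredicate expected by CharTokenizer.
             Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(whitespace::get);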

          David Smiley added a comment -

          Uwe,
          Why persist the bitset and deal with the build issues around that when instead it could be done in-memory in a static initializer? It's so cheap to build that I question the effort in pre-building it as part of the build process.

          On 5.x, Uwe, Adrien, how do you feel about the WhitespaceTokenizerFactory I have in the patch with the "rule" attribute to pick?

          Uwe Schindler added a comment -

          Why persist the bitset and deal with the build issues around that when instead it could be done in-memory in a static initializer? It's so cheap to build that I question the effort in pre-building it as part of the build process.

          Where would the information for the bitset come from? The Unicode data as exposed by the java.lang.Character class is wrong! My idea was to use a Unicode data file and extract all Whitespace characters in a build tool. Shipping the Unicode data file would be a large overhead.

          On 5.x, Uwe, Adrien, how do you feel about the WhitespaceTokenizerFactory I have in the patch with the "rule" attribute to pick?

          I am not happy, but could live with it.

          Steve Rowe added a comment -

          My idea was to use a Unicode data file and extract all Whitespace characters in a build tool. Shipping the Unicode data file would be a large overhead.

          The JFlex project has a similar requirement, but for many more properties than just Whitespace. JFlex includes a Maven plugin used by the build that parses Unicode data files via (you guessed it) JFlex scanners - here's the JFlex spec for the parser for binary property data files, including PropList.txt, which holds the Whitespace property definition: https://github.com/jflex-de/jflex/blob/master/jflex-unicode-maven-plugin/src/main/jflex/BinaryPropertiesFileScanner.flex

          Note: Unicode property names can have aliases, and "loose" matching is the recommended way to refer to them (see http://unicode.org/reports/tr18/#Categories ): match case-insensitively, and ignore whitespace, dashes, and underscores. PropList.txt gives the Whitespace property name as White_Space, and PropertyAliases.txt lists WSpace and space as aliases.
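
          For illustration, "loose" matching boils down to normalizing both names before comparing them (an illustrative helper only, not JFlex or Lucene code; aliases such as WSpace still need PropertyAliases.txt):

             static String loosen(String propertyName) {
               // case-insensitive; whitespace, dashes, and underscores are ignored
               return propertyName.toLowerCase(java.util.Locale.ROOT).replaceAll("[\\s_\\-]", "");
             }
             // loosen("White_Space").equals(loosen("whitespace"))   -> true
             // loosen("White_Space").equals(loosen("white-space"))  -> true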

          Uwe Schindler added a comment - edited

          Cool!

          So my idea would be to write a small tool in the analysis/commons/src/tools module that uses the current JFlex JAR file on the classpath and extracts the values for the bitset from it by accessing this class: https://github.com/jflex-de/jflex/blob/master/jflex/src/main/java/jflex/unicode/data/Unicode_6_3.java

          In my opinion this would be more elegant than creating a huge JFlex UnicodeWhitespaceTokenizer in a separate submodule, if we could just use a code generator that produces a BitSet to be used in a CharTokenizer subclass. CharTokenizer is thoroughly tested, so just feeding it with a bitset would be my preference.

          Would this work?

          EDIT: I would let a groovy script do this, no java code needed. Just run <groovy classpath="jflex.jar">....</groovy> and write the result to the UnicodeCharTokenizer file.

          Steve Rowe added a comment -

          Would this work?

          Yes, but I think ICU4J is more authoritative, and already effectively pins the Unicode version used in Lucene, so I'd recommend going with it instead of JFlex to extract characters with property X.
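
          A rough sketch of what "extract characters with property X" could look like with ICU4J (only the ICU4J calls are the point here; the output format is illustrative):

             import com.ibm.icu.lang.UCharacter;
             import com.ibm.icu.lang.UProperty;

             public class DumpWhitespace {
               public static void main(String[] args) {
                 System.out.println("Unicode version: " + UCharacter.getUnicodeVersion());
                 for (int cp = 0; cp <= 0x10FFFF; cp++) {
                   if (UCharacter.hasBinaryProperty(cp, UProperty.WHITE_SPACE)) {
                     System.out.printf("U+%04X%n", cp);  // every code point with the White_Space property
                   }
                 }
               }
             }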

          Uwe Schindler added a comment -

          My idea was to create the whitespace chars as int[] array for every unicode version and allow WhitespaceTokenizer to specify the version. As we have the data files already, the same groovy code against jflex data (or ICU4J) could be used to build LetterTokenizer, too?

          This would also allow backwards compatibility. By default, WhitespaceTokenizer would use the latest Unicode version, unless you specify a version (as an enum constant).

          Uwe Schindler added a comment -

          ...hacking Groovy script using ICU4J as specified in ivy-versions.properties...

          Steve Rowe added a comment -

          My idea was to create the whitespace chars as int[] array for every unicode version

          Each ICU4J release only has data for the (single) Unicode version it was built against, so it won't work for this purpose.

          Uwe Schindler added a comment -

          Here is my ICU datasucker. The lines with System.println should now just write the class file

          Call with: ant unicode-tokenizers

          Uwe Schindler added a comment -

          New simplified version, now printing the Unicode version (7.0 at the moment, with ICU4J as specified in the versions file).

          Uwe Schindler added a comment -

          Sorry, my fault: it must be UCharacter.isUWhiteSpace(); the result is then:

             [groovy] Unicode version: 7.0.0.0
             [groovy] Whitespace: 9, 10, 11, 12, 13, 32, 133, 160, 5760, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8199, 8200, 8201, 8202, 8232, 8233, 8239, 8287, 12288
          
          Steve Rowe added a comment -

          Uwe, you're using UCharacter.isWhitespace(), but that's the same as the problematic Java Character.isWhitespace() – note the exclusion of U+00A0 in your output Whitespace char list. http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html says what you want is isUWhiteSpace(c) or hasBinaryProperty(c, UProperty.WHITE_SPACE)
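
          Concretely, for U+00A0 (NO-BREAK SPACE), the difference is:

             Character.isWhitespace('\u00A0');                     // false - Java excludes non-breaking spaces
             com.ibm.icu.lang.UCharacter.isUWhiteSpace('\u00A0');  // true  - Unicode White_Space property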

          Uwe Schindler added a comment -

          Sorry, I updated my post; I noticed this a minute ago.

          Uwe Schindler added a comment -

          Now the right patch.

          Uwe Schindler added a comment - edited

          Here is my patch with the UnicodeWhitespaceTokenizer, small & elegant.

          I have not added tests; I think we can take those from Steve Rowe. I did not check performance either. It is time for bed now.

          To regenerate, use ant regenerate or alternatively ant unicode-data.

          Uwe Schindler added a comment -

          Minor changes

          David Smiley added a comment - edited

          +1 I like it, Uwe; nice job. Automating the generation of the Unicode code points is good, and may come in handy if we want other whitespace rules/definitions.

          Uwe Schindler added a comment -

          New patch:

          • Fix formatting (use hex) and line breaks in generator
          • Move the bounds check to the returned Bitset impl
          • Move UnicodeData to oal.analysis.util package

          No tests yet; I have to copy them from Steve Rowe's patch later.

          Uwe Schindler added a comment -

          Cleanup.

          In trunk you can now create the UnicodeWhitespaceTokenizer with the following code:

          CharTokenizer.fromSeparatorCharPredicate(UnicodeProps.WHITESPACE::get);
          

          So the extra class is theoretically obsolete, but it is provided for convenience (and is still required for 5.x).

          Uwe Schindler added a comment -

          LUCENE-6874-chartokenizer.patch is now the merged patch:

          • Contains the WhitespaceTokenizerFactory from David Smiley, but without maxTokenLength (it was specific to the JFlex implementation)
            • Adds the tests by Steve Rowe
          • The ICUWhitespaceTokenizer is gone (as Robert Muir requested)
          • Modifies the ALG file to exclude this tokenizer. I have not yet tested the ALG file, and I have not done a performance test yet either.

          I like the new solution more than the JFlex-based tokenizer, because it is smaller and maybe also faster (not yet tested). Using JFlex for this is not really a good idea, because the tokenization algorithm is very simple. CharTokenizer, which is thoroughly tested, is the better match for it.

          Uwe Schindler added a comment -

          Adds more tests (random strings) and adds UnicodeWhitespaceAnalyzer (for consistency & to test).

          Uwe Schindler added a comment - edited

          Here is the output of the reuters test:

          ------------> Report Sum By (any) Name and Round (28 about 33 out of 34)
          Operation   round   runCnt   recsPerRun   rec/s   elapsedSec   avgUsedMem   avgTotalMem
          AnalyzerFactory(name:WhitespaceTokenizer,WhitespaceTokenizer(rule:java))   0   1   0   0.00   0.00   9,569,344   124,256,256
          AnalyzerFactory(name:UnicodeWhitespaceTokenizer,WhitespaceTokenizer(rule:unicode))   0   1   0   0.00   0.00   9,569,344   124,256,256
          Rounds_5   0   1   24493540   360,841.19   67.88   16,566,472   124,256,256
          NewAnalyzer(WhitespaceTokenizer)   0   1   0   0.00   0.00   9,569,344   124,256,256
          [Character.isWhitespace()] WhitespaceTokenizer   0   1   2449354   331,038.53   7.40   22,121,256   124,256,256
          Seq_20000   0   2   2449354   344,131.22   14.23   22,121,256   118,489,088
          NewAnalyzer(UnicodeWhitespaceTokenizer)   0   1   0   0.00   0.00   22,121,256   112,721,920
          [UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer   0   1   2449354   358,302.22   6.84   22,121,256   112,721,920
          NewAnalyzer(WhitespaceTokenizer)   1   1   0   0.00   0.00   12,138,024   112,721,920
          [Character.isWhitespace()] WhitespaceTokenizer   1   1   2449354   366,724.66   6.68   22,374,536   112,721,920
          Seq_20000   1   2   2449354   365,139.25   13.42   27,477,352   117,702,656
          NewAnalyzer(UnicodeWhitespaceTokenizer)   1   1   0   0.00   0.00   22,374,536   111,673,344
          [UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer   1   1   2449354   363,567.47   6.74   32,580,168   122,683,392
          NewAnalyzer(WhitespaceTokenizer)   2   1   0   0.00   0.00   32,580,168   122,683,392
          [Character.isWhitespace()] WhitespaceTokenizer   2   1   2449354   365,793.59   6.70   33,461,280   122,683,392
          Seq_20000   2   2   2449354   365,112.03   13.42   33,461,280   117,178,368
          NewAnalyzer(UnicodeWhitespaceTokenizer)   2   1   0   0.00   0.00   33,461,280   111,673,344
          [UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer   2   1   2449354   364,432.97   6.72   33,461,280   111,673,344
          NewAnalyzer(WhitespaceTokenizer)   3   1   0   0.00   0.00   10,836,464   111,673,344
          [Character.isWhitespace()] WhitespaceTokenizer   3   1   2449354   367,660.47   6.66   12,451,400   111,673,344
          Seq_20000   3   2   2449354   365,820.94   13.39   13,235,672   111,673,344
          NewAnalyzer(UnicodeWhitespaceTokenizer)   3   1   0   0.00   0.00   12,451,400   111,673,344
          [UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer   3   1   2449354   363,999.69   6.73   14,019,944   111,673,344
          NewAnalyzer(WhitespaceTokenizer)   4   1   0   0.00   0.00   14,019,944   111,673,344
          [Character.isWhitespace()] WhitespaceTokenizer   4   1   2449354   367,329.62   6.67   15,061,368   111,673,344
          Seq_20000   4   2   2449354   365,057.59   13.42   15,813,920   111,673,344
          NewAnalyzer(UnicodeWhitespaceTokenizer)   4   1   0   0.00   0.00   15,061,368   111,673,344
          [UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer   4   1   2449354   362,813.50   6.75   16,566,472   111,673,344
          

          As you can see, both tokenizers run at almost the same speed. I was a little afraid that SparseFixedBitSet could slow things down, but it behaves very nicely and has a small memory footprint.

          Uwe Schindler added a comment -

          Cleanup of the benchmark. To me it looks OK. Comments from others?

          David Smiley added a comment -

          +1 Patch is good Uwe.

          Uwe Schindler added a comment -

          If nobody objects, I will commit this tomorrow.

          ASF subversion and git services added a comment -

          Commit 1714354 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1714354 ]

          LUCENE-6874: Add a new UnicodeWhitespaceTokenizer to analysis/common that uses Unicode character properties extracted from ICU4J to tokenize text on whitespace

          ASF subversion and git services added a comment -

          Commit 1714355 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1714355 ]

          Merged revision(s) 1714354 from lucene/dev/trunk:
          LUCENE-6874: Add a new UnicodeWhitespaceTokenizer to analysis/common that uses Unicode character properties extracted from ICU4J to tokenize text on whitespace

          Uwe Schindler added a comment -

          Thanks for the fruitful discussions! I hope Steve Rowe is not unhappy that we did not use JFlex for this simple case and instead used Unicode data with the already existing CharTokenizer.

          Steve Rowe added a comment -

          Thanks for the fruitful discussions! I hope Steve Rowe is not unhappy that we did not use JFlex for this simple case and instead used Unicode data with the already existing CharTokenizer.

          No worries, +1 to your work Uwe. You've also laid the groundwork for future simple Unicode property-based char tokenizers, which is nice. Thanks!


            People

            • Assignee:
              Uwe Schindler
            • Reporter:
              David Smiley
            • Votes:
              0
            • Watchers:
              7
