Lucene - Core
  LUCENE-2167

Implement StandardTokenizer with the UAX#29 Standard

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      It would be really nice for StandardTokenizer to adhere to the standard as closely as we can with JFlex. Then its name would actually make sense.

      Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:

      This should be a good tokenizer for most European-language documents

      The new StandardTokenizer could then say

      This should be a good tokenizer for most languages.

      All the English/Euro-centric handling, like the acronym/company/apostrophe rules, can stay with that EuropeanTokenizer, and it could be used by the European analyzers.

      1. LUCENE-2167.benchmark.patch
        34 kB
        Steve Rowe
      2. LUCENE-2167.benchmark.patch
        33 kB
        Steve Rowe
      3. LUCENE-2167.benchmark.patch
        31 kB
        Steve Rowe
      4. LUCENE-2167.patch
        885 kB
        Steve Rowe
      5. LUCENE-2167.patch
        831 kB
        Steve Rowe
      6. LUCENE-2167.patch
        874 kB
        Steve Rowe
      7. LUCENE-2167.patch
        887 kB
        Steve Rowe
      8. LUCENE-2167.patch
        588 kB
        Steve Rowe
      9. LUCENE-2167.patch
        529 kB
        Robert Muir
      10. LUCENE-2167.patch
        812 kB
        Robert Muir
      11. LUCENE-2167.patch
        746 kB
        Steve Rowe
      12. LUCENE-2167.patch
        859 kB
        Steve Rowe
      13. LUCENE-2167.patch
        53 kB
        Steve Rowe
      14. LUCENE-2167.patch
        50 kB
        Steve Rowe
      15. LUCENE-2167.patch
        50 kB
        Steve Rowe
      16. LUCENE-2167.patch
        49 kB
        Steve Rowe
      17. LUCENE-2167.patch
        47 kB
        Steve Rowe
      18. LUCENE-2167.patch
        46 kB
        Steve Rowe
      19. LUCENE-2167.patch
        56 kB
        Steve Rowe
      20. LUCENE-2167.patch
        56 kB
        Steve Rowe
      21. LUCENE-2167.patch
        2 kB
        Shyamal Prasad
      22. LUCENE-2167.patch
        3 kB
        Shyamal Prasad
      23. LUCENE-2167-jflex-tld-macro-gen.patch
        14 kB
        Uwe Schindler
      24. LUCENE-2167-jflex-tld-macro-gen.patch
        14 kB
        Uwe Schindler
      25. LUCENE-2167-jflex-tld-macro-gen.patch
        14 kB
        Uwe Schindler
      26. LUCENE-2167-lucene-buildhelper-maven-plugin.patch
        39 kB
        Steve Rowe
      27. standard.zip
        162 kB
        Robert Muir
      28. StandardTokenizerImpl.jflex
        14 kB
        Steve Rowe

        Issue Links

          Activity

          Shyamal Prasad added a comment -

          Patch fixes the Javadoc with the suggested text and adds test cases to motivate the change.

          Robert Muir added a comment -

          Hi Shyamal, I am not sure we should document this behavior; instead we should improve StandardAnalyzer.

          Like you said, it is hard to make everyone happy, but we now have a mechanism to improve things, based on the Version constant you provide.
          For example, in a future release we hope to be able to use JFlex 1.5, which has greatly improved Unicode support.

          You can try your examples against the Unicode segmentation standards here to get a preview of what this might look like: http://unicode.org/cldr/utility/breaks.jsp

          Shyamal Prasad added a comment -

          Hi Robert, I presume that when you say we should "instead improve standard analyzer" you mean the code should work more like the original Javadoc states it should? Or are you suggesting moving to JFlex 1.5?

          The problem I observed was that the current JFlex rules don't implement what the Javadoc says is the behavior of the tokenizer. I'd be happy to spend some time on this if I could get some direction on where I should focus.

          Robert Muir added a comment -

          Hi Robert, I presume that when you say we should "instead improve standard analyzer" you mean the code should work more like the original Javadoc states it should?

          Shyamal, I guess what I am saying is that I would prefer the Javadoc of StandardTokenizer to be a little vague as to exactly what it does.
          I would actually prefer it have less detail than it currently has: in my opinion it starts getting into nitty-gritty details of what could be considered Version-specific.

          I'd be happy to spend some time on this if I could get some direction on where I should focus.

          If you have fixes to the grammar, I would prefer this over 'documenting buggy behavior'. LUCENE-2074 gives us the capability to fix bugs without breaking backwards compatibility.

          Shyamal Prasad added a comment -

          Hi Robert,

          It's been a while but I finally got around to working on the grammar. Clearly, much of this is an opinion, so I finally stuck to the one minor change that I believe is arguably an improvement. Previously comma separated fields containing digits would be mistaken for numbers and combined into a single token. I believe this is a mistake because part numbers etc. are rarely comma separated, and regular text that is comma separated is not uncommon. This is also the problem I ran into in real life when using Lucene.

          This patch stops treating comma separated tokens as numbers when they contain digits.

          I did not include the patched Java file since I don't know what JFlex version I should use to create it (I used JFlex 1.4.3, and test-tag passes with JDK 1.5/1.6; I presume the Java 1.4 compatibility comment in the generated file is now history?).

          Let me know if this is headed in a useful direction.

          Cheers!
          Shyamal

          Robert Muir added a comment -

          Clearly, much of this is an opinion, so I finally stuck to the one minor change that I believe is arguably an improvement. Previously comma separated fields containing digits would be mistaken for numbers and combined into a single token. I believe this is a mistake because part numbers etc. are rarely comma separated, and regular text that is comma separated is not uncommon.

          I don't think it really has to be; I am actually of the opinion that StandardTokenizer should follow Unicode standard tokenization. Then we can throw subjective decisions away and stick with a standard.

          In this example, I think the change would be bad, as the comma is treated differently depending upon context: it is a decimal separator and a thousands separator in many languages, including English. So the treatment of the comma depends upon the previous character.

          This is why, in Unicode, the comma has the MidNum Word_Break property.

          Shyamal Prasad added a comment -

          I don't think it really has to be; I am actually of the opinion that StandardTokenizer should follow Unicode standard tokenization. Then we can throw subjective decisions away and stick with a standard.

          Yep, I see I was aiming at the wrong ambition level by only tweaking the existing grammar. I'll take a crack at understanding Unicode standard tokenization, as you'd suggested originally, and try to produce something as soon as I get a chance. I see your point.

          Cheers!
          Shyamal

          Robert Muir added a comment -

          I'll take a crack at understanding Unicode standard tokenization, as you'd suggested originally, and try to produce something as soon as I get a chance.

          I would love it if you could produce a grammar that implemented UAX#29!

          If so, in my opinion it should become the StandardAnalyzer for the next Lucene version. If I thought I could do it correctly, I would have already done it, as the support for the Unicode properties needed to do this is now in the trunk of JFlex!

          Here are some references that might help:
          The standard itself: http://unicode.org/reports/tr29/

          particularly the "Testing" portion: http://unicode.org/reports/tr41/tr41-5.html#Tests29

          Unicode provides a WordBreakTest.txt file that we could use from JUnit to help verify correctness: http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakTest.txt

          I'll warn you I think it might be hard, but perhaps it's not that bad. In particular, the standard is defined in terms of "chained" rules, and JFlex doesn't support rule chaining, but I am not convinced we need rule chaining to implement WordBreak (maybe LineBreak does, but maybe WordBreak can be done easily without it?).

          Steven Rowe is the expert on this stuff, maybe he has some ideas.

          Robert Muir added a comment -

          By the way, here is a statement from the standard that seems to confirm my suspicions:

          In section 6.3, there is an example of the grapheme cluster boundaries converted into a simple regex (the kind we could do easily in JFlex now that it has the properties available).

          They make this statement: Such a regular expression can also be turned into a fast, deterministic finite-state machine. Similar regular expressions are possible for Word boundaries. Line and Sentence boundaries are more complicated, and more difficult to represent with regular expressions.

          Steve Rowe added a comment - edited

          I wrote word break rules grammar specifications for JFlex 1.5.0-SNAPSHOT and both Unicode versions 5.1 and 5.2 - you can see the files here:

          http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/

          The files are UnicodeWordBreakRules_5_*.* - these are written to: parse the Unicode test files; run the generated scanner against each composed test string; output the break opportunities/prohibitions in the same format as the test files; and then finally compare the output against the test file itself, looking for a match. (These tests currently pass.)

          The .flex files would need to be significantly changed to be used as a StandardTokenizer replacement, but you can get an idea from them how to implement the Unicode word break rules in (as yet unreleased version 1.5.0) JFlex syntax.
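
          As an illustration of the approach described above, here is a minimal Java sketch of parsing the WordBreakTest.txt format (÷ marks a break opportunity, × marks a prohibited break, with hex code points in between); the commented-out wordBreakOffsets() call is a hypothetical placeholder for whatever the scanner under test exposes, not an API from these files.

          import java.io.BufferedReader;
          import java.io.FileReader;
          import java.util.ArrayList;
          import java.util.List;

          // Sketch only: parse WordBreakTest.txt and collect the expected boundary
          // offsets for each test string, then compare against the scanner under test.
          public class WordBreakTestSketch {
            public static void main(String[] args) throws Exception {
              BufferedReader in = new BufferedReader(new FileReader("WordBreakTest.txt"));
              String line;
              while ((line = in.readLine()) != null) {
                if (line.length() == 0 || line.charAt(0) == '#') continue;   // skip blank/comment lines
                String data = line.split("#", 2)[0].trim();                  // drop the trailing comment
                StringBuilder text = new StringBuilder();
                List<Integer> expectedBreaks = new ArrayList<Integer>();
                for (String part : data.split("\\s+")) {
                  if (part.equals("\u00F7")) {                     // '÷' : break opportunity here
                    expectedBreaks.add(text.length());
                  } else if (part.equals("\u00D7")) {              // '×' : no break here
                    // nothing to record
                  } else {
                    text.appendCodePoint(Integer.parseInt(part, 16)); // hex code point
                  }
                }
                // Hypothetical comparison against the generated scanner's boundaries:
                // assertEquals(expectedBreaks, wordBreakOffsets(text.toString()));
              }
              in.close();
            }
          }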

          Robert Muir added a comment -

          Steven, thanks for providing the link.

          I guess this is the point where I also say: I think it would be really nice for StandardTokenizer to adhere to the standard as closely as we can with JFlex (I realize that in 1.5 we won't have > 0xFFFF support). Then its name would actually make sense.

          In my opinion, such a transition would involve something like renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:

          This should be a good tokenizer for most European-language documents
          

          The new StandardTokenizer could then say

          This should be a good tokenizer for most languages.
          

          All the English/Euro-centric stuff like the acronym/company/apostrophe handling could stay with that "EuropeanTokenizer" or whatever it's called, and it could be used by the European analyzers.

          But if we implement the Unicode rules, I think we should drop all this English/Euro-centric stuff from StandardTokenizer. Otherwise it should be called StandardishTokenizer.

          We can obviously preserve backwards compatibility with Version, as Uwe has created a way to use a different grammar for a different Version.

          I expect some -1s to this; waiting for comments.

          Shyamal Prasad added a comment -

          Robert Muir wrote:

          I would love it if you could produce a grammar that implemented UAX#29!

          If so, in my opinion it should become the StandardAnalyzer for the next Lucene version. If I thought I could do it correctly, I would have already done it, as the support for the Unicode properties needed to do this is now in the trunk of JFlex!

          I'm not smart enough to know if I should even try to do it at all (let alone correctly), but I am always willing to learn! Thanks for the references, I will certainly give it an honest try.

          /Shyamal

          Steve Rowe added a comment -

          (stole Robert's comment to change the issue description)

          Steve Rowe added a comment -

          Patch implementing a UAX#29 tokenizer, along with most of Robert's TestICUTokenizer tests (I left out tests for Thai, Lao, and breaking at 4K chars, none of which are features of this tokenizer) - I re-upcased the downcased expected terms, and un-normalized the trailing Greek lowercase sigma in one of the expected terms in testGreek().

          Steve Rowe added a comment -

          I want to test performance relative to StandardTokenizer and ICUTokenizer, and also consider switching from lookahead chaining to a single regular expression per term type to improve performance.

          Steve Rowe added a comment -

          I ran contrib/benchmark over 10k Reuters docs with tokenization-only analyzers; Sun JDK 1.6, Windows Vista/Cygwin; best of five:

          Operation           recsPerRun       rec/s  elapsedSec
          StandardTokenizer      1262799  655,318.62        1.93
          ICUTokenizer           1268451  536,116.25        2.37
          UAX29Tokenizer         1268451  524,586.88        2.42

          I think UAX29Tokenizer is slower than StandardTokenizer because it does the lookahead/chaining thing. Still, decent performance.
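
          For context, a "tokenization-only analyzer" here just wraps the tokenizer with no filters; a minimal sketch is shown below (the UAX29Tokenizer(Reader) constructor is an assumption and may differ from the attached patch):

          import java.io.Reader;
          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.TokenStream;

          // Sketch of a tokenization-only Analyzer for contrib/benchmark runs.
          // The UAX29Tokenizer(Reader) constructor is an assumption; substitute
          // whatever constructor the attached patch actually provides.
          public final class UAX29TokenizerOnlyAnalyzer extends Analyzer {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
              return new UAX29Tokenizer(reader); // no filters: measure tokenization alone
            }
          }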

          Robert Muir added a comment -

          Hi Steve, this is great progress!

          Looking at the code/perf, is there any way to avoid the CharBuffer.wrap calls in updateAttributes()?

          It seems that since you are just appending, it might be better to use something "append"-like:

          int newLength = termAtt.length() + <length of text you are appending from zzBuffer>;
          char bufferWithRoom[] = termAtt.resizeBuffer(newLength);
          System.arraycopy(from zzBuffer into bufferWithRoom, starting at termAtt.length());
          termAtt.setLength(newLength);
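
          A concrete version of that suggestion might look like the following sketch; the zzBuffer/zzStartRead/zzMarkedPos fields stand in for the JFlex-generated scanner's internals and are assumptions here, not code from the attached patches.

          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

          // Hypothetical helper mirroring the suggestion above: append the current
          // JFlex match to the term attribute without CharBuffer.wrap().
          class AppendSketch {
            private char[] zzBuffer;   // JFlex input buffer (assumed field)
            private int zzStartRead;   // start of the current match (assumed field)
            private int zzMarkedPos;   // end of the current match (assumed field)

            void appendMatchTo(CharTermAttribute termAtt) {
              int matchLength = zzMarkedPos - zzStartRead;
              int newLength = termAtt.length() + matchLength;
              char[] target = termAtt.resizeBuffer(newLength); // grows the backing array if needed, keeps contents
              System.arraycopy(zzBuffer, zzStartRead, target, termAtt.length(), matchLength);
              termAtt.setLength(newLength);
            }
          }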
          
          Steve Rowe added a comment - edited

          I added your change removing CharBuffer.wrap(), Robert, and it appears to have sped it up, though not as much as I would like:

          Operation           recsPerRun       rec/s  elapsedSec
          StandardTokenizer      1262799  647,589.23        1.95
          ICUTokenizer           1268451  526,328.22        2.41
          UAX29Tokenizer         1268451  558,788.99        2.27

          I plan on attempting to rewrite the grammar to eliminate chaining/lookahead this weekend.

          edit: fixed the rec/s, which were from the worst of five instead of the best of five - the elapsedSec, however, were correct.

          Steve Rowe added a comment -

          Attached a patch that removes lookahead/chaining. All tests pass.

          UAX29Tokenizer is now in the same ballpark performance-wise as StandardTokenizer:

          Operation           recsPerRun       rec/s  elapsedSec
          StandardTokenizer      1262799  658,737.06        1.92
          ICUTokenizer           1268451  542,768.94        2.34
          UAX29Tokenizer         1268451  668,661.56        1.90

          Robert Muir added a comment -

          Hi Steven: this is impressive progress!

          What do you think the next steps should be?

          • should we look at any tailorings to this? The first thing that comes to mind is full-width forms, which have no WordBreak property
          • is it simple, or would it be messy, to apply this to the existing grammar (English/EuroTokenizer)? Another way to say it: is it possible for
            English/EuroTokenizer (StandardTokenizer today) to instead be a tailoring of UAX#29, for companies, acronyms, etc., such that if it encounters,
            say, some Hindi or Thai text it will behave better?

          Steve Rowe added a comment -

          should we look at any tailorings to this? The first thing that comes to mind is full-width forms, which have no WordBreak property

          Looks like Latin full-width letters are included (from http://www.unicode.org/Public/5.2.0/ucd/auxiliary/WordBreakProperty.txt):

          FF21..FF3A ; ALetter # L& [26] FULLWIDTH LATIN CAPITAL LETTER A..FULLWIDTH LATIN CAPITAL LETTER Z
          FF41..FF5A ; ALetter # L& [26] FULLWIDTH LATIN SMALL LETTER A..FULLWIDTH LATIN SMALL LETTER Z

          But as you mention in a code comment in TestICUTokenizer, there are no full-width WordBreak:Numeric characters, so we could just add these to the {NumericEx} macro, I think.

          Was there anything else you were thinking of?

          is it simple, or would it be messy, to apply this to the existing grammar (English/EuroTokenizer)? Another way to say it: is it possible for English/EuroTokenizer (StandardTokenizer today) to instead be a tailoring of UAX#29, for companies, acronyms, etc., such that if it encounters, say, some Hindi or Thai text it will behave better?

          Not sure about difficulty level, but it should be possible.

          Naming will require some thought, though - I don't like EnglishTokenizer or EuropeanTokenizer - both seem to exclude valid constituencies.

          What do you think about adding tailorings for Thai, Lao, Myanmar, Chinese, and Japanese? (Are there others like these that aren't well served by UAX#29 without customizations?)

          I'm thinking of leaving UAX29Tokenizer as-is, and adding tailorings as separate classes - what do you think?

          Steve Rowe added a comment -

          One other thing, Robert: what do you think of adding URL tokenization?

          I'm not sure whether it's more useful to have the domain and path components separately tokenized. But maybe if someone wants that, they could add a filter to decompose?

          It would be impossible to do post-tokenization composition to get back the original URL, however, so I'm leaning toward adding URL tokenization.

          Robert Muir added a comment -

          But as you mention in a code comment in TestICUTokenizer, there are no full-width WordBreak:Numeric characters, so we could just add these to the {NumericEx} macro, I think.

          Was there anything else you were thinking of?

          No, that's it!

          Naming will require some thought, though - I don't like EnglishTokenizer or EuropeanTokenizer - both seem to exclude valid constituencies.

          What valid constituencies do you refer to? In general, the acronym/company/possessive stuff here is very English/Euro-specific.
          Bugs in JIRA get opened if it doesn't do this stuff right on English, but it doesn't even work at all for a lot of languages.
          Personally I think it's great to rip this stuff out of what should be a "default" language-independent tokenizer based on
          standards (StandardTokenizer), and put it into the language-specific package where it belongs. Otherwise we have to
          worry about these sorts of things overriding and screwing up UAX#29 rules for words in real languages.

          What do you think about adding tailorings for Thai, Lao, Myanmar, Chinese, and Japanese? (Are there others like these that aren't well served by UAX#29 without customizations?)

          It gets a little tricky: we should be careful about how we interpret what is "reasonable" for a language-independent default tokenizer.
          I think it's "enough" to output the best indexing unit that is possible and relatively unambiguous to identify. I think this is a shortcut
          we can make, because we are trying to tokenize things for information retrieval, not for other purposes. The approach for Lao,
          Myanmar, Khmer, CJK, etc. in ICUTokenizer is to just output syllables as the indexing unit, since words are ambiguous. Thai is based
          on words, not syllables, in ICUTokenizer, which is inconsistent with this, but we get that for free, so it's just a laziness thing.

          By the way: none of those syllable-grammars in ICUTokenizer used chained rules, so you are welcome to steal what you want!

          I'm thinking of leaving UAX29Tokenizer as-is, and adding tailorings as separate classes - what do you think?

          Well, either way I again strongly feel this logic should be tied into "Standard" tokenizer, so that it has better Unicode behavior. I think
          it makes sense for us to have a reasonable, language-independent, standards-based tokenizer that works well for most languages.
          I think it also makes sense to have the English/Euro-centric stuff that's language-specific sitting in the analysis.en package, just like we
          do with other languages.

          Robert Muir added a comment -

          One other thing, Robert: what do you think of adding URL tokenization?

          I think I would lean towards not doing this, only because of how complex a URL can be these days. It also
          starts to get a little ambiguous and will likely interfere with other rules (generating a lot of false positives).

          I guess I don't care much either way; if it's strict and standards-based, it probably won't cause any harm.
          But if you start allowing things like HTTP URLs without the http:// being present, it's going to cause some problems.

          Steve Rowe added a comment -

          Naming will require some thought, though - I don't like EnglishTokenizer or EuropeanTokenizer - both seem to exclude valid constituencies.

          What valid constituencies do you refer to?

          Well, we can't call it English/EuropeanTokenizer (maybe EnglishAndEuropeanAnalyzer? seems too long), and calling it either only English or only European seems to leave the other out. Americans, e.g., don't consider themselves European, maybe not even linguistically (however incorrect that might be).

          In general the acronym,company,possessive stuff here are very english/euro-specific.

          Right, I agree. I'm just looking for a name that covers the languages of interest unambiguously. WesternTokenizer? (but "I live east of the Rockies - can I use WesternTokenizer?"...) Maybe EuropeanLanguagesTokenizer? The difficulty as I see it is the messy intersection between political, geographic, and linguistic boundaries.

          Bugs in JIRA get opened if it doesn't do this stuff right on English, but it doesn't even work at all for a lot of languages. Personally I think it's great to rip this stuff out of what should be a "default" language-independent tokenizer based on standards (StandardTokenizer), and put it into the language-specific package where it belongs. Otherwise we have to worry about these sorts of things overriding and screwing up UAX#29 rules for words in real languages.

          I assume you don't mean to say that English and European languages are not real languages.

          What do you think about adding tailorings for Thai, Lao, Myanmar, Chinese, and Japanese? (Are there others like these that aren't well served by UAX#29 without customizations?)

          It gets a little tricky: we should be careful about how we interpret what is "reasonable" for a language-independent default tokenizer. I think it's "enough" to output the best indexing unit that is possible and relatively unambiguous to identify. I think this is a shortcut we can make, because we are trying to tokenize things for information retrieval, not for other purposes. The approach for Lao, Myanmar, Khmer, CJK, etc. in ICUTokenizer is to just output syllables as the indexing unit, since words are ambiguous. Thai is based on words, not syllables, in ICUTokenizer, which is inconsistent with this, but we get that for free, so it's just a laziness thing.

          I think that StandardTokenizer should contain tailorings for CJK, Thai, Lao, Myanmar, and Khmer, then - it should be able to do reasonable things for all languages/scripts, to the greatest extent possible.

          The English/European tokenizer can then extend StandardTokenizer (conceptually, not in the Java sense).

          I'm thinking of leaving UAX29Tokenizer as-is, and adding tailorings as separate classes - what do you think?

          Well, either way I again strongly feel this logic should be tied into "Standard" tokenizer, so that it has better Unicode behavior. I think it makes sense for us to have a reasonable, language-independent, standards-based tokenizer that works well for most languages. I think it also makes sense to have the English/Euro-centric stuff that's language-specific sitting in the analysis.en package, just like we do with other languages.

          I agree that stuff like giving "O'Reilly's" the <APOSTROPHE> type, to enable so-called StandardFilter to strip out the trailing /'s/, is stupid for all non-English languages.

          It might be confusing, though, for a (e.g.) Greek user to have to go look at the analysis.en package to get reasonable performance for her language.

          Maybe an EnglishTokenizer, and separately a EuropeanAnalyzer? Is that what you've been driving at all along??? (Silly me.... Sigh.)

          Steve Rowe added a comment -

          What do you think about adding tailorings for Thai, Lao, Myanmar, Chinese, and Japanese?

          By the way: none of those syllable-grammars in ICUTokenizer used chained rules, so you are welcome to steal what you want!

          Thanks, I will! Of course now that you've given permission, it won't be as much fun...

          Steve Rowe added a comment -

          One other thing, Robert: what do you think of adding URL tokenization?

          I think I would lean towards not doing this, only because of how complex a URL can be these days. It also starts to get a little ambiguous and will likely interfere with other rules (generating a lot of false positives).

          I have written standards-based URL tokenization routines in the past. I agree it's very complex, but I know it's do-able.

          Do you have some examples of false positives? I'd like to add tests for them.

          I guess I don't care much either way; if it's strict and standards-based, it probably won't cause any harm. But if you start allowing things like HTTP URLs without the http:// being present, it's going to cause some problems.

          Yup, I would only accept strictly correct URLs.

          Now that international TLDs are a reality, it would be cool to be able to identify them.

          Robert Muir added a comment -

          I assume you don't mean to say that English and European languages are not real languages.

          I think the heuristics I am talking about that are in StandardTokenizer today, which don't really even work*,
          shouldn't have a negative effect on other languages, that's all.

          I agree that stuff like giving "O'Reilly's" the <APOSTROPHE> type, to enable so-called StandardFilter to strip out the trailing /'s/, is stupid for all non-English languages.

          It might be confusing, though, for a (e.g.) Greek user to have to go look at the analysis.en package to get reasonable performance for her language.

          FYI, GreekAnalyzer didn't even use this stuff until 3.1 (it omitted StandardFilter).

          But I don't think it matters where we put the "western" tokenizer, as long as it's not StandardTokenizer.
          Honestly, I don't really even care too much about the stuff it does; I don't consider it very important, nor very
          accurate, only the source of many JIRA bugs* and hassle and confusion (invalidAcronym etc.).
          Just seems to be more trouble than it's worth.

          * LUCENE-1438, LUCENE-2244, LUCENE-1787, LUCENE-1403, LUCENE-1100, LUCENE-1556, LUCENE-571, LUCENE-34, LUCENE-1068 - I stopped at this point; I think this is enough examples.

          Robert Muir added a comment -

          Yup, I would only accept strictly correct URLs.

          Now that international TLDs are a reality, it would be cool to be able to identify them.

          +1. This is, in my opinion, the way such things in StandardTokenizer should work.
          Perhaps too strict for some folks' tastes, but correct!

          Marvin Humphrey added a comment -

          I find that it works well to parse URLs as multiple tokens, so long as the
          query parser tokenizes them as phrases rather than individual terms. That
          allows you to hit on URL substrings, so e.g. a document containing
          'http://www.example.com/index.html' is a hit for 'example.com'.

          Happily, no special treatment for URLs also makes for a simpler parser.

          Steve Rowe added a comment -

          Good point, Marvin - indexing URLs makes no sense without query support for them. (Is this a stupid can of worms for me to have opened?) I have used Lucene tokenizers for other things than retrieval (e.g. term vectors as input to other processes), and I suspect I'm not alone. The ability to extract URLs would be very nice.

          Ideally, URL analysis would produce both the full URL as a single token, and as overlapping tokens the hostname, path components, etc. However, I don't think it's a good idea for the tokenizer to output overlapping tokens - I suspect this would break more than a few things.

          A filter that breaks URL type tokens into their components, and then adds them as overlapping tokens, or replaces the full URL with the components, should be easy to write, though.
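
          As a rough illustration of the kind of filter described above, here is a hypothetical sketch (not from the attached patches); the "<URL>" type name and the decomposition choices are assumptions:

          import java.io.IOException;
          import java.net.URI;
          import java.net.URISyntaxException;
          import java.util.ArrayDeque;
          import java.util.Deque;
          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
          import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
          import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

          // Hypothetical sketch: for tokens typed "<URL>", also emit the host and path
          // segments as overlapping tokens (position increment 0), keeping the full URL.
          public final class URLDecompositionFilter extends TokenFilter {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
            private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
            private final Deque<String> pending = new ArrayDeque<String>();

            public URLDecompositionFilter(TokenStream input) {
              super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
              if (!pending.isEmpty()) {
                termAtt.setEmpty().append(pending.removeFirst()); // emit a queued component
                posIncAtt.setPositionIncrement(0);                // overlap the original URL token
                return true;
              }
              if (!input.incrementToken()) {
                return false;
              }
              if ("<URL>".equals(typeAtt.type())) {
                try {
                  URI uri = new URI(termAtt.toString());
                  if (uri.getHost() != null) {
                    pending.add(uri.getHost());                   // e.g. www.example.com
                  }
                  if (uri.getPath() != null) {
                    for (String segment : uri.getPath().split("/")) {
                      if (segment.length() > 0) {
                        pending.add(segment);                     // e.g. index.html
                      }
                    }
                  }
                } catch (URISyntaxException e) {
                  // not parseable as a URI: pass the token through unchanged
                }
              }
              return true;                                        // the full URL stays as a token
            }

            @Override
            public void reset() throws IOException {
              super.reset();
              pending.clear();
            }
          }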

          Robert Muir added a comment -

          A filter that breaks URL type tokens into their components, and then adds them as overlapping tokens, or replaces the full URL with the components, should be easy to write, though.

          Not sure; for this to really work for non-English text, it should recognize and normalize Punycode representations of internationalized domain names, etc.

          So while it's a good idea, maybe it is a can of worms, and better to leave it alone for now?

          Steve Rowe added a comment -

          A filter that breaks URL type tokens into their components, and then adds them as overlapping tokens, or replaces the full URL with the components, should be easy to write, though.

          Not sure; for this to really work for non-English text, it should recognize and normalize Punycode representations of internationalized domain names, etc.

          So while it's a good idea, maybe it is a can of worms, and better to leave it alone for now?

          Do you mean URL-as-token should not be attempted now? Or just this URL-breaking filter?

          Robert Muir added a comment -

          Do you mean URL-as-token should not be attempted now? Or just this URL-breaking filter?

          We can always add tailorings later, as Uwe has implemented Version-based support.

          Personally I see no problems with this patch, and I think we should look at tying this in as-is as the new StandardTokenizer, still backwards compatible thanks to Version support (we can just invoke EnglishTokenizerImpl in that case).

I still want to rip StandardTokenizer out of Lucene core and into modules. I think that's not too far away and it's probably better to do this afterwards, but we can do it before that time if you want; doesn't matter to me.

It will be great to have StandardTokenizer working for non-European languages out of the box!

          Steve Rowe added a comment -

I think UAX29Tokenizer should remain as-is, except that I think there are some valid letter chars (Lao/Myanmar, I think) that are being dropped rather than returned as singletons, as CJ chars are now. I need to augment the tests and make sure that valid word/number chars are not being dropped. Also, I want to add full-width numeric chars to the {NumericEx} macro.

          A separate replacement StandardTokenizer class should have standards-based email and url tokenization - the current StandardTokenizer gets part of the way there, but doesn't support some valid emails, and while it recognizes host/domain names, it doesn't recognize full URLs. I want to get this done before anything in this issue is committed.

Then (after this issue is committed), in separate issues, we can add EnglishTokenizer (for things like acronyms and maybe removing possessives (current StandardFilter)), and then as needed, other language-specific tokenizers.

I still want to rip StandardTokenizer out of Lucene core and into modules. I think that's not too far away and it's probably better to do this afterwards, but we can do it before that time if you want; doesn't matter to me.

          I'll finish the UAX29Tokenizer fixes this weekend, but I think it'll take me a week or so to get the URL/email tokenization in place.

          Steve Rowe added a comment -

          Currently in StandardTokenizer there is a hack to allow contiguous Thai chars to be sent in a block to the ThaiWordFilter, which then uses the JDK BreakIterator to generate words.

          Robert, were you thinking of not supporting that in the StandardTokenizer replacement in the short term?

          Robert Muir added a comment -

          Currently in StandardTokenizer there is a hack to allow contiguous Thai chars to be sent in a block to the ThaiWordFilter, which then uses the JDK BreakIterator to generate words.
          Robert, were you thinking of not supporting that in the StandardTokenizer replacement in the short term?

          You don't need any special support.

I don't know how this hack found its way in, but from a Thai tokenization perspective the only thing it is doing is preventing StandardTokenizer from splitting Thai on non-spacing marks (like it does wrongly for other languages).

          So UAX#29 itself is the fix...

          Robert Muir added a comment -

          except that I think there are some valid letter chars (Lao/Myanmar, I think) that are being dropped rather than returned as singletons

          Do you have any examples?

          Steve Rowe added a comment -

          except that I think there are some valid letter chars (Lao/Myanmar, I think) that are being dropped rather than returned as singletons

          Do you have any examples?

          I imported your tests from TestICUTokenizer, but I left out Lao, Myanmar and Thai because I didn't plan on adding tailorings like those you put in for ICUTokenizer. However, I think Lao had zero tokens output, so if you just import the Lao test from TestICUTokenizer you should see the issue.

          Steve Rowe added a comment -

          Currently in StandardTokenizer there is a hack to allow contiguous Thai chars to be sent in a block to the ThaiWordFilter, which then uses the JDK BreakIterator to generate words.

          Robert, were you thinking of not supporting that in the StandardTokenizer replacement in the short term?

I don't know how this hack found its way in, but from a Thai tokenization perspective the only thing it is doing is preventing StandardTokenizer from splitting Thai on non-spacing marks (like it does wrongly for other languages).

          So UAX#29 itself is the fix...

          AFAICT, UAX#29 would output individual Thai chars, just like CJ. Is that appropriate?

          Robert Muir added a comment -

          However, I think Lao had zero tokens output, so if you just import the Lao test from TestICUTokenizer you should see the issue.

OK, I will take a look. The algorithm there has some handling for incorrectly ordered Unicode, for example combining characters before the base form when they should be after... so it might be no problem at all.

          Robert Muir added a comment -

          AFAICT, UAX#29 would output individual Thai chars, just like CJ. Is that appropriate?

What is a Thai character? According to the standard, it should be outputting phrases, as there is nothing to delimit them... you can see this by pasting some text into http://unicode.org/cldr/utility/breaks.jsp

          Steve Rowe added a comment -

          AFAICT, UAX#29 would output individual Thai chars, just like CJ. Is that appropriate?

What is a Thai character? According to the standard, it should be outputting phrases, as there is nothing to delimit them... you can see this by pasting some text into http://unicode.org/cldr/utility/breaks.jsp

          Yeah, your Thai text "การที่ได้ต้องแสดงว่างานดี. แล้วเธอจะไปไหน? ๑๒๓๔" breaks at space and punctuation and nowhere else. This test should be put back into TestUAX29Tokenizer with the appropriate expected output.

          Robert Muir added a comment -

Hmm, I ran some tests; I think I see your problem.

          I tried this:

            public void testThai() throws Exception {
              assertAnalyzesTo(a, "ภาษาไทย", new String[] { "ภาษาไทย" });
            }
          

The reason you get something different from the Unicode site is that (recently?) these have [:WordBreak=Other:].
Instead, anything that needs a dictionary or similar handling is identified by [:Line_Break=Complex_Context:].
You can see this mentioned in the standard:

          In particular, the characters with the Line_Break property values of Contingent_Break (CB), 
          Complex_Context (SA/South East Asian), and XX (Unknown) are assigned word boundary property 
          values based on criteria outside of the scope of this annex. 
          

In ICU, I noticed the default rules do this:
          $dictionary = [:LineBreak = Complex_Context:];
          $dictionary $dictionary

          (so they just stick together with this chained rule)

          Robert Muir added a comment -

          Yeah, your Thai text "การที่ได้ต้องแสดงว่างานดี. แล้วเธอจะไปไหน? ๑๒๓๔" breaks at space and punctuation and nowhere else. This test should be put back into TestUAX29Tokenizer with the appropriate expected output.

          But why does it fail for my test (listed above) with only a single thai phrase (nothing is output)?
          Do you think it is because of Complex_Context or is there an off-by-one bug somehow?

          Steve Rowe added a comment -

          Yeah, your Thai text "การที่ได้ต้องแสดงว่างานดี. แล้วเธอจะไปไหน? ๑๒๓๔" breaks at space and punctuation and nowhere else. This test should be put back into TestUAX29Tokenizer with the appropriate expected output.

          But why does it fail for my test (listed above) with only a single thai phrase (nothing is output)?
          Do you think it is because of Complex_Context or is there an off-by-one bug somehow?

Definitely Complex_Context. I'll add that in, and this should address Thai, Myanmar, Khmer, Tai Le, etc.
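For reference, a sketch of the kind of test to restore (reusing assertAnalyzesTo and the analyzer a from the existing tests; the expected tokens simply assume breaks at spaces and punctuation only, as described above):

  public void testThaiPhrases() throws Exception {
    // with [:Line_Break = Complex_Context:] runs kept together, breaks occur only
    // at spaces and punctuation; the Thai digits form a single numeric token
    assertAnalyzesTo(a, "การที่ได้ต้องแสดงว่างานดี. แล้วเธอจะไปไหน? ๑๒๓๔",
        new String[] { "การที่ได้ต้องแสดงว่างานดี", "แล้วเธอจะไปไหน", "๑๒๓๔" });
  }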

          Steve Rowe added a comment - - edited

          New patch addressing the following issues:

          • On #lucene-dev, Uwe mentioned that methods in the generated scanner should be (package) private, since unlike the current StandardTokenizer, UAX29Tokenizer is not hidden behind a facade class. I added JFlex's %apiprivate option to fix this issue.
          • Thai, Lao, Khmer, Myanmar and other scripts' characters are now kept together, like the ICU UAX#29 implementation, using rule [:Line_Break = Complex_Context:]+.
          • Added the Thai test back from Robert's TestICUTokenizer.
• Added full-width numeric characters to the {NumericEx} macro, so that they can be appropriately tokenized, just like full-width alpha characters are now.

          I couldn't find any suitable Lao test text (mostly because I don't know Lao at all), so I left out the Lao test in TestICUTokenizer, because Robert mentioned on #lucene that its characters are not in logical order.

          edit Complex_Content --> Complex_Context
          edit #2 Added bullet about full-width numerics issue

          Robert Muir added a comment -

          I couldn't find any suitable Lao test text (mostly because I don't know Lao at all), so I left out the Lao test in TestICUTokenizer, because Robert mentioned on #lucene that its characters are not in logical order.

          Only some of my icu tests contain "screwed up lao".

          But you should be able to use "good text" and it should do the right thing.
          Here's a test

          assertAnalyzesTo(a, "ສາທາລະນະລັດ ປະຊາທິປະໄຕ ປະຊາຊົນລາວ", 
          new String[] { "ສາທາລະນະລັດ", "ປະຊາທິປະໄຕ", "ປະຊາຊົນລາວ" });
          
          Steve Rowe added a comment -

          New patch:

          • added Robert's Lao test (thanks, Robert).
          • added a javadoc comment about UAX29Tokenizer not handling supplementary characters (thanks to Uwe for bringing this up on #lucene), with a pointer to Robert's ICUTokenizer.
          Steve Rowe added a comment -

          This patch contains the benchmarking implementation I've been using. I'm pretty sure we don't want this stuff in Lucene, so I'm including it here only for reproducibility by others. I have hardcoded absolute paths to the ICU4J jar and the contrib/icu jar in the script I use to run the benchmark (lucene/contrib/benchmark/scripts/compare.uax29.analyzers.sh), so if anybody tries to run this stuff, they will have to first modify that script.

          On #lucene, Robert suggested comparing the performance of the straight ICU4J RBBI against UAX29Tokenizer, so I took his ICUTokenizer and associated classes, stripped out the script-detection logic, and made something I named RBBITokenizer, which is included in this patch.
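For context, here is a rough standalone sketch of the kind of forward word iteration over ICU4J's rule-based break iterator that such a tokenizer wraps (illustrative only, not the RBBITokenizer from the patch; the ULocale.ROOT word instance and the WORD_NONE rule-status check are assumptions):

  import com.ibm.icu.text.BreakIterator;
  import com.ibm.icu.text.RuleBasedBreakIterator;
  import com.ibm.icu.util.ULocale;

  public class RBBIWordDemo {
    public static void main(String[] args) {
      String text = "การที่ได้ต้องแสดงว่างานดี test@example.com 1234";
      RuleBasedBreakIterator rbbi =
          (RuleBasedBreakIterator) BreakIterator.getWordInstance(ULocale.ROOT);
      rbbi.setText(text);
      int start = rbbi.first();
      for (int end = rbbi.next(); end != BreakIterator.DONE; start = end, end = rbbi.next()) {
        // a rule status of WORD_NONE marks runs of spaces/punctuation between words
        if (rbbi.getRuleStatus() != BreakIterator.WORD_NONE) {
          System.out.println(text.substring(start, end));
        }
      }
    }
  }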

          To run the benchmark, you have to first run "ant jar" in lucene/ to produce the lucene core jar, and then again in lucene/contrib/icu/. Then in contrib/benchmark/, run scripts/compare.uax29.analyzers.sh.

          Here are the results on my machine (Sun JDK 1.6.0_13; Windows Vista/Cygwin; best of five):

Operation          recsPerRun  rec/s       elapsedSec
ICUTokenizer       1268451     548,638.00  2.31
RBBITokenizer      1268451     568,047.94  2.23
StandardTokenizer  1262799     644,614.06  1.96
UAX29Tokenizer     1268451     640,631.81  1.98
          Robert Muir added a comment -

          Here are the results on my machine (Sun JDK 1.6.0_13; Windows Vista/Cygwin; best of five):

This is really cool, I think it's a great benchmark to know; I played with it and saw similar results.

          • For Lucene Tokenizer-ish (forward-iteration) purposes, JFlex is quite a bit faster than RBBI for unicode segmentation.
          • Supporting unicode segmentation in StandardTokenizer doesn't slow it down in comparison to the current implementation.
• The script detection/delegation in ICU doesn't really cost that tokenizer much, though the benchmark is Reuters, and it cheats for Latin-1 (see the bottom of ScriptIterator.java).
          Steve Rowe added a comment -

Robert, what do you think of <SOUTHEAST_ASIAN> as the token type for Complex_Context runs (Thai, Khmer, Lao, etc.)?

          Steve Rowe added a comment -

          Sequences of South East Asian scripts are now assigned term type <SOUTHEAST_ASIAN> by UAX29Tokenizer. I think UAX29Tokenizer is now a complete untailored UAX#29 implementation.

          For the future StandardTokenizer replacement, I plan on making a copy of the UAX29Tokenizer grammar and adding email/URL tokenization, and maybe Southeast Asian tailorings converted from those in ICUTokenizer.

          Steve Rowe added a comment -

          As of r591, JFlex now has code in the generated yyreset() method to resize the internal scan buffer (zzBuffer) back down to its initial size if it has grown. This is exactly the same workaround code in the reset() method in the UAX29Tokenizer grammar.

          This patch just removes the scan buffer size check and reallocation code from reset() in the .jflex file, as well as the .java file generated with r591 JFlex.
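For reference, a self-contained sketch of the shrink-on-reset behavior in question (it mimics the zzBuffer / ZZ_BUFFERSIZE conventions of JFlex-generated scanners; it is an illustration, not the generated code or the patch contents):

  import java.io.Reader;
  import java.io.StringReader;

  public class BufferShrinkDemo {
    private static final int ZZ_BUFFERSIZE = 16384; // JFlex's default initial buffer size
    private char[] zzBuffer = new char[ZZ_BUFFERSIZE];

    void reset(Reader r) {
      // the workaround: shrink back to the default size if a huge token grew the buffer;
      // as of r591, JFlex emits equivalent code inside the generated yyreset() itself
      if (zzBuffer.length > ZZ_BUFFERSIZE) {
        zzBuffer = new char[ZZ_BUFFERSIZE];
      }
      // (a real scanner would also reset its reader and scanning state here)
    }

    public static void main(String[] args) {
      BufferShrinkDemo scanner = new BufferShrinkDemo();
      scanner.zzBuffer = new char[4 * ZZ_BUFFERSIZE]; // pretend a very large token grew it
      scanner.reset(new StringReader("next document"));
      System.out.println("buffer length after reset: " + scanner.zzBuffer.length); // 16384
    }
  }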

          Robert Muir added a comment -

          This patch just removes the scan buffer size check and reallocation code from reset() in the .jflex file, as well as the .java file generated with r591 JFlex.

We have this code in our existing StandardTokenizer .jflex files; should we open an issue and fix these (we would have to ensure that we use a JFlex > r591 for generation)?

Additionally, shouldn't we regen WikipediaTokenizer etc. too? I noticed it doesn't even have the hack in its .jflex file.

          Uwe Schindler added a comment - - edited

Yeah, we should regen all jflex files when patching this (ant jflex does this automatically, so we don't need to care). Removing the hack from StandardTokenizer's jflex file should be done in an issue, but it also does not hurt if the hack stays in code.

Checking the jflex version is hard to do; I'll think about it, maybe there is an Ant trick. Is the version noted somewhere in a class file as a constant?

I think we should simply reopen LUCENE-2384 (it's part of 3.x and trunk)

          Steve Rowe added a comment -

Yeah, we should regen all jflex files when patching this (ant jflex does this automatically, so we don't need to care). Removing the hack from StandardTokenizer's jflex file should be done in an issue, but it also does not hurt if the hack stays in code.

          Agreed. I was thinking since Robert is moving StandardTokenizer that the regen could wait until afterward.

Checking the jflex version is hard to do; I'll think about it, maybe there is an Ant trick. Is the version noted somewhere in a class file as a constant?

          Release version is, I think, but we're using an unreleased version ATM. Hmm, for the SVN checkout, maybe the .svn/entries file could be checked or something? If we go that route (and I think it's probably not a good idea), we should instead maybe be "svn up"ing the checkout?

I think we should simply reopen LUCENE-2384 (it's part of 3.x and trunk)

          +1

          DM Smith added a comment -

          Naming will require some thought, though - I don't like EnglishTokenizer or EuropeanTokenizer - both seem to exclude valid constituencies.

          What valid constituencies do you refer to?

          Well, we can't call it English/EuropeanTokenizer (maybe EnglishAndEuropeanAnalyzer? seems too long), and calling it either only English or only European seems to leave the other out. Americans, e.g., don't consider themselves European, maybe not even linguistically (however incorrect that might be).

          Tongue in cheek:
          By and large, these are Romance languages (i.e. latin derivatives). And the constructs that are being considered for special processing for the most part are fairly recent additions to the languages. So how about ModernRomanceAnalyzer?

          Steve Rowe added a comment -

My daughter likes the Lady Gaga song "Bad Romance" - why not BadRomanceAnalyzer? Advertising slogans: "It slices your text when it's supposed to dice it, but it always apologizes afterward - how can you stay mad?"; "Who knew that analysis could have such catchy lyrics?"

          Steve Rowe added a comment -

          Updated to trunk. Tests pass.

This patch removes the jflex-* target dependencies on init, since init builds Lucene, which isn't a necessity prior to running JFlex.

          Steve Rowe added a comment -

          Maven plugin including a mojo that generates a file containing a JFlex macro that accepts all valid ASCII top-level domains (TLDs), by downloading the IANA Root Zone Database, parsing the HTML file, and outputting ASCIITLD.jflex-macro into the analysis/common/src/java/org/apache/lucene/analysis/standard/ source directory; this file is also included in the patch.

          To run the Maven plugin, first run "mvn install" from the lucene-buildhelper-maven-plugin/ directory, then from the src/java/org/apache/lucene/analysis/standard/ directory, run the following command:

          mvn org.apache.lucene:lucene-buildhelper-maven-plugin:generate-jflex-tld-macros
          

          Execution is not yet hooked into build.xml, but this goal should run before JFlex runs.

          Uwe Schindler added a comment - - edited

          Hi Steven,

          looks cool, I have some suggestions:

• Must it be a Maven plugin? From what I see, the same code could be done as a simple Java class with main() like Robert's ICU converter. The external dependency on httpclient can be replaced by plain java.net.HttpURLConnection and the URL itself (you can even set the no-cache directives). It's much easier from Ant to invoke a Java method as a build step. So why not refactor a little bit to use a main() method that accepts the target directory?
• You use the HTML root zone database from IANA. The format of this file is hard to parse and may change suddenly. BIND administrators know that there is also the root zone file available for BIND in the standardized named format at http://www.internic.net/zones/root.zone (ASCII only, as DNS is ASCII only). You just have to use all rows that are not comments and contain "NS" as second token (a minimal parsing sketch follows at the end of this comment). The nameservers behind are not used, just use the DNS name before. This should be much easier to do. A python script may also work well.
• You can also write the Last-Modified header of the HTTP response (HttpURLConnection.getLastModified()) into the generated file.
• The database only contains the punycode-encoded DNS names. But users use the non-encoded variants, so you should decode punycode, too [we need ICU for that :( ], and create patterns for those, too.
• About changes in analyzer output because of regeneration: this should not be a problem, as IANA only adds new zones to the file and very seldom removes any (like the old Yugoslavian zones). As e-mails and web addresses should not appear in tokenized text before their TLDs are in the zone file, it's no problem that they are later suddenly marked as "URL/e-mail" (as they could not appear before). So in my opinion we can update the zone database even in minor Lucene releases without breaking analyzers.

          Fine idea!
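A minimal sketch of the zone-file parsing suggested in the second bullet above (not a working generator from any patch here; it assumes the usual owner / TTL / class / type / rdata column layout of the zone file and just prints the TLD list instead of writing a JFlex macro):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.util.Locale;
  import java.util.TreeSet;

  public class RootZoneTLDs {
    public static void main(String[] args) throws Exception {
      URL zone = new URL("http://www.internic.net/zones/root.zone");
      TreeSet<String> tlds = new TreeSet<String>();
      BufferedReader in = new BufferedReader(new InputStreamReader(zone.openStream(), "US-ASCII"));
      try {
        String line;
        while ((line = in.readLine()) != null) {
          if (line.startsWith(";")) continue;              // skip comments
          String[] fields = line.split("\\s+");
          // assumed layout: owner TTL class type rdata... ; keep NS delegations only
          if (fields.length > 4 && "IN".equals(fields[2]) && "NS".equals(fields[3])) {
            String owner = fields[0];
            if (owner.endsWith(".")) owner = owner.substring(0, owner.length() - 1);
            if (owner.length() > 0 && owner.indexOf('.') < 0) { // top-level labels only
              tlds.add(owner.toLowerCase(Locale.ENGLISH));
            }
          }
        }
      } finally {
        in.close();
      }
      for (String tld : tlds) {
        System.out.println(tld);
      }
    }
  }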

          Steve Rowe added a comment -

Must it be a Maven plugin? [...] It's much easier from Ant to invoke a Java method as a build step.

          Lucene's build could be converted to Maven, though, and this could be a place for build-related stuff.

          Maven Ant Tasks allows for Ant to call full Maven builds without a Maven installation: http://maven.apache.org/ant-tasks/examples/mvn.html

From what I see, the same code could be done as a simple Java class with main() like Robert's ICU converter. [snip]

          I hadn't seen Robert's ICU converter - I'll take a look.

          A python script may also work well.

          Perl is my scripting language of choice, not Python, but yes, a script would likely do the trick, assuming there are no external (Java) dependencies. (And as you pointed out, HttpComponents, the only dependency of the Maven plugin, does not need to be a dependency.)

You use the HTML root zone database from IANA. The format of this file is hard to parse and may change suddenly. BIND administrators know that there is also the root zone file available for BIND in the standardized named format at http://www.internic.net/zones/root.zone (ASCII only, as DNS is ASCII only).

          I think I'll stick with the HTML version for now - there are no decoded versions of the internationalized TLDs and no descriptive information in the named-format version. I agree the HTML format is not ideal, but it took me just a little while to put together the regexes to parse it; when the format changes, the effort to fix will likely be similarly small.

You can also write the Last-Modified header of the HTTP response (HttpURLConnection.getLastModified()) into the generated file.

          Excellent idea, I searched the HTML page source for this kind of information but it wasn't there.

The database only contains the punycode-encoded DNS names. But users use the non-encoded variants, so you should decode punycode, too [we need ICU for that :( ], and create patterns for those, too.

          I agree. However, I looked into what's required to do internationalized domain names properly, and it's quite complicated. I plan on doing what you suggest eventually, both for TLDs and all other domain labels, but I'd rather finish the ASCII implementation and deal with IRIs in a separate follow-on issue.

About changes in analyzer output because of regeneration: this should not be a problem, as IANA only adds new zones to the file and very seldom removes any (like the old Yugoslavian zones). As e-mails and web addresses should not appear in tokenized text before their TLDs are in the zone file, it's no problem that they are later suddenly marked as "URL/e-mail" (as they could not appear before). So in my opinion we can update the zone database even in minor Lucene releases without breaking analyzers.

          +1

          Uwe Schindler added a comment -

Here is my patch with the TLD-macro generator:

          • Uses zone database from DNS (downloaded)
          • Outputs correct platform dependent newlines, else commits with SVN fail
          • Has no comments
          • Is included into build.xml. Run ant gen-tlds in modules/analysis/common

The resulting macro is almost identical; 4 TLDs are missing, but the file on internic.net is current (see the last-modified date). The comments are not available, of course.

          Uwe Schindler added a comment -

Small update (don't output the lastMod date if internic.net gave none)

          Uwe Schindler added a comment -

          Updated patch.

I had not seen that the previous jflex generator version had a bug: a missing locale in String.toUpperCase() (Turkish i!). This version uses Character.toUpperCase() [non-locale-aware] and also only iterates over tld.charAt() [what was the reason for the strange substring stuff?]. This is fine, as the TLDs only contain [\-A-Za-z0-9] (standard for domain names, and the regex enforces this), so no supplementary chars.
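To illustrate the per-character, locale-independent case handling (a sketch only; the method and the bracketed-pattern output format are illustrative, not the generator's actual code):

  public class CaseInsensitiveTLDMacroDemo {
    // Build a case-insensitive JFlex pattern for one ASCII TLD, e.g. "com" -> "[cC][oO][mM]".
    // Character.toUpperCase/toLowerCase are locale-independent, avoiding the Turkish-i problem
    // that String.toUpperCase() has without an explicit Locale.
    static String caseInsensitivePattern(String tld) {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < tld.length(); i++) {
        char c = tld.charAt(i);
        if (Character.isLetter(c)) {
          sb.append('[').append(Character.toLowerCase(c)).append(Character.toUpperCase(c)).append(']');
        } else {
          sb.append(c);  // digits and '-' are passed through unchanged
        }
      }
      return sb.toString();
    }

    public static void main(String[] args) {
      System.out.println(caseInsensitivePattern("com"));      // [cC][oO][mM]
      System.out.println(caseInsensitivePattern("xn--p1ai")); // [xX][nN]--[pP]1[aA][iI]
    }
  }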

          This patch also creates correct macro (single escaping).

          Steve Rowe added a comment -

          This version uses Character.toUpperCase() [non-locale-aware] and also only iterates over tld.charAt() [what was the reason for the strange substring stuff?].

          I looked for Character.toUpperCase(), didn't find it (no idea why), so went with the strange substring stuff to use the String version instead ...

          I plan on integrating your patch with mine, to make a single one, including a definition for a StandardTokenizer replacement. I have implemented URL, Email and Host rules, just gotta write some tests now.

          Steve Rowe added a comment - - edited

          New patch incorporating Uwe's JFlex TLD macro generation patch (with a few small adjustments), and also including a jflex grammar for a new class: NewStandardTokenizer. This grammar adds recognition of URLs, e-mail addresses, and host names and IP addresses (both v4 and v6) to the UAX29Tokenizer grammar.

          This is a work in progress – testing for http: scheme URLs and e-mail addresses is included, but there is no testing yet for the file:, https:, or ftp: schemes.

          I have dropped the idea of recognizing mailto: URIs, because these seem more complicated than they are worth (mailto: URIs can include multiple email addresses, comments, full email bodies, etc.). E-mail addresses within mailto: URIs should still be recognized.

          WARNING: I had to invoke Ant with a 900MB heap (ANT_OPTS=-Xmx900m ant jflex on Windows Vista, 64 bit Sun JDK 1.5.0_22) in order to allow the JFlex generation process to complete for NewStandardTokenizer; the process also took a minute or two to finish.

          edit: Sun 1. -> Sun JDK 1.5.0_22

          Steve Rowe added a comment -

          URL testing for NewStandardTokenizer is now complete.

          I have dropped the <HOST> token type, since it seems to me that, e.g., both of the following strings should be interpretable as URLs, given that they effectively refer to the same resource (when interpreted in the context of the HTTP URI scheme), and the first is clearly not just a host name:

          example.com/
          
          example.com
          

          Both of the above are now marked by NewStandardTokenizer with type <URL>.
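For illustration, a test sketch of that behavior (assuming the assertAnalyzesTo overload that also checks token types, an analyzer a wrapping NewStandardTokenizer, and that the token text is the full match):

  public void testBareDomainsAreURLs() throws Exception {
    assertAnalyzesTo(a, "example.com/",
        new String[] { "example.com/" }, new String[] { "<URL>" });
    assertAnalyzesTo(a, "example.com",
        new String[] { "example.com" }, new String[] { "<URL>" });
  }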

          NewStandardTokenizer is not quite finished; I plan on stealing Robert's Southeast Asian (Lao, Myanmar, Khmer) syllabification routines from ICUTokenizer and incorporating them into NewStandardTokenizer. Once that's done, I think we can make NewStandardTokenizer the new StandardTokenizer.

          Robert Muir added a comment -

          NewStandardTokenizer is not quite finished; I plan on stealing Robert's Southeast Asian (Lao, Myanmar, Khmer) syllabification routine

          Curious, what is your plan here? Do you plan to somehow "jflex-#include" these into the grammar so that these are longest-matched instead of the Complex_Context rule?

How do you plan to handle the cases where the grammar cannot do forward-only deterministic matching? (At least I don't see how it could, but maybe.) E.g. the Lao cases where some backtracking is needed... and the combining class reordering needed for real-world text?

Curious: what would you plan to index for Thai? Words? A grammar for TCC?

          Also, some of these syllable techniques are probably not very good for search without doing a "shingle" later... in some cases it may perform OK like single ideographs or tibetan syllables do with the grammar you have. For others (Khmer, etc) I think the shingling is likely mandatory since they are really only a bit better than indexing grapheme clusters.

As far as needing punctuation for shingling goes, a similar problem already exists. For example, after tokenizing, some information (punctuation) has been discarded and it's too late to do a nice shingle. Practical cheating/workarounds exist for CJK (you could look at the offset or something and cheat, to figure out that they were adjacent), but for something like Tibetan the type of punctuation itself is important: the tsheg is an unambiguous syllable separator but an ambiguous word separator, while the shad or whitespace is both.

          Here is the paper I brought up at ehatcher's house recently when we were discussing tibetan, that recommends this syllable bigram technique, where the shingling is dependent on the original punctuation: http://terpconnect.umd.edu/~oard/pdf/iral00b.pdf

One alternative for the short term would be to make a TokenFilter that hooks into the ICUTokenizer logic but looks for Complex_Context, or similar. I definitely agree it would be best if StandardTokenizer worked well out of the box without doing something like this.

Finally, I think it's worth considering a lot of this as a special case of a larger problem that affects even English. For a lot of users, punctuation such as the hyphen in English might have some special meaning, and they might want to shingle or do something else in that case too. It's a general problem with token streams that the tokenizer often discards this information and the filters are left with only a partial picture. Some ideas to improve it would be to make use of properties like [:Terminal_Punctuation=Yes:] somehow, or to try to integrate sentence segmentation.

          Steve Rowe added a comment -

          NewStandardTokenizer is not quite finished; I plan on stealing Robert's Southeast Asian (Lao, Myanmar, Khmer) syllabification routine

          Curious, what is your plan here? Do you plan to somehow "jflex-#include" these into the grammar so that these are longest-matched instead of the Complex_Context rule?

          Sorry, I haven't looked at the details yet, but roughly, yes, what you said.

          How to handle the cases where the grammar cannot be forward-only deterministic matching? (at least i don't see how it could be, but maybe). e.g. the lao cases where some backtracking is needed... and the combining class reordering needed for real-world text?

          I was thinking of trying to make regex versions of all of these and, failing that, recognizing chunks that need special handling and dealing with them outside of matching, in methods in the tokenizer class.

          Curious what would you plan to index for Thai, words? a grammar for TCC?

          You had mentioned wanting to make a Thai syllabification routine - I was thinking that either you or I would do this.

          Also, some of these syllable techniques are probably not very good for search without doing a "shingle" later... in some cases it may perform OK like single ideographs or tibetan syllables do with the grammar you have. For others (Khmer, etc) I think the shingling is likely mandatory since they are really only a bit better than indexing grapheme clusters.

          I'm thinking of leaving shingling for later, using the conditional branching filter idea (LUCENE-2470) based on token type.

          As far as needing punctuation for shingling, the similar problem already exists. For example, after tokenizing, some discarding of information (punctuation) has been lost and its too late to do a nice shingle. practical cheating/workarounds exist for CJK (you could look at the offset or something and cheat, to figure out that they were adjacent), but for something like Tibetan the type of punctuation itself is important: the tsheg being unambiguous syllable separator, but ambiguous word separator, but the shad or whitespace being both.

          You're arguing either for in-tokenizer shingling or passing non-tokenized data out of the tokenizer in addition to the tokens. Hmm.

          Here is the paper I brought up at ehatcher's house recently when we were discussing tibetan, that recommends this syllable bigram technique, where the shingling is dependent on the original punctuation: http://terpconnect.umd.edu/~oard/pdf/iral00b.pdf

          Interesting paper. With syllable n-grams (in Tibetan anyway), you trade off (quadrupled) index size for word segmentation, but otherwise, these work equally well.

          One alternative for the short term would be to make a tokenfilter that hooks into the ICUTokenizer logic but looks for Complex_Context, or similar. I definitely agree it would be best if standardtokenizer worked the best out of the box without doing something like this.

          Yeah, I'd rather build it into the new StandardTokenizer.

          Finally, I think it's worth considering a lot of this as a special case of a larger problem that affects even English. For a lot of users, punctuation such as the hyphen in English might have some special meaning and they might want to shingle or do something else in that case too. It's a general problem with tokenstreams that the tokenizer often discards this information and the filters are left with only a partial picture. Some ideas to improve it would be to make use of properties like [:Terminal_Punctuation=Yes:] somehow, or to try to integrate sentence segmentation.

          I don't understand how Sentence segmentation could help?

          One other possibility is to return everything from the tokenizer, marking the non-tokens with an appropriate type, similar to how the ICU tokenizer works. This has the unfortunate side effect of requiring post-tokenization filtering to discard non-tokens.

          Robert Muir added a comment -

          You had mentioned wanting to make a Thai syllabification routine - I was thinking that either you or I would do this.

          OK, this makes sense.

          You're arguing either for in-tokenizer shingling or passing non-tokenized data out of the tokenizer in addition to the tokens. Hmm.

          Or attributes that mark sentence boundaries. or bumped position increments for sentence boundaries (that also prevent phrase searches across sentences). or maybe other ideas.

          Interesting paper. With syllable n-grams (in Tibetan anyway), you trade off (quadrupled) index size for word segmentation, but otherwise, these work equally well.

          Careful, the way they did the measurement only tells us that neither one is absolute shit, but i dont think its clear yet they are equal.
          either way, the argument in the paper is for bigrams (n=2)... how is this quadrupled index size? its just like CJKTokenizer...

          I don't understand how Sentence segmentation could help?

          One other possibility is to return everything from the tokenizer, marking the non-tokens with an appropriate type, similar to how the ICU tokenizer works. This has the unfortunate side effect of requiring post-tokenization filtering to discard non-tokens.

          Right, but it could be attributes or position increments for sentence boundaries too. Then you just wouldn't shingle across missing position increments, and phrase queries wouldn't match across sentence boundaries either.

          In my opinion, the patch here already solves a lot of problems on its own, and I suggest we explore these ideas later (including Thai etc.) in a separate issue. With the patch as-is now, people can use the ThaiWordFilter. If they need support for the other languages, they have ICUTokenizer as a workaround. We could think about how to do the more complex stuff in more general ways (sentence seg., conditional branching, etc).

          In general i'd like to think that UAX#29 sentence segmentation, implemented nicely, would be a cool feature that could help with some of these problems, and maybe other problems too. Perhaps it could be re-used by highlighting etc as well.

          Steve Rowe added a comment -

          Interesting paper. With syllable n-grams (in Tibetan anyway), you trade off (quadrupled) index size for word segmentation, but otherwise, these work equally well.

          Careful, the way they did the measurement only tells us that neither one is absolute shit, but i dont think its clear yet they are equal.
          either way, the argument in the paper is for bigrams (n=2)...

          Yes, you're right - fine-grained performance comparisons are inappropriate here. You've said for other language(s?) that unigram/bigram combo works best - too bad they didn't test that here.

          how is this quadrupled index size? its just like CJKTokenizer...

          From the paper:

          As has been observed in other languages [Miller et al., 2000], ngram indexing resulted in explosive growth in the number of terms with increasing n. The index size for word-based indexing was less than one quarter of that of syllable bigrams.

          In general i'd like to think that UAX#29 sentence segmentation, implemented nicely, would be a cool feature that could help with some of these problems, and maybe other problems too.

          You mentioned it would be useful to eliminate phrase matches across sentence boundaries - what other problems would it solve?

          Robert Muir added a comment -

          Yes, you're right - fine-grained performance comparisons are inappropriate here. You've said for other language(s?) that unigram/bigram combo works best - too bad they didn't test that here.

          agreed!

          You mentioned it would be useful to eliminate phrase matches across sentence boundaries - what other problems would it solve?

          in addition to inhibiting phrase matches, the sentence boundaries themselves (however we would represent them) could be used by later filters: inhibiting shingle generation, inhibiting multi-word synonym generation, ... I am sure there are other ways too that don't immediately come to mind.

          at the moment the cleanest way I can think of doing this would be to bump the position increment, but who knows. There doesn't seem to be a de facto way of doing this, since nothing in Lucene out of the box really implements or uses sentence boundaries, which is sad!

          Robert Muir added a comment -

          by the way Steven, one alternative idea i had before for this was to have a jflex or rbbi-powered charfilter for sentences.

          you could provide it with string constants in the ctor to replace sentence boundaries; to add position increments, just add these strings to your stopfilter.

          the advantage to this would be that you could use it with other tokenizers by using this special token (i guess just be careful which one you use!).

          sorry to stray off topic a bit with this, but I think it's sorta a missing piece that's relevant and becomes more important with ComplexContext

          Steve Rowe added a comment -

          I'm looking at UAX#29 sentence breaking rules, and this one looks suspicious to me:

          Break after paragraph separators.
          SB4. Sep | CR | LF ÷

          Lots of text I look at includes newlines that don't indicate paragraph boundaries. In the implementations of sentence breaking that I've done, I always use double newlines for this purpose. Thoughts?

          I'm thinking that it would be difficult to (correctly) incorporate sentence-boundary rules directly into the existing word-boundary rules. Maybe a two-pass arrangement, where the sentence-boundary detector passes sentences as complete inputs to a word-boundary detector?

          Robert Muir added a comment -

          Lots of text I look at includes newlines that don't indicate paragraph boundaries.

          What is this text? Some manually-wrapped text?

          I mean, i guess the whole point is a reasonable default, yet tailorable with a grammar.

          Maybe a two-pass arrangement, where the sentence-boundary detector passes sentences as complete inputs to a word-boundary detector?

          Well this is why I liked the charfilter idea. Then it's separate and optional, and you can do what you want with the sentence boundary indicator strings.

          Steve Rowe added a comment -

          by the way Steven, one alternative idea i had before for this was to have a jflex or rbbi-powered charfilter for sentences.

          nice idea - composition becomes simpler.

          you could provide it with string constants in the ctor to replace sentence boundaries; to add position increments, just add these strings to your stopfilter.

          the advantage to this would be that you could use it with other tokenizers by using this special token (i guess just be careful which one you use!).

          Why not just insert U+2029 PARAGRAPH SEPARATOR (PS)? Then it will also trigger word boundaries, and tokenizers that care about appropriately responding to it can specialize for just this one, instead of having to also be aware of whatever it was that the user specified in the ctor to the charfilter.

          sorry to stray off topic a bit with this, but I think it's sorta a missing piece that's relevant and becomes more important with ComplexContext

          I like where this is going - toward a solid general solution.

          Lots of text I look at includes newlines that don't indicate paragraph boundaries.

          What is this text? Some manually-wrapped text?

          Email. Source code. TREC collections (I think - don't have any right here with me). And yes, manually generated and wrapped text. Isn't most text manually generated?

          Robert Muir added a comment -

          Why not just insert U+2029 PARAGRAPH SEPARATOR (PS)?

          I would argue because it's a sentence boundary, not a paragraph boundary.

          But i thought it would be best to just allow the user to specify the replacement string (which could be just U+2029 if you want).
          They could also use "<boundary/>" or something entirely different.

          and tokenizers that care about appropriately responding to it can specialize for just this one, instead of having to also be aware of whatever it was that the user specified in the ctor to the charfilter.

          well, by default these filters could just work with position increments appropriately, and you add whatever string you use to a stopword filter to create these position increments.
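
          To make that concrete, here is a very rough sketch of the kind of chain I have in mind, assuming the sentence-boundary charfilter already exists and has rewritten each boundary to an arbitrary marker string like "zzsentboundzz". Everything here (the class name, the marker, the 3.1-style package names) is just a placeholder, not something in the codebase:

          import java.io.Reader;
          import java.util.Arrays;
          import java.util.HashSet;
          import java.util.Set;
          import org.apache.lucene.analysis.StopFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.Tokenizer;
          import org.apache.lucene.analysis.standard.StandardTokenizer;
          import org.apache.lucene.util.Version;

          public class SentenceGapSketch {
            // 'reader' is assumed to already contain the marker string at each sentence
            // boundary (the hypothetical charfilter's job). StandardTokenizer emits the
            // marker as an ordinary token, and StopFilter then drops it, leaving a
            // position increment gap that shingle filters and phrase queries respect.
            public static TokenStream sentenceGappedStream(Reader reader) {
              Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_31, reader);
              Set<String> markers = new HashSet<String>(Arrays.asList("zzsentboundzz"));
              return new StopFilter(Version.LUCENE_31, tokenizer, markers);
            }
          }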

          I like where this is going - toward a solid general solution.

          Good, if we get some sorta plan we should open a new JIRA issue i think.

          Email. Source code. TREC collections (I think - don't have any right here with me). And yes, manually generated and wrapped text. Isn't most text manually generated?

          Right, but Unicode encodes characters. So things like text wrapping belong, in my opinion, in the display component, not in a character encoding model... most modern text in webpages etc. isn't manually wrapped like this.

          I think our default implementation should be for Unicode text. For the non-Unicode text you speak of, you can just tailor the default rules.

          Steve Rowe added a comment -

          Ok, so for sentence boundaries, we're talking about a separate composable implementation.

          What, then, will the replacement for StandardAnalyzer be? This issue needs to include a substitute definition when replacing StandardTokenizer.

          Which of these should be included, in addition to NewStandardTokenizer?:

          1. SentenceBoundaryCharFilter (clunky name, but descriptive)
          2. LowerCaseFilter
          3. StopFilter
          Robert Muir added a comment -

          Which of these should be included, in addition to NewStandardTokenizer?:

          I would say only lowercase + stop.

          the charfilter would just be another optional charfilter, like html-stripping.
          I don't think it should be enabled by standardanalyzer by default, especially for performance reasons.

          Robert Muir added a comment -

          OK I created LUCENE-2498 for the sentence boundary charfilter idea.

          I think this is really unrelated to standardtokenizer except that it's also using the same jflex functionality.

          Steve Rowe added a comment -

          This is the benchmarking patch brought up-to-date with trunk, and with NewStandardTokenizer added to the list of tested tokenizers.

          Here are the results on my machine (Sun JDK 1.6.0_13; Windows Vista/Cygwin; best of five):

          Operation recsPerRun rec/s elapsedSec
          NewStandardTokenizer 1268450 654,852.88 1.94
          UAX29Tokenizer 1268451 679,042.31 1.87
          StandardTokenizer 1262799 680,021.00 1.86
          RBBITokenizer 1268451 575,261.25 2.20
          ICUTokenizer 1268451 557,315.88 2.28

          NewStandardTokenizer is consistently slower than UAX29Tokenizer and StandardTokenizer, but still faster than the ICU implementation; it appears that URL and Email tokenization have slowed things down a little bit. IMHO, recognizing them is worth taking a small hit in throughput.

          Steve Rowe added a comment -

          After a discussion with Robert on #lucene, I think this issue is complete - we can add more stuff later in a separate issue.

          Robert Muir added a comment -

          zip file of my current integration progress.

          the zip file is relevant to modules/analysis/common.

          not all the tests pass as we have to figure a few things out...
          the first thing to figure out is TestEmails/Urls in TestStandardAnalyzer (currently commented out)

          the problem is how to get the bracketed rules to work without actually including the brackets in the tokens, while using StandardTokenizerInterface.

          Steve Rowe added a comment -

          Robert,

          Special handling for bracketed URLs makes no sense - that rule can be dropped.

          Bracketed emails are useful, though, since the domain in the host portion doesn't need to be a registerable domain. I think this could be handled with two changes to the bracketed email rule. Here it is in the form you wrote:

          "<" {EMAILaddressLoose} ">" { return EMAIL_TYPE; }
          

          Here is my suggestion:

          "<" {EMAILaddressLoose} / ">" { ++zzStartRead; return EMAIL_TYPE; }
          

          This combines incrementing the start of the matched region (++zzStartRead;) and lookahead for the trailing angle bracket (/ ">"). AFAICT, directly modifying zzStartRead shouldn't cause any problems. After this rule completes, the trailing angle bracket will be at the beginning of the remaining text to be matched.

          Robert Muir added a comment -

          ok here is a patch file. before applying it, you have to run these commands:

          # original grammar -> ClassicTokenizerImpl
          svn move modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImplOrig.java modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicTokenizerImpl.java
          svn move modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImplOrig.jflex modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicTokenizerImpl.jflex
          # this one is not needed, this patch becomes the new grammar
          svn delete modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl31.java
          svn delete modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl31.jflex
          # expose the old tokenizer, not just via Version, but also as ClassicAnalyzer/Tokenizer/Filter
          svn copy modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicAnalyzer.java
          svn copy modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicTokenizer.java
          svn copy modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardFilter.java modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicFilter.java
          svn copy modules/analysis/common/src/test/org/apache/lucene/analysis/core/TestStandardAnalyzer.java modules/analysis/common/src/test/org/apache/lucene/analysis/core/TestClassicAnalyzer.java
          # temporarily edit solr/src/java/org/apache/solr/analysis/StandardFilterFactory.java (change the $Id hossman.... to just $Id$)
          # apply the patch.
          

          if you want to iterate on the patch, make your changes and generate a new patch with 'svn diff --no-diff-deleted'.

          some notes:

          • The patch is against 4.0, but I think we can do this in 3.1. All the back compat is preserved, etc.; we just gotta figure a few things out. All the tests pass, though.
          • The patch is large mainly because of the DFA size. I have some concerns about this... the email/url stuff seems to be the culprit, as the UAX#29-generated class is only 12KB, about the same size as our existing StandardTokenizer.
          • I gave backwards compat (you get the old behavior) with Version, but also set up ClassicAnalyzer/Tokenizer/Filter for those that want the...not-so-international-friendly old version, for its company identification, etc.
          • I modified the token types for ICU to be more consistent with this.
          • StandardFilter is currently a no-op for the new grammar. In my opinion this is a place to implement the 'more sophisticated' logic that the standard mentions for certain scripts. We can use token types (IDEOGRAPHIC, SOUTHEAST_ASIAN) to drive this (a rough sketch of what I mean is below). This way the StandardAnalyzer is a reasonable tokenizer for most languages.

          So, not completely sure this is the best approach, but it is one... the patch is still rough around the edges but at least now we can iterate more easily on it.
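
          Here is a bare-bones skeleton of what I mean by driving StandardFilter off token types; it is only a sketch, it currently passes everything through (matching the no-op in the patch), and the type strings are assumptions, not final names:

          import java.io.IOException;
          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

          public final class TypeDrivenStandardFilterSketch extends TokenFilter {
            private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

            public TypeDrivenStandardFilterSketch(TokenStream input) {
              super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
              if (!input.incrementToken()) {
                return false;
              }
              String type = typeAtt.type();
              if ("<SOUTHEAST_ASIAN>".equals(type) || "<IDEOGRAPHIC>".equals(type)) {
                // hook for the 'more sophisticated' per-script handling (syllables, shingles, ...)
              }
              return true; // currently just passes every token through unchanged
            }
          }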

          Robert Muir added a comment -

          attached is an updated patch.

          Steven and I debugged the large DFA size and reduced it somewhat (.class file drops from 167,945 bytes to 52,399 bytes).

          Steve Rowe added a comment -

          Attaching benchmark patch brought up-to-date with Robert's last patch.

          Here are the current results on my machine:

          Operation recsPerRun rec/s elapsedSec
          ClassicTokenizer 1262799 644,943.31 1.96
          ICUTokenizer 1268451 546,040.06 2.32
          RBBITokenizer 1268451 570,090.31 2.22
          StandardTokenizer 1268450 659,963.56 1.92
          UAX29Tokenizer 1268451 643,883.75 1.97
          Robert Muir added a comment -

          Thanks Steven! Looks to me like we have resolved the perf problem?!

          Steve Rowe added a comment -

          Thanks Steven! Looks to me like we have resolved the perf problem?!

          I don't know... I'll run it a few more times tonight and see if it's consistent.

          Steve Rowe added a comment -

          I ran it three more times, and it appears that the difference between ClassicTokenizer, UAX29Tokenizer, and the new StandardTokenizer is in the noise:

          Operation recsPerRun rec/s elapsedSec
          ClassicTokenizer 1262799 665,682.12 1.90
          ICUTokenizer 1268451 553,666.94 2.29
          RBBITokenizer 1268451 575,261.25 2.20
          StandardTokenizer 1268450 658,935.06 1.92
          UAX29Tokenizer 1268451 642,579.00 1.97
          Operation recsPerRun rec/s elapsedSec
          ClassicTokenizer 1262799 668,501.31 1.89
          ICUTokenizer 1268451 546,275.19 2.32
          RBBITokenizer 1268451 563,255.31 2.25
          StandardTokenizer 1268450 651,824.25 1.95
          UAX29Tokenizer 1268451 664,806.62 1.91
          Operation recsPerRun rec/s elapsedSec
          ClassicTokenizer 1262799 674,932.69 1.87
          ICUTokenizer 1268451 541,841.50 2.34
          RBBITokenizer 1268451 586,431.38 2.16
          StandardTokenizer 1268450 635,814.56 2.00
          UAX29Tokenizer 1268451 650,487.69 1.95
          Steve Rowe added a comment -

          I tried increasing the number of documents in the benchmark alg from 10k to 50k, but apparently 50k docs was too much to fit into my OS FS cache, because it thrashed the whole time, and performance was more than an order of magnitude worse.

          I increased the number of rounds from 5 to 25, and increased the number of documents from 10k to 20k - below are three runs with these settings:

          Operation recsPerRun rec/s elapsedSec
          ClassicTokenizer 2467769 669,134.75 3.69
          ICUTokenizer 2481688 548,924.56 4.52
          RBBITokenizer 2481688 573,270.50 4.33
          StandardTokenizer 2481687 656,704.69 3.78
          UAX29Tokenizer 2481688 661,254.44 3.75
          Operation recsPerRun rec/s elapsedSec
          ClassicTokenizer 2467769 667,867.12 3.69
          ICUTokenizer 2481688 546,025.94 4.54
          RBBITokenizer 2481688 576,466.44 4.30
          StandardTokenizer 2481687 656,878.50 3.78
          UAX29Tokenizer 2481688 665,510.31 3.73
          Operation recsPerRun rec/s elapsedSec
          ClassicTokenizer 2467769 664,092.81 3.72
          ICUTokenizer 2481688 551,486.25 4.50
          RBBITokenizer 2481688 581,191.56 4.27
          StandardTokenizer 2481687 655,317.38 3.79
          UAX29Tokenizer 2481688 663,021.12 3.74

          These are more consistent. I think the ~3% performance hit for the new StandardTokenizer over ClassicTokenizer is acceptable.

          Robert Muir added a comment -

          Steven, thanks for all these benchmarks.

          I think any perf issues are resolved; I also think the DFA size is more manageable from our previous changes, and arguably OK now (I'll defer to your judgement on whether we need to attack this more, though).

          I have a few more questions:

          • Are there still IPv6 issues you wanted to address? I can't remember (lost in the std documents) but I think you found grammar improvements?
          • What about standardfilter with the new scheme? The previous impl does some 'cleanup' on the tokenizer; in the latest patch it's a TODO/no-op for Version >= 3.1. Are there any email/url/other things we need to do here? On the Unicode side, I think if we want to do anything here, it should be the 'more sophisticated mechanism' for the SE Asian scripts (as then its name Standard would also make sense)... leave as a no-op for now with Version >= 3.1?
          Steve Rowe added a comment -

          I think any perf issues are resolved; I also think the DFA size is more manageable from our previous changes, and arguably OK now (I'll defer to your judgement on whether we need to attack this more, though).

          To address the DFA size I want to try your previous suggestion of a simpler IPv6 regex in the JFlex grammar, then full validation in the action via a java.util.regex NFA. You've previously said that you thought returning a new type like INVALID_URL would be fine, but I'd prefer not to do that - I want to back out and try an alternate path if this action-based validation fails.
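
          Roughly, the idea looks like this in plain Java; the real check would live in the JFlex rule's action (backing the text out with yypushback on failure), and both patterns below are simplified placeholders for illustration, not the grammar I would actually use:

          import java.util.regex.Pattern;

          final class LooseThenStrictIPv6Sketch {
            // roughly what a small JFlex rule would accept: any run of hex digits, dots, and colons
            private static final Pattern LOOSE = Pattern.compile("[0-9A-Fa-f:.]+");
            // a stricter java.util.regex check run in the action before emitting a URL token;
            // deliberately simplified: it only covers the full form and the "::"-compressed form
            private static final Pattern STRICT = Pattern.compile(
                "([0-9A-Fa-f]{1,4}:){7}[0-9A-Fa-f]{1,4}"
                + "|([0-9A-Fa-f]{1,4}:)*::([0-9A-Fa-f]{1,4}(:[0-9A-Fa-f]{1,4})*)?");

            static boolean accept(CharSequence candidate) {
              return LOOSE.matcher(candidate).matches() && STRICT.matcher(candidate).matches();
            }
          }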

          What about standardfilter with the new scheme?

          I don't have an opinion on this one, except that it seems a little weird to have a no-op filter in the standard analyzer chain.

          Steve Rowe added a comment - - edited

          I think any perf issues are resolved; I also think the DFA size is more manageable from our previous changes, and arguably OK now (I'll defer to your judgement on whether we need to attack this more, though).

          To address the DFA size I want to try your previous suggestion of a simpler IPv6 regex in the JFlex grammar, then full validation in the action via a java.util.regex NFA. You've previously said that you thought returning a new type like INVALID_URL would be fine, but I'd prefer not to do that - I want to back out and try an alternate path if this action-based validation fails.

          The attached StandardTokenizerImpl.jflex is the result of my attempt, which appears to be successful - tests all pass.

          However, the resultant .class file size is even larger than before: 67,947 bytes.

          I give up: I think we should go with the full-blown IPv6 regex as part of the DFA.

          Steve Rowe added a comment -

          This patch contains 4 modifications:

          1. The IPv6Address macro in StandardTokenizerImpl.jflex now makes everything in front of the double colon optional, so that e.g. "::" alone is a valid address.
          2. The EMAILbracketedHost macro in StandardTokenizerImpl.jflex now contains IPv6 and IPv4 addresses, along with a comment about how DFA minimization keeps the size of the resulting DFA in check.
          3. Renamed the EMAILaddressStrict macro to EMAIL in StandardTokenizerImpl.jflex.
          4. The root.zone file format has changed (hunh? why? I don't know anything about DNS...), so I modified GenerateJflexTLDMacros.java to parse the current format in addition to the previous format.

          This version looks roughly the same in terms of performance - below are the numbers for the 25 round, 20k doc benchmark:

          Operation recsPerRun rec/s elapsedSec
          ClassicTokenizer 2467769 661,245.69 3.73
          ICUTokenizer 2481688 544,827.25 4.55
          RBBITokenizer 2481688 571,817.50 4.34
          StandardTokenizer 2481687 650,848.94 3.81
          UAX29Tokenizer 2481688 655,317.69 3.79

          For some reason, the size of the .class file for StandardTokenizerImpl.jflex is smaller: 51,798 bytes.

          Steve Rowe added a comment -

          Attached patch includes a Perl script to generate a test based on Unicode.org's WordBreakTest.txt UAX#29 test sequences, along with the Java source generated by the Perl script. Both UAX29Tokenizer and StandardTokenizerImpl are tested, and all Lucene and Solr tests pass. I added a note to modules/analysis/NOTICE.txt about the Unicode.org data files used in creating the test class.
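
          For reference, each generated check boils down to an assertion of roughly this shape; the sample input and expected tokens below are made up for illustration (not taken from WordBreakTest.txt), and StandardAnalyzer is used here just for brevity:

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.BaseTokenStreamTestCase;
          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.util.Version;

          public class WordBreakSampleTest extends BaseTokenStreamTestCase {
            public void testSampleSequence() throws Exception {
              Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
              // "foo" + U+0301 COMBINING ACUTE ACCENT stays a single word token (WB4 ignores Extend)
              assertAnalyzesTo(analyzer, "foo\u0301 bar", new String[] { "foo\u0301", "bar" });
            }
          }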

          This test suite turned up a problem in both tested grammars: the WORD_TYPE rule could match zero characters, and so in certain cases involving underscores it returned a zero-length token instead of end-of-stream. I fixed the issue by changing the rule in both grammars to require at least one character for a match to succeed. All test sequences are now successfully tokenized.

          I attempted to also test ICUAnalyzer, but since it downcases, the expected tokens are incorrect in some cases. I didn't pursue it further.

          I ran the best-of-25-rounds/20k docs benchmark, and the grammar change has not noticeably affected the results.

          Steve Rowe added a comment -

          Removed unnecessarily re-generated WikipediaTokenizerImpl.java in the previous patch from this patch.

          Simon Willnauer added a comment -

          Last update is a month ago - any idea how far away this is from being committable?

          Steve Rowe added a comment -

          Last update is a month ago - any idea how far away this is from being committable?

          Trunk version functionality is complete. Needs docs and backporting to 3.x branch.

          I'm finishing up LUCENE-2611 (to make backporting a little less painful), and then I'll get back to this issue.

          Rough completion estimate for this issue: 2010-09-13 @ 02:37 GMT-5.

          Simon Willnauer added a comment -

          Rough completion estimate for this issue: 2010-09-13 @ 02:37 GMT-5.

          Awesome!

          Robert Muir added a comment -

          Trunk version functionality is complete. Needs docs and backporting to 3.x branch.

          I agree, the testing is now very nice.
          For example, when we want to bump to Unicode 6.0 we can autogenerate a test class from the 6.0 data files with the perl script.
          Great work.

          Robert Muir added a comment -

          I'm finishing up LUCENE-2611 (to make backporting a little less painful), and then I'll get back to this issue.

          By the way, I don't think you need to produce an explicit 3.x patch?
          We should be able to svn merge without much trouble, I think.

          Steve Rowe added a comment -

          By the way, I don't think you need to produce an explicit 3.x patch?
          We should be able to svn merge without much trouble, I think.

          Great, for some reason I thought you had said that backporting would require lots of decisions, so I assumed it would require a separate patch.

          That leaves documentation. I think I need a MIGRATE.txt entry, some package-level documentation, and notes cross-referencing from ClassicTokenizer/Analyzer to StandardTokenizer/Analyzer and vice-versa. Anything else?

          Robert Muir added a comment -

          That leaves documentation. I think I need a MIGRATE.txt entry, some package-level documentation, and notes cross-referencing from ClassicTokenizer/Analyzer to StandardTokenizer/Analyzer and vice-versa. Anything else?

          Agreed, though the change is completely backwards compatible, so I don't know if we need a MIGRATE.txt entry?

          (Separately, I realize it's a big change, but there is no back-compat issue.)

          Steve Rowe added a comment -

          Updated to trunk. All tests pass. Documentation improved at package and class level. modules/analysis/CHANGES.txt entry included.

          I think this is ready to commit.

          Robert Muir added a comment -

          I think this is ready to commit.

          I think so too; I applied the svn moves and the patch, and all tests pass.

          One last question, it might be reasonable to move ClassicTokenizer and friends to .classic package?
          There is nothing standards-based about them at all and it makes the .standard directory a little confusing.

          To do this I would have to make StandardTokenizerInterface public, but it could be marked @lucene.internal.

          Robert Muir added a comment -

          One last question, it might be reasonable to move ClassicTokenizer and friends to .classic package?

          By the way, if we decide this is best, I would like to open a new issue for it.
          We don't have to do everything in one step, and currently this patch cleanly applies with the svn move instructions.

          So I would like to commit this patch in a few days as-is if there are no objections.

          If we want to improve packaging, let's open a follow-up issue.

          Steve Rowe added a comment -

          One last question, it might be reasonable to move ClassicTokenizer and friends to .classic package?

          I agree with your arguments about moving to .classic package. I think new users won't care about what StandardTokenizer/Analyzer used to be.

          My only concern here is existing users' upgrade experience - users should be able to continue using the ClassicTokenizer if they want to keep current behavior. Right now, they can do that by setting Version to 3.0 in the constructor to StandardTokenizer/Analyzer. I think this should remain the case until Lucene version 5.0.
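
          For example (a small sketch assuming the Version-taking StandardAnalyzer constructor):

            import org.apache.lucene.analysis.standard.StandardAnalyzer;
            import org.apache.lucene.util.Version;

            public class VersionedAnalyzers {
              // Pre-3.1 (Classic) behavior stays available through the version constant:
              static final StandardAnalyzer CLASSIC_BEHAVIOR = new StandardAnalyzer(Version.LUCENE_30);

              // Passing 3.1 or later selects the new UAX#29-based grammar:
              static final StandardAnalyzer UAX29_BEHAVIOR = new StandardAnalyzer(Version.LUCENE_31);
            }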

          Robert Muir added a comment -

          My only concern here is existing users' upgrade experience - users should be able to continue using the ClassicTokenizer if they want to keep current behavior. Right now, they can do that by setting Version to 3.0 in the constructor to StandardTokenizer/Analyzer. I think this should remain the case until Lucene version 5.0.

          I agree completely, I think we can do this though with the Classic stuff in a separate package? (like we can have both)

          Steve Rowe added a comment -

          I agree completely, I think we can do this though with the Classic stuff in a separate package? (like we can have both)

          Right, I didn't mean to say that moving the Classic stuff out of .standard was antithetical to preserving Classic functionality in StandardTokenizer - I just wanted to make sure that we agreed that that move doesn't mean complete separation (yet). Sounds like we agree.

          Simon Willnauer added a comment -

          Assignee: Steven Rowe (was: Robert Muir)

          Yay!

          Steve Rowe added a comment -

          Sync'd to trunk (TestThaiAnalyzer.java had conflicts). All tests pass. Committing shortly.

          Steve Rowe added a comment -

          Committed to trunk r1002032.

          I'll work on merging to the 3.X branch tomorrow.

          Steve Rowe added a comment -

          Backported to 3.x branch revision 1002468

          Robert Muir added a comment -

          I'd like to re-open this issue.

          I think that full URLs as tokens is not a good default for StandardTokenizer, because I don't think users ever search
          on full URLs. It's also dangerous: many apps that upgrade will find themselves with huge terms dictionaries,
          and different performance characteristics.

          I think it would be better if StandardTokenizer just implemented the UAX#29 algorithm. The URL identification we could
          keep as a separate tokenizer for people who want it.
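
          To see concretely what ends up in the terms dictionary, here is a small inspection sketch (assuming the Version/Reader constructor and the usual term/type attributes; the input string is just an example, and the exact tokens and types depend on the committed grammar):

            import java.io.StringReader;
            import org.apache.lucene.analysis.standard.StandardTokenizer;
            import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
            import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
            import org.apache.lucene.util.Version;

            public class InspectUrlTokens {
              public static void main(String[] args) throws Exception {
                String text = "read the announcement at http://www.example.com/some/page.html today";
                StandardTokenizer tokenizer =
                    new StandardTokenizer(Version.LUCENE_31, new StringReader(text));
                CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
                TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
                tokenizer.reset();
                while (tokenizer.incrementToken()) {
                  // Prints each emitted term with its type, making it easy to see
                  // whether whole URLs are being indexed as single tokens.
                  System.out.println(term.toString() + "\t" + type.type());
                }
                tokenizer.end();
                tokenizer.close();
              }
            }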

          Michael McCandless added a comment -

          +1

          When I indexed Wikipedia w/ StandardAnalyzer I saw a huge number of full-URL tokens, which is just silly as a default. Inserting WordDelimiterFilter fixed it, but I don't think StandardTokenizer should produce whole URLs as tokens, to begin with.

          Steve Rowe added a comment -

          I think that full URLs as tokens is not a good default for StandardTokenizer, because I don't think users ever search
          on full URLs.

          Probably true, but this is a chicken and egg issue, no? Maybe people never search on full URLs because it doesn't work, because there is no tokenization support for it?

          My preferred solution here, as I said earlier in this issue, is to use a decomposing filter, because when people want full URLs, they can't be reassembled after the separator chars are thrown away by the tokenizer.

          Robert, when I mentioned the decomposition filter, you said you didn't like that idea. Do you still feel the same?

          I'm really reluctant to drop the ability to recognize full URLs. I agree, though, that as a default it's not good.

          Steve Rowe added a comment -

          I don't think StandardTokenizer should produce whole URLs as tokens, to begin with.

          I think StandardAnalyzer should not, by default, produce whole URLs as tokens. But (yay repetition!) if the tokenizer throws away the separator chars, URLs can't be reassembled from their parts.

          Would a URL decomposition filter, with full URL emission turned off by default, work here?

          Robert Muir added a comment -

          because when people want full URLs, they can't be reassembled after the separator chars are thrown away by the tokenizer.

          Well, I don't much like this argument, because it's true about anything.
          Indexing text for search is a lossy thing by definition.

          Yeah, when you tokenize this stuff, you lose paragraphs, sentences, all kinds of things.
          Should we output whole paragraphs as tokens so it's not lost?

          Robert, when I mentioned the decomposition filter, you said you didn't like that idea. Do you still feel the same?

          Well, I said it was a can of worms; I still feel that it is complicated, yes.
          But I mean we do have a ghetto decomposition filter (WordDelimiterFilter) already.
          And someone can use this with the UAX#29+URLRecognizingTokenizer to index these URLs in a variety of ways, including preserving the original full URL too.

          Would a URL decomposition filter, with full URL emission turned off by default, work here?

          It works in theory, but it's confusing that it's 'required' to not get abysmal tokens.
          I would prefer we switch the situation around: make UAX#29 'StandardTokenizer' and give the UAX#29+URL+email+IP+... a different name.

          Because to me, UAX#29 handles URLs in nice ways, e.g. my user types 'facebook' and they get back facebook.com.
          It's certainly simple and won't blow up terms dictionaries...

          Otherwise, creating lots of long, unique tokens (URLs) by default is a serious performance trap, particularly for Lucene 3.x.

          Steve Rowe added a comment -

          I would prefer we switch the situation around: make UAX#29 'StandardTokenizer' and give the UAX#29+URL+email+IP+... a different name.

          UAX29Tokenizer does not have email or hostname recognition. StandardTokenizer has long had these capabilities (though not standard-based). Removing them would be bad.

          Michael McCandless added a comment -

          Would it somehow be possible to allow multiple Tokenizers to work together?

          Today we only allow one (and then any number of subsequent TokenFilters) in the chain, so if your Tokenizer destroys information (e.g. erases the . from the host name) it's hard for subsequent TokenFilters to put it back.

          But if, say, we had a Tokenizer that recognizes hostnames/URLs, one that recognizes email addresses, one for proper names/places/dates/times, and other app-dependent stuff like detecting part numbers and whatnot, then ideally one could simply cascade/compose these tokenizers at will to build up whatever "initial" tokenizer you need for your chain?

          I think our current lack of composability of the initial tokenizer ("there can be only one") makes cases like this hard...

          Robert Muir added a comment -

          UAX29Tokenizer does not have email or hostname recognition. StandardTokenizer has long had these capabilities (though not standard-based). Removing them would be bad.

          That's true, so maybe something in the "middle" / "compromise" is better as a default.

          I just tend to really like plain old "UAX#29" as a default, since it's consistent with how "tokenization" works elsewhere in people's word processors, browsers, etc.
          (e.g. control-F find, that sort of thing), where they don't know the specifics of the content and just want a reasonable default.

          But there might be something else we can do, too.

          Robert Muir added a comment -

          But if, say, we had a Tokenizer that recognizes hostnames/URLs, one that recognizes email addresses, one for proper names/places/dates/times, and other app-dependent stuff like detecting part numbers and whatnot, then ideally one could simply cascade/compose these tokenizers at will to build up whatever "initial" tokenizer you need for your chain?

          I think our current lack of composability of the initial tokenizer ("there can be only one") makes cases like this hard...

          I agree that sounds like a "cool" idea to have, but at the same time, we should try to not make analysis the "wonder-do-it-all" machine.
          I mean, some stuff belongs in the app, and I think that includes a lot of the things you mentioned... e.g. the app can do "NER" and pull
          out proper names/places/dates and put them in separate fields.

          I don't think the analysis chain is the easiest or best place to do this; I would prefer if we tried to keep the complexity down and recognize
          that some things (really a lot of this "recognizer" stuff) might be better implemented in the app.

          Steve Rowe added a comment -

          I just tend to really like plain old "UAX#29" as a default [...] I would prefer if we tried to keep the complexity down

          So we're talking about two separate issues here: a) Lucene's default behavior; and b) Lucene's capabilities.

          For a), you'll have a lot of 'splaining to do if you drop existing functionality (e.g. email and hostname "recognition" – where quotes indicate "bad" things, right? "Cool"!)

          For b), you appear to agree with Marvin Humphries about keeping the product lean and mean: complexity (a.k.a. functionality beyond the default) is bad because it creates maintenance problems.

          we should try to not make analysis the "wonder-do-it-all" machine.

          Why not? Why shouldn't Lucene be a catch-all for "cool" linguistic stuff?

          Robert Muir added a comment -

          So we're talking about two separate issues here: a) Lucene's default behavior; and b) Lucene's capabilities.

          Agreed!

          For a), you'll have a lot of 'splaining to do if you drop existing functionality (e.g. email and hostname "recognition" - where quotes indicate "bad" things, right? "Cool"!)

          To me, recognizing hostnames is specific to what one application might want.
          If you recognize www.facebook.com but my app wants to find this with a query of 'facebook', it can't.
          Yet if we just stick to UAX#29, if a user queries on www.facebook.com and they are unsatisfied with the results,
          that user can always "refine" their query by searching on "www.facebook.com" and they get a PhraseQuery.
          I think this is pretty intuitive and users are used to this... again, this is just for general defaults...

          And again, hostnames are just an example, why do we recognize them and not filenames?
          Yet a lot of people are happy being able to do 'partial filename' matching and not the whole path...
          Users that are unhappy with this 'default' behavior can use double quotes to refine their results.

          And in both cases, apps that need something more specific can use a custom tokenizer.

          Why not? Why shouldn't Lucene be a catch-all for "cool" linguistic stuff?

          In this case I think analysis won't meet their needs anyway. A lot of people wanting to recognize full URLs or proper names (Mike's example)
          actually want to do this in the 'document build' and dump the extracted entities into a separate field, so they can do things like
          facet on this field, or find other documents that refer to the same person. This is because they are trying to 'find structure in the unstructured',
          but it starts to get complicated if we mix this problem with 'feature extraction', which is what I think analysis should be.

          Steve Rowe added a comment -

          Would it somehow be possible to allow multiple Tokenizers to work together?

          The only thing I can think of right now is a new kind of component that feeds raw text (or post-char-filter text) to a configurable set of tokenizers/recognizers, then melds their results using some (hopefully configurable) strategy, like "longest-match-wins" or "create-overlapping-tokens", etc. This would slow things down, of course, since analysis has to be performed multiple times over the same chunk of input text...
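
          A rough sketch of the "longest-match-wins" melding strategy, kept deliberately independent of any existing Lucene API - the Span type and the idea of collecting candidate spans from several recognizers are invented here purely for illustration:

            import java.util.ArrayList;
            import java.util.Collections;
            import java.util.Comparator;
            import java.util.List;

            public class LongestMatchMeld {

              /** A candidate token proposed by one recognizer: [start, end) offsets into the text. */
              static final class Span {
                final int start, end;
                Span(int start, int end) { this.start = start; this.end = end; }
              }

              /** Melds candidates from several recognizers; at each position the longest match wins. */
              static List<Span> meld(List<Span> candidates) {
                List<Span> sorted = new ArrayList<Span>(candidates);
                Collections.sort(sorted, new Comparator<Span>() {
                  @Override
                  public int compare(Span a, Span b) {
                    if (a.start != b.start) return Integer.compare(a.start, b.start);
                    return Integer.compare(b.end, a.end); // longer candidate first at the same start
                  }
                });
                List<Span> accepted = new ArrayList<Span>();
                int lastEnd = 0;
                for (Span s : sorted) {
                  if (s.start >= lastEnd) { // keep only candidates that don't overlap an accepted one
                    accepted.add(s);
                    lastEnd = s.end;
                  }
                }
                return accepted;
              }
            }

          The "create-overlapping-tokens" strategy would presumably keep the overlapping candidates instead and emit them at the same position.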

          Steve Rowe added a comment -

          If you recognize www.facebook.com but my app wants to find this with a query of 'facebook', it can't. Yet if we just stick to UAX#29, if a user queries on www.facebook.com and they are unsatisfied with the results, that

          "www.facebook.com" is way non-intuitive. My guess is the average user would never go there: how is something a phrase, and in need of bounding quotes, if it has no spaces in it?

          Robert Muir added a comment -

          "www.facebook.com" is way non-intuitive

          Well, I'm just saying that people are used to the "UAX#29" behavior I describe:

          • Google and Twitter search engines find and highlight, say, 'cnn' in URLs such as 'http://www.cnn.com/x/y'
          • this is how "find" in apps such as browsers, word processors, even Windows Notepad works.
          • the idea of putting quotes around things to be "more exact" is pretty general, e.g. in Google I refine queries like "documents" with quotes to prevent stemming: try it.

          So I think it's just intuitive and becoming rather universal to put quotes around things to get a "more exact search".

          Like I said, I'm not too picky about how we solve the problem, but I think UAX#29 is a great default... it's used everywhere else...

          Steve Rowe added a comment -

          So I think it's just intuitive and becoming rather universal to put quotes around things to get a "more exact search".

          You've convinced me, though I don't think this idea has been around long enough to qualify as intuitive.

          hostnames are just an example, why do we recognize them and not filenames?

          Although following precedent is important (principle of least surprise), we have to be able to revisit these decisions. My philosophy tends toward kitchen-sinkness, while allowing people to ignore the stuff they don't want (today). So, yeah, I think we should (be able to) recognize filenames, at least as part of a URL-decomposing filter:

          http://www.example.com/path/file%20name.html?param=value#fragment

          =>

          http://www.example.com/path/file%20name.html?param=value#fragment <URL>
          www.example.com <HOSTNAME>
          example.com <HOSTNAME>
          example <HOSTNAME>
          com <HOSTNAME>
          path <URL_PATH_ELEMENT>
          file name.html <URL_FILENAME>
          file name <URL_FILENAME>
          file <URL_FILENAME>
          name <URL_FILENAME>
          html <URL_FILENAME>
          param <URL_PARAMETER>
          value <URL_PARAMETER_VALUE>
          fragment <URL_FRAGMENT>

          Output of each token type could be optional in a URL decomposition filter. The URL decomposition filter could serve as a place to handle punycode, too.
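
          As a back-of-the-envelope illustration of the decomposition itself, using only java.net.URI rather than any proposed Lucene filter (the token type names above are the hypothetical part, and a real TokenFilter would also have to track offsets):

            import java.net.URI;
            import java.net.URLDecoder;
            import java.util.ArrayList;
            import java.util.List;

            public class UrlDecompositionSketch {
              public static void main(String[] args) throws Exception {
                URI uri = new URI("http://www.example.com/path/file%20name.html?param=value#fragment");
                List<String> parts = new ArrayList<String>();

                // Hostname and its individual labels.
                String host = uri.getHost();                        // www.example.com
                parts.add(host);
                for (String label : host.split("\\.")) {
                  parts.add(label);                                 // www, example, com
                }

                // Path elements, percent-decoded.
                for (String element : uri.getRawPath().split("/")) {
                  if (!element.isEmpty()) {
                    parts.add(URLDecoder.decode(element, "UTF-8")); // path, file name.html
                  }
                }

                // Query parameters and values, then the fragment.
                if (uri.getRawQuery() != null) {
                  for (String pair : uri.getRawQuery().split("&")) {
                    for (String piece : pair.split("=")) {
                      parts.add(URLDecoder.decode(piece, "UTF-8")); // param, value
                    }
                  }
                }
                if (uri.getFragment() != null) {
                  parts.add(uri.getFragment());                     // fragment
                }

                System.out.println(parts);
              }
            }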

          I'm not too picky about how we solve the problem, but I think UAX#29 is a great default... it's used everywhere else...

          I think if we remove EMAIL/HOSTNAME recognition, we need to have an alternative that provides the same thing. So we would have UAX#29 tokenizer as default; a UAX29+EMAIL+HOSTNAME tokenizer as the equivalent to the pre-3.1 StandardTokenizer; and a UAX29+URL+EMAIL tokenizer (current StandardTokenizer). Or maybe the last two could be combined: a UAX29+URL+EMAIL tokenizer that provides a configurable feature to not output URLs, but instead HOSTNAMEs and URL component tokens?

          Robert Muir added a comment -

          You've convinced me, though I don't think this idea has been around long enough to qualify as intuitive.

          Well, obviously I don't have hard references to this stuff, but from my interaction with my own users, most of them
          don't even think of double quotes as doing phrases, nor are they technical enough to even know what a phrase
          is or what that means for a search... they just think of it as more exact.

          I think if we remove EMAIL/HOSTNAME recognition, we need to have an alternative that provides the same thing. So we would have UAX#29 tokenizer as default; a UAX29+EMAIL+HOSTNAME tokenizer as the equivalent to the pre-3.1 StandardTokenizer; and a UAX29+URL+EMAIL tokenizer (current StandardTokenizer). Or maybe the last two could be combined: a UAX29+URL+EMAIL tokenizer that provides a configurable feature to not output URLs, but instead HOSTNAMEs and URL component tokens?

          Well, like I said, I'm not particularly picky, especially since someone can always use ClassicTokenizer to get the old behavior,
          which no one could ever agree on, and there were constantly issues about it not recognizing my company's name, etc.

          To some extent, I like UAX#29 because there's someone else making and standardizing the decisions, validating
          that it's not gonna annoy users of major languages, and making sure it works well by default: it's not gonna be the most
          full-featured tokenizer, but there's little chance it will be really annoying. I think this is great for "defaults".

          As for all the other "bonus" stuff, we can always make options, especially if it's some pluggable thing somehow (sorry, not sure how this could work in jflex)
          where you could have options as to what you want to do.

          But again, I think UAX#29 itself is more than sufficient by default, and even hostname recognition etc. is pretty dangerous by default
          (again, my example of searching partial hostnames being flexible to the end user and not baked in, by letting them use quotes).

          Earwin Burrfoot added a comment -

          Would it somehow be possible to allow multiple Tokenizers to work together?

          The fact that Tokenizers now are not TokenFilters bugs me somewhat.
          In theory, you should just feed the initial text as a single monster token from hell into the analysis chain, and then you only have TokenFilters, none/one/some of which might split this token.
          If there are no TokenFilters at all, you get a NOT_ANALYZED case without extra flags, yahoo!

          The only problem here is the need for the ability to wrap an arbitrary Reader in a TermAttribute :/

          But (yay repetition!) if the tokenizer throws away the separator chars, URLs can't be reassembled from their parts.

          Why don't we teach StandardTokenizer to produce tokens for separator chars?
          A special filter at the end of the chain will drop them, so they won't get into the index.
          And in the midst of the filter chain you are free to do whatever you want with them - detect emails/URLs/sentences/whatever.
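
          A tiny sketch of what that end-of-chain filter could look like, assuming (hypothetically) that the tokenizer marked separator tokens with a made-up <SEPARATOR> type - only TokenFilter and TypeAttribute are real Lucene classes here:

            import java.io.IOException;
            import org.apache.lucene.analysis.TokenFilter;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

            /** Drops tokens that an upstream tokenizer marked with the (hypothetical) <SEPARATOR> type. */
            public final class DropSeparatorsFilter extends TokenFilter {
              private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

              public DropSeparatorsFilter(TokenStream input) {
                super(input);
              }

              @Override
              public boolean incrementToken() throws IOException {
                while (input.incrementToken()) {
                  if (!"<SEPARATOR>".equals(typeAtt.type())) {
                    return true; // keep ordinary tokens
                  }
                  // otherwise skip the separator token and keep pulling
                }
                return false;
              }
            }

          A real implementation would also need to fold the dropped tokens' position increments into the next kept token.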

          Steve Rowe added a comment -

          Why don't we teach StandardTokenizer to produce tokens for separator chars?

          I've been thinking about this - the word break rules in UAX#29 are intended for use in break iterators, and tokenizers take that one step further by discarding stuff between some breaks.

          StandardTokenizer is faster, though, since it doesn't have to tokenize the stuff between tokens, so if we go down this route, I think it should go somewhere else: UAX29WordBreakSegmenter or something like that.

          I'd like to have (nestable) SentenceSegmenter, ParagraphSegmenter, etc., the output from which could be the input to tokenizers.
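
          Just to illustrate the segmenter idea with something that already exists, here is a sketch using the JDK's BreakIterator (not a Lucene API, and not the proposed UAX29WordBreakSegmenter):

            import java.text.BreakIterator;
            import java.util.Locale;

            public class SentenceSegmenterSketch {
              public static void main(String[] args) {
                String text = "UAX#29 defines default word boundaries. Sentences could be segmented first. "
                    + "Each sentence could then feed a tokenizer.";
                BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
                sentences.setText(text);
                int start = sentences.first();
                for (int end = sentences.next(); end != BreakIterator.DONE; end = sentences.next()) {
                  // Each sentence would become the input to a downstream tokenizer.
                  System.out.println("[" + text.substring(start, end).trim() + "]");
                  start = end;
                }
              }
            }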

          Robert Muir added a comment -

          In theory, you should just feed the initial text as a single monster token from hell into the analysis chain, and then you only have TokenFilters, none/one/some of which might split this token.
          If there are no TokenFilters at all, you get a NOT_ANALYZED case without extra flags, yahoo!

          The only problem here is the need for the ability to wrap an arbitrary Reader in a TermAttribute :/

          No thanks, I don't want to read my entire documents into RAM and have massive GC'ing going on.
          We don't need to have a mega-tokenizer that solves everyone's problems... this is just supposed to be a good "general-purpose" tokenizer.

          Earwin Burrfoot added a comment -

          No thanks, I don't want to read my entire documents into RAM and have massive GC'ing going on.

          This is obvious. And that's why I was talking about wrapping Reader in an Attribute, not copying its contents.
          How to do so is much less obvious. And that's why I called it a problem.

          We don't need to have a mega-tokenizer that solves everyone's problems... this is just supposed to be a good "general-purpose" tokenizer.

          Exactly. That's why I'm thinking of a way to get some composability, instead of having to fully rewrite the tokenizer once you want extras.

          Robert Muir added a comment -

          This is obvious. And that's why I was talking about wrapping Reader in an Attribute, not copying its contents.
          How to do so is much less obvious. And that's why I called it a problem.

          It's not a problem, it's just not possible, because you don't know the required context of some downstream
          "composed" "partial-tokenizer" or whatever, so it must all be read in...

          I don't think we need to provide a FooAnalyzer or even a FooTokenizer that solves everyone's special-case problems;
          it's domain-dependent and not possible anyway... these are just general ones that solve a majority of use cases, examples really.

          This is why I think a simple UAX#29 standard should be the default... we can certainly have alternatives that do certain common things
          that people want, though; no problem with that.

          Steve Rowe added a comment -

          See LUCENE-2763 for swapping UAX29Tokenizer and StandardTokenizer.


            People

            • Assignee:
              Steve Rowe
              Reporter:
              Shyamal Prasad
            • Votes:
              0
              Watchers:
              2

                Time Tracking

                 Estimated:
                 Original Estimate - 0.5h
                 Remaining:
                 Remaining Estimate - 0.5h
                 Logged:
                 Time Spent - Not Specified
