Solr
  1. Solr
  2. SOLR-4123

ICUTokenizerFactory - per-script RBBI customization

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Initially this started out as an idea for a configuration knob on ICUTokenizer that would allow me to tell it not to tokenize on punctuation. Through IRC discussion on #lucene, it sorta ballooned. The committers had a long discussion about it that I don't really understand, so I'll be including it in the comments.

      I am a Solr user, so I would also need the ability to access the configuration from there, likely either in schema.xml or solrconfig.xml.

      1. SOLR-4123.patch
        22 kB
        Steve Rowe
      2. SOLR-4123.patch
        22 kB
        Steve Rowe
      3. SOLR-4123.patch
        20 kB
        Steve Rowe
      4. SOLR-4123.patch
        20 kB
        Steve Rowe
      5. SOLR-4123.patch
        5 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        you can do this from the java code, but the factory has a TODO about adding support for this.

        In my opinion the simplest would be that you can provide some script+textfile pairs and it makes an ICUTokenizerConfig (i think delegating to the default one otherwise?)

        at least... this is more customization than you have today with this factory (which is zero)

        Show
        Robert Muir added a comment - you can do this from the java code, but the factory has a TODO about adding support for this. In my opinion the simplest would be that you can provide some script+textfile pairs and it makes an ICUTokenizerConfig (i think delegating to the default one otherwise?) at least... this is more customization than you have today with this factory (which is zero)
        Hide
        Shawn Heisey added a comment -

        The IRC conversation:

        13:51 < sarowe> btw, icutokenizer is customizable - you can find the default
                        grammar at lucene/analysis/icu/src/data/uax29/Default.rbbi
        13:51 < sarowe> the dafault grammar is used when the text is not thai, lao, etc.
        13:51 < elyograg> i'm using solr, is any of that exposed there?
        13:52 < sarowe> you'd have to rebuild Default.brk and then repackage lucene-icu
                        jar, then include your rebuilt jar in solr classpath
        13:52 < sarowe> that's if you don't want to rename it
        13:53 < sarowe> more better would be to clone ICUTokenizer sources, tweak
                        Default.rbbi, build and include a diff jar with Solr classpath
        13:53 < hoss> sarowe: i don't really know anythign about ICU< but are these
                      ".brk" files loaded at run time? could we make the factory take
                      one as an arg?
        13:53 < elyograg> I'll have to look into that.  I'd probably make a new class.
        13:53 < sarowe> no
        13:53 < sarowe> :)
        13:53 < sarowe> I meant no to including the .brk file at runtime
        13:54 < sarowe> they get compiled into some binary blob
        13:54 < sarowe> that gets used by ICU
        13:55 < hoss> ok .. if users can create their own *.rbbi file, and then use
                      that to generate a *.brk file, and then use that to generate a
                      binary blob file, could we make the ICU Factory configurable
                      about which blob file to load?
        13:55 < sarowe> 'ant genrbbi' under lucene/analysis/icu/ does the binary blob
                        building, using the ICU RBBI rule compiler
        13:55 < sarowe> hoss: hmm, potentially, yes
        13:56 < sarowe> the .brk file is the binary blob, BTW
        13:57 < hoss> sarowe: how long does the binary blob conversaion take?  any
                      reason it couldn't be done on startup?
        13:57 < hoss> (since it's aparently java code)
        13:58 < sarowe> hoss: I think it's pretty fast - single digit seconds
        13:59 < hoss> so maybe it could all be done by the factory?
        13:59 < hoss> i have no idea if that makes sense .. just spit-balling
        13:59 < sarowe> not sure how valuable runtime generation would be though
        13:59 < hoss> sure ... i'm just asking because anytime the answer is "you need
                      to rebuild a stock jar" i start wondering if/how we can make that
                      more configurable
        13:59 < sarowe> I guess one benefit is that although people would have to know
                        the RBBI syntax, they wouldn't have to know how to drive the
                        Lucene build
        14:00 < hoss> right ... but if it's better/easier to just tell them "here's
                      this little java tool, run it" that's fine too
        14:00 < hoss> i'm just asking about the flexibility
        14:00 < sarowe> right, the current build doesn't have any way to substitute
                        alternatives
        14:00 < hoss> if we can eliminate a tool and make it all magic .. people like
                      magic
        14:01 < sarowe> I worry about syntax errors in a runtime Solr though
        14:01 < sarowe> this likely should be offline
        14:01 < sarowe> I mean the .brk compilation
        14:02 < hoss> fair enough ... i don't know enough about these files to know how
                      likely it is people will fuck up on editing them
        14:02 < sarowe> it's a syntax that only exists for that tool AFAIK
        14:02 < hoss> like i said: if there's a tool people can use to generate the
                      blob files, and then point at them from factory config, thta
                      alone sounds like a huge win over "recompile lucene"
        14:02 < sarowe> and misplaced curly brackets, e.g., is common enough in any
                        language
        14:03 < sarowe> I agree
        14:04 < sarowe> the JFlex-based grammars, by contrast, are much more tightly
                        bound with the java - recompilation is unavoidable
        14:05 < hoss> i know ... it sucks ... everytime i try to think about how to
                      make a really nicely confiugrable parser the whole jflex/javacc
                      compile to javacode thing depressess the hell out of me
        14:06 < sarowe> I think the best we can do there is introduce configuration
                        knobs into the generated java
        14:07 < sarowe> for example, the tokenize-punctuation proposal could be
                        implemented as a boolean option on the parser(s)
        14:08 < sarowe> so that by default the current behavior happens, but if people
                        ask for it, they could get punctuation tokens
        14:08 < sarowe> (actually, whitespace should be tokenized too, not just
                        punctuation - I mean all the stuff that currently gets tossed)
        14:10 < hoss> yeah ... just like autoGeneratePhraseQueries ... i'd really like
                      to make whole syntax options more conifgurable though (eg:
                      "should quotes be special syntax?", "what should ':' do?", all
                      the things edismax currently does i na really hacky way, etc...)
        14:10 < hoss> that kind of stuff is nearly impossible unless you can built up
                      the grammer from sub-pieces at runtime
        14:21 < rmuir> sarowe: sorry just catching up... i dont understand this talk of
                       binary blogs or whatever
        14:22 < rmuir> you just need this factory to support
                       ICUTOkenizer(TokenizerConfig) ctor ?
        14:22 < rmuir> there is no need for binary blobs or jar rebuilding
        14:22 < rmuir> instead some just way to declare that ICUTokenizerConfig via xml.
        14:23 < sarowe> rmuir: I'm looking at DefaultICUTokenizerConfig right now
        14:23 < sarowe> that would be a pretty complex XML format
        14:23 < rmuir> maybe...
        14:24 < rmuir> maybe it should just be some way to override the default rather
                       than redefine completely
        14:24 < sarowe> and if people want to change rules, they still need to involve
                        the RBBI compiler
        14:24 < rmuir> no they dont
        14:24 < rmuir> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedBreakIterator.html#RuleBasedBreakIterator%28java.lang.String%29
        14:24 < rmuir> the factory just calls that
        14:25 < sarowe> that's not how it's currently done for icutokenizer though
        14:25 < sarowe> in lucene I mean
        14:25 < rmuir> yeah it is
        14:25 < rmuir> look at icutokenizerconfig
        14:25 < rmuir>   /** Return a breakiterator capable of processing a given
                       script. */
        14:25 < rmuir>   public abstract BreakIterator getBreakIterator(int script);
        14:25 < rmuir> i dont understand whats confusing about this :)
        14:25 < rmuir> if you make your own config, you can return one you made with
                       RBBI(String)
        14:26 < rmuir> the *default* implementation is just optimized via these .brks
                       so it loads faster.
        14:26 < sarowe> aha, so as i said, the current impl doesn't do this
        14:26 < rmuir> you said the icutokenizer
        14:26 < rmuir> it has no impl
        14:26 < rmuir> it has a default config really
        14:26 < rmuir> the tokenizer doesnt know about any of this
        14:26 < sarowe> ok, I get it
        14:27 < rmuir> anyway there is a TODO in the factory about this
        14:27 < rmuir> would be nice if someone has ideas on what could be a useful
                       format
        14:28 < sarowe> cool
        14:28 < rmuir> i dont think it needs to support all the possible customizations
                       (e.g. typing things differently and so on)
        14:28 < rmuir> just something simple
        14:28 < rmuir> maybe some way to "@override" the default one
        14:28 < rmuir> like you give it a list of script/textfile pairs
        14:32 < rmuir> if you use that RBBI(String) ctor it fully syntax checks and
                       stuff also
        14:32 < sarowe> cool
        14:32 < rmuir> actually if you look at the compiler i have to redundantly call
                       that just for that reason :)
        14:32 < rmuir> the general approach is you do this once, and then .clone()
                       whatever you make
        14:33 < rmuir> like collators
        14:35 < sarowe> what's redundant?  I don't see that
        14:36 < rmuir> in the compiler?
        14:36 < rmuir>       /*
        14:36 < rmuir>        * if there is a syntax error, compileRules() may succeed.
                       the way to
        14:36 < rmuir>        * check is to try to instantiate from the string.
                       additionally if the
        14:36 < rmuir>        * rules are invalid, you can get a useful syntax error.
        14:36 < rmuir>        */
        14:36 < rmuir>       try {
        14:36 < rmuir>         new RuleBasedBreakIterator(rules);
        14:36 < rmuir>       } catch (IllegalArgumentException e) {
        14:36 < rmuir>         /*
        14:36 < rmuir>          * do this intentionally, so you don't get a massive
                       stack trace
        14:36 < rmuir>          * instead, get a useful syntax error!
        14:36 < rmuir>          */
        14:36 < rmuir>         System.err.println(e.getMessage());
        14:36 < rmuir>         System.exit(1);
        14:36 < rmuir>       }
        14:36 < sarowe> oh!
        14:36 < rmuir> thats why i said its actually the opposite of what you might
                       think
        14:36 < sarowe> I was missing the fact that the compiler is in lucene and not
                        in ICU
        14:37 < rmuir> compileRules misses the checking
        14:37 < rmuir> but the string-ctor checks
        14:37 < rmuir> (at least at the time i wrote this thing thats how it worked)
        14:37 < sarowe> cool
        14:38 < rmuir> anyway, would be cool if we could enhance the factory
        14:38 < rmuir> as the rules are documented and what not pretty well
        14:38 < sarowe> agreed
        14:39 < sarowe> and if people came up with useful alternatives, those could be
                        bundled and chosen from
        14:39 < sarowe> i mean useful default.rbbi alternatives
        14:40 < sarowe> but this could also be used to cover other scripts not
                        currently covered
        14:40 < rmuir> i think its actually less useful to customize that?
        14:40 < sarowe> oh?
        14:40 < rmuir> i think its more useful to say 'i want to change latin script
                       without having to worry about fucking up chinese'
        14:40 < sarowe> what do you think people will want to customize
        14:40 < rmuir> like
        14:40 < rmuir> it doesnt give you any additional 'power'
        14:40 < rmuir> it just makes writing the rules easier?
        14:41 < rmuir> so thats why i suggested script/text file pairs
        14:41 < sarowe> aha, so you're advocating leaving the default as-is
        14:41 < sarowe> and interposing script-specific additions
        14:41 < rmuir> well ive customized it this way before, and i'm describing how i
                       did it
        14:41 < sarowe> :) cool, ok
        14:41 < rmuir> because the idea is: i dont want to have to deal with
                       researching the impacts on tons of other languages or dealing
                       with the categories and shit
        14:41 < elyograg> my customizations would involve stopping it from doing
                          something that it currently does -- tokenize on punctuation.
        14:42 < rmuir> right, but you shouldnt change the default to accomplish that
        14:42 < rmuir> because such a thing would make horribly long strings if you got
                       some tibetan text
        14:43 < rmuir> you should just just say 'i want to do this for latin script'
        14:43 < elyograg> much of the discussion here is going right over my head, but
                          you're right, I would only want to make this happen for
                          latin.  punctuation for other languages is outside my
                          understanding. :)
        14:44 < rmuir> right so thats the idea of a script/text file pair
        14:44 < rmuir> you dont have to deal with that stuff. its not more expressive,
                       just makes it simpler
        14:46 < rmuir> what you want to do is easy if the factory adds supprot for that
        14:46 < rmuir> its like the hebrew example, it has two customizations (just a
                       set add/subtraction if i remember)
        14:46 < rmuir> so it wont split on " and '
        14:49 < sarowe> so elyograg's customization could be a copy/paste of the
                        Default.rbbi, and just add the punctuation he wants to keep to
                        the $MidLetter definition
        14:50 < rmuir> yeah. like if we do this there should be some example
                       default.rbbi sitting there so this is easy to do
        14:50 < sarowe> (assuming numbers aren't involved)
        14:51 < rmuir> i personaly try to prefer the tweaking the way you suggest, as
                       just customizing the grammar that way.
        14:51 < rmuir> but you can also do something simpler too (e.g. something that
                       looks more like a regexp, since it only worries about latin...)
        14:52 < sarowe> a range of examples showing both these strategies would be good
        14:52 < rmuir> look at the provided ones :)
        14:52 < sarowe> you mean Hebrew, Khmer, Lao, and Myanmar I assume
        14:52 < rmuir> yes
        14:53 < rmuir> i guess those arent great for examples for users :)
        14:53 < sarowe> right, but naive users won't easily understand those
        14:53 < sarowe> right
        
        Show
        Shawn Heisey added a comment - The IRC conversation: 13:51 < sarowe> btw, icutokenizer is customizable - you can find the default grammar at lucene/analysis/icu/src/data/uax29/Default.rbbi 13:51 < sarowe> the dafault grammar is used when the text is not thai, lao, etc. 13:51 < elyograg> i'm using solr, is any of that exposed there? 13:52 < sarowe> you'd have to rebuild Default.brk and then repackage lucene-icu jar, then include your rebuilt jar in solr classpath 13:52 < sarowe> that's if you don't want to rename it 13:53 < sarowe> more better would be to clone ICUTokenizer sources, tweak Default.rbbi, build and include a diff jar with Solr classpath 13:53 < hoss> sarowe: i don't really know anythign about ICU< but are these ".brk" files loaded at run time? could we make the factory take one as an arg? 13:53 < elyograg> I'll have to look into that. I'd probably make a new class. 13:53 < sarowe> no 13:53 < sarowe> :) 13:53 < sarowe> I meant no to including the .brk file at runtime 13:54 < sarowe> they get compiled into some binary blob 13:54 < sarowe> that gets used by ICU 13:55 < hoss> ok .. if users can create their own *.rbbi file, and then use that to generate a *.brk file, and then use that to generate a binary blob file, could we make the ICU Factory configurable about which blob file to load? 13:55 < sarowe> 'ant genrbbi' under lucene/analysis/icu/ does the binary blob building, using the ICU RBBI rule compiler 13:55 < sarowe> hoss: hmm, potentially, yes 13:56 < sarowe> the .brk file is the binary blob, BTW 13:57 < hoss> sarowe: how long does the binary blob conversaion take? any reason it couldn't be done on startup? 13:57 < hoss> (since it's aparently java code) 13:58 < sarowe> hoss: I think it's pretty fast - single digit seconds 13:59 < hoss> so maybe it could all be done by the factory? 13:59 < hoss> i have no idea if that makes sense .. just spit-balling 13:59 < sarowe> not sure how valuable runtime generation would be though 13:59 < hoss> sure ... i'm just asking because anytime the answer is "you need to rebuild a stock jar" i start wondering if /how we can make that more configurable 13:59 < sarowe> I guess one benefit is that although people would have to know the RBBI syntax, they wouldn't have to know how to drive the Lucene build 14:00 < hoss> right ... but if it's better/easier to just tell them "here's this little java tool, run it" that's fine too 14:00 < hoss> i'm just asking about the flexibility 14:00 < sarowe> right, the current build doesn't have any way to substitute alternatives 14:00 < hoss> if we can eliminate a tool and make it all magic .. people like magic 14:01 < sarowe> I worry about syntax errors in a runtime Solr though 14:01 < sarowe> this likely should be offline 14:01 < sarowe> I mean the .brk compilation 14:02 < hoss> fair enough ... i don't know enough about these files to know how likely it is people will fuck up on editing them 14:02 < sarowe> it's a syntax that only exists for that tool AFAIK 14:02 < hoss> like i said: if there's a tool people can use to generate the blob files, and then point at them from factory config, thta alone sounds like a huge win over "recompile lucene" 14:02 < sarowe> and misplaced curly brackets, e.g., is common enough in any language 14:03 < sarowe> I agree 14:04 < sarowe> the JFlex-based grammars, by contrast, are much more tightly bound with the java - recompilation is unavoidable 14:05 < hoss> i know ... it sucks ... everytime i try to think about how to make a really nicely confiugrable parser the whole jflex/javacc compile to javacode thing depressess the hell out of me 14:06 < sarowe> I think the best we can do there is introduce configuration knobs into the generated java 14:07 < sarowe> for example, the tokenize-punctuation proposal could be implemented as a boolean option on the parser(s) 14:08 < sarowe> so that by default the current behavior happens, but if people ask for it, they could get punctuation tokens 14:08 < sarowe> (actually, whitespace should be tokenized too, not just punctuation - I mean all the stuff that currently gets tossed) 14:10 < hoss> yeah ... just like autoGeneratePhraseQueries ... i'd really like to make whole syntax options more conifgurable though (eg: "should quotes be special syntax?" , "what should ':' do ?" , all the things edismax currently does i na really hacky way, etc...) 14:10 < hoss> that kind of stuff is nearly impossible unless you can built up the grammer from sub-pieces at runtime 14:21 < rmuir> sarowe: sorry just catching up... i dont understand this talk of binary blogs or whatever 14:22 < rmuir> you just need this factory to support ICUTOkenizer(TokenizerConfig) ctor ? 14:22 < rmuir> there is no need for binary blobs or jar rebuilding 14:22 < rmuir> instead some just way to declare that ICUTokenizerConfig via xml. 14:23 < sarowe> rmuir: I'm looking at DefaultICUTokenizerConfig right now 14:23 < sarowe> that would be a pretty complex XML format 14:23 < rmuir> maybe... 14:24 < rmuir> maybe it should just be some way to override the default rather than redefine completely 14:24 < sarowe> and if people want to change rules, they still need to involve the RBBI compiler 14:24 < rmuir> no they dont 14:24 < rmuir> http: //icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedBreakIterator.html#RuleBasedBreakIterator%28java.lang. String %29 14:24 < rmuir> the factory just calls that 14:25 < sarowe> that's not how it's currently done for icutokenizer though 14:25 < sarowe> in lucene I mean 14:25 < rmuir> yeah it is 14:25 < rmuir> look at icutokenizerconfig 14:25 < rmuir> /** Return a breakiterator capable of processing a given script. */ 14:25 < rmuir> public abstract BreakIterator getBreakIterator( int script); 14:25 < rmuir> i dont understand whats confusing about this :) 14:25 < rmuir> if you make your own config, you can return one you made with RBBI( String ) 14:26 < rmuir> the * default * implementation is just optimized via these .brks so it loads faster. 14:26 < sarowe> aha, so as i said, the current impl doesn't do this 14:26 < rmuir> you said the icutokenizer 14:26 < rmuir> it has no impl 14:26 < rmuir> it has a default config really 14:26 < rmuir> the tokenizer doesnt know about any of this 14:26 < sarowe> ok, I get it 14:27 < rmuir> anyway there is a TODO in the factory about this 14:27 < rmuir> would be nice if someone has ideas on what could be a useful format 14:28 < sarowe> cool 14:28 < rmuir> i dont think it needs to support all the possible customizations (e.g. typing things differently and so on) 14:28 < rmuir> just something simple 14:28 < rmuir> maybe some way to "@override" the default one 14:28 < rmuir> like you give it a list of script/textfile pairs 14:32 < rmuir> if you use that RBBI( String ) ctor it fully syntax checks and stuff also 14:32 < sarowe> cool 14:32 < rmuir> actually if you look at the compiler i have to redundantly call that just for that reason :) 14:32 < rmuir> the general approach is you do this once, and then .clone() whatever you make 14:33 < rmuir> like collators 14:35 < sarowe> what's redundant? I don't see that 14:36 < rmuir> in the compiler? 14:36 < rmuir> /* 14:36 < rmuir> * if there is a syntax error, compileRules() may succeed. the way to 14:36 < rmuir> * check is to try to instantiate from the string. additionally if the 14:36 < rmuir> * rules are invalid, you can get a useful syntax error. 14:36 < rmuir> */ 14:36 < rmuir> try { 14:36 < rmuir> new RuleBasedBreakIterator(rules); 14:36 < rmuir> } catch (IllegalArgumentException e) { 14:36 < rmuir> /* 14:36 < rmuir> * do this intentionally, so you don't get a massive stack trace 14:36 < rmuir> * instead, get a useful syntax error! 14:36 < rmuir> */ 14:36 < rmuir> System .err.println(e.getMessage()); 14:36 < rmuir> System .exit(1); 14:36 < rmuir> } 14:36 < sarowe> oh! 14:36 < rmuir> thats why i said its actually the opposite of what you might think 14:36 < sarowe> I was missing the fact that the compiler is in lucene and not in ICU 14:37 < rmuir> compileRules misses the checking 14:37 < rmuir> but the string-ctor checks 14:37 < rmuir> (at least at the time i wrote this thing thats how it worked) 14:37 < sarowe> cool 14:38 < rmuir> anyway, would be cool if we could enhance the factory 14:38 < rmuir> as the rules are documented and what not pretty well 14:38 < sarowe> agreed 14:39 < sarowe> and if people came up with useful alternatives, those could be bundled and chosen from 14:39 < sarowe> i mean useful default .rbbi alternatives 14:40 < sarowe> but this could also be used to cover other scripts not currently covered 14:40 < rmuir> i think its actually less useful to customize that? 14:40 < sarowe> oh? 14:40 < rmuir> i think its more useful to say 'i want to change latin script without having to worry about fucking up chinese' 14:40 < sarowe> what do you think people will want to customize 14:40 < rmuir> like 14:40 < rmuir> it doesnt give you any additional 'power' 14:40 < rmuir> it just makes writing the rules easier? 14:41 < rmuir> so thats why i suggested script/text file pairs 14:41 < sarowe> aha, so you're advocating leaving the default as-is 14:41 < sarowe> and interposing script-specific additions 14:41 < rmuir> well ive customized it this way before, and i'm describing how i did it 14:41 < sarowe> :) cool, ok 14:41 < rmuir> because the idea is: i dont want to have to deal with researching the impacts on tons of other languages or dealing with the categories and shit 14:41 < elyograg> my customizations would involve stopping it from doing something that it currently does -- tokenize on punctuation. 14:42 < rmuir> right, but you shouldnt change the default to accomplish that 14:42 < rmuir> because such a thing would make horribly long strings if you got some tibetan text 14:43 < rmuir> you should just just say 'i want to do this for latin script' 14:43 < elyograg> much of the discussion here is going right over my head, but you're right, I would only want to make this happen for latin. punctuation for other languages is outside my understanding. :) 14:44 < rmuir> right so thats the idea of a script/text file pair 14:44 < rmuir> you dont have to deal with that stuff. its not more expressive, just makes it simpler 14:46 < rmuir> what you want to do is easy if the factory adds supprot for that 14:46 < rmuir> its like the hebrew example, it has two customizations (just a set add/subtraction if i remember) 14:46 < rmuir> so it wont split on " and ' 14:49 < sarowe> so elyograg's customization could be a copy/paste of the Default.rbbi, and just add the punctuation he wants to keep to the $MidLetter definition 14:50 < rmuir> yeah. like if we do this there should be some example default .rbbi sitting there so this is easy to do 14:50 < sarowe> (assuming numbers aren't involved) 14:51 < rmuir> i personaly try to prefer the tweaking the way you suggest, as just customizing the grammar that way. 14:51 < rmuir> but you can also do something simpler too (e.g. something that looks more like a regexp, since it only worries about latin...) 14:52 < sarowe> a range of examples showing both these strategies would be good 14:52 < rmuir> look at the provided ones :) 14:52 < sarowe> you mean Hebrew, Khmer, Lao, and Myanmar I assume 14:52 < rmuir> yes 14:53 < rmuir> i guess those arent great for examples for users :) 14:53 < sarowe> right, but naive users won't easily understand those 14:53 < sarowe> right
        Hide
        Robert Muir added a comment -

        actually a lucene issue (the factory is in analysis/icu) but doesnt matter what jira its on.

        I can try to help with the implementation here, my current problem is how it should look to the user.
        factories take Map<String,String>.

        the best idea i have is:

        rulefiles=Latn.rbbi,Cyrl.rbbi
        

        so thats just one key=value, the value is a list of files and it must follow convention (http://unicode.org/iso15924/iso15924-codes.html).

        Show
        Robert Muir added a comment - actually a lucene issue (the factory is in analysis/icu) but doesnt matter what jira its on. I can try to help with the implementation here, my current problem is how it should look to the user. factories take Map<String,String>. the best idea i have is: rulefiles=Latn.rbbi,Cyrl.rbbi so thats just one key=value, the value is a list of files and it must follow convention ( http://unicode.org/iso15924/iso15924-codes.html ).
        Hide
        Steve Rowe added a comment -

        actually a lucene issue (the factory is in analysis/icu) but doesnt matter what jira its on.

        Crap, I missed that. I'll leave it rather than putting it back though.

        Show
        Steve Rowe added a comment - actually a lucene issue (the factory is in analysis/icu) but doesnt matter what jira its on. Crap, I missed that. I'll leave it rather than putting it back though.
        Hide
        Robert Muir added a comment -

        patch with that above syntax (which i'm not sure I even like).

        may not work: haven't tested at all.

        Show
        Robert Muir added a comment - patch with that above syntax (which i'm not sure I even like). may not work: haven't tested at all.
        Hide
        Shawn Heisey added a comment -

        I tried looking at Default.rbbi so I could determine what part of that file defines the tokenization on punctuation, but the identifiers that are there don't mean very much to me.

        I'd like to give this patch a test spin for my use case - turning off tokenization on standard punctuation. If there is any such thing as exotic punctuation, I can probably just let ICU tokenize on that. What would I define for rulefiles, where should the rbbi file(s) go, and what should be in them?

        Show
        Shawn Heisey added a comment - I tried looking at Default.rbbi so I could determine what part of that file defines the tokenization on punctuation, but the identifiers that are there don't mean very much to me. I'd like to give this patch a test spin for my use case - turning off tokenization on standard punctuation. If there is any such thing as exotic punctuation, I can probably just let ICU tokenize on that. What would I define for rulefiles, where should the rbbi file(s) go, and what should be in them?
        Hide
        Steve Rowe added a comment -

        Patch with more tests and example .rbbi files.

        I changed the rulefiles="..." arg format to relax allowable resource names & locations, e.g. rulefiles="Latn:<arbitrary resource path>, ...".

        I added some logic to ICUTokenizerFactory.parseRules() to retry when ResourceLoader.loadResource() fails, after first prepending a "/" to the resource path, because none of the test resources under lucene/analysis/icu/src/test-files/, which is on the test.classpath, were found. I don't understand Object.getClass().getResourceAsStream(), which is delegated to by ResourceLoader.loadResource() - even resources in the same package as the Object can't be found??? By contrast, Object.getClass().getClassLoader().getResourceAsStream() succeeds in finding resources without first prepending a "/". The ClasspathResourceLoader ctor that allows direct specification of the ClassLoader separately from the Class has private access, though.

        Show
        Steve Rowe added a comment - Patch with more tests and example .rbbi files. I changed the rulefiles="..." arg format to relax allowable resource names & locations, e.g. rulefiles="Latn:<arbitrary resource path>, ..." . I added some logic to ICUTokenizerFactory.parseRules() to retry when ResourceLoader.loadResource() fails, after first prepending a "/" to the resource path, because none of the test resources under lucene/analysis/icu/src/test-files/ , which is on the test.classpath , were found. I don't understand Object.getClass().getResourceAsStream() , which is delegated to by ResourceLoader.loadResource() - even resources in the same package as the Object can't be found??? By contrast, Object.getClass().getClassLoader().getResourceAsStream() succeeds in finding resources without first prepending a "/" . The ClasspathResourceLoader ctor that allows direct specification of the ClassLoader separately from the Class has private access, though.
        Hide
        Steve Rowe added a comment -

        I don't understand Object.getClass().getResourceAsStream(), which is delegated to by ResourceLoader.loadResource() - even resources in the same package as the Object can't be found??? By contrast, Object.getClass().getClassLoader().getResourceAsStream() succeeds in finding resources without first prepending a "/". The ClasspathResourceLoader ctor that allows direct specification of the ClassLoader separately from the Class has private access, though.

        Hmm, I just retried removing the package from the path for resource that it's in the same package as the test class, and it now works (why did I think it didn't? I thought I tried that...). Modified patch attached.

        So I guess getClass().getResourceAsStream() makes sense: it only searches the same package as the class unless you prepend a "/". Should I leave in the "/"-prepending fallback?

        Show
        Steve Rowe added a comment - I don't understand Object.getClass().getResourceAsStream() , which is delegated to by ResourceLoader.loadResource() - even resources in the same package as the Object can't be found??? By contrast, Object.getClass().getClassLoader().getResourceAsStream() succeeds in finding resources without first prepending a "/" . The ClasspathResourceLoader ctor that allows direct specification of the ClassLoader separately from the Class has private access, though. Hmm, I just retried removing the package from the path for resource that it's in the same package as the test class, and it now works (why did I think it didn't? I thought I tried that...). Modified patch attached. So I guess getClass().getResourceAsStream() makes sense: it only searches the same package as the class unless you prepend a "/" . Should I leave in the "/" -prepending fallback?
        Hide
        Robert Muir added a comment -

        Hi, thanks for tackling this! you beat me to getting to the tests.

        Yes, we should remove this /-stuff!!!

        Show
        Robert Muir added a comment - Hi, thanks for tackling this! you beat me to getting to the tests. Yes, we should remove this /-stuff!!!
        Hide
        Uwe Schindler added a comment - - edited

        The rule is simple: Never prepend /.

        For tests we added a special case and that may confuse here: ClasspathResourceLoader with a class in ctor param, there you can pass in the base class from which the packages are load. This is added to make writing tests easy: You can pass in a plain file name and it is loaded from the package of the corresponding class. This is to mimic what we always had in our tests: Loading local class resources:

        // this will load the given file from this' class package:
        new ClassPathResourceLoader(getClass()).openResource("file.txt");
        

        Code like Solr uses FileSystemResourceLoader that wants relative path to the local working directory or uses the classloader, but thats for Solr and other applications like ElasticSearch. Tests should use ClasspathResourceLoader(getClass()) and only pass a file name fro their own package.

        Yes, we should remove this /-stuff!!!

        We can do nothing here, the confusion is created by Java's API by itsself: If you call Class.getResource() without a path (only file name), it loads from same package as the class, if you prepend with "/" it uses the given path as full package name. In contrast, if you directly use the ClassLoader (not Class), you must give a full path, but without a /.

        Show
        Uwe Schindler added a comment - - edited The rule is simple: Never prepend /. For tests we added a special case and that may confuse here: ClasspathResourceLoader with a class in ctor param, there you can pass in the base class from which the packages are load. This is added to make writing tests easy: You can pass in a plain file name and it is loaded from the package of the corresponding class. This is to mimic what we always had in our tests: Loading local class resources: // this will load the given file from this ' class package : new ClassPathResourceLoader(getClass()).openResource( "file.txt" ); Code like Solr uses FileSystemResourceLoader that wants relative path to the local working directory or uses the classloader, but thats for Solr and other applications like ElasticSearch. Tests should use ClasspathResourceLoader(getClass()) and only pass a file name fro their own package. Yes, we should remove this /-stuff!!! We can do nothing here, the confusion is created by Java's API by itsself: If you call Class.getResource() without a path (only file name), it loads from same package as the class, if you prepend with "/" it uses the given path as full package name. In contrast, if you directly use the ClassLoader (not Class), you must give a full path, but without a /.
        Hide
        Uwe Schindler added a comment -

        Can you please remove the workaround inside the factory (prepending "/")? This will break non-tests (e.g. when those classes are loaded from file system). Just only load classes from the local package of the class that was passed into the resourceloader.

        Yes, we should remove this /-stuff!!!

        I misunderstood that. I thought you wanted to fix something else. YES, PLEASE REMOVE, it may break non-tests!

        Show
        Uwe Schindler added a comment - Can you please remove the workaround inside the factory (prepending "/")? This will break non-tests (e.g. when those classes are loaded from file system). Just only load classes from the local package of the class that was passed into the resourceloader. Yes, we should remove this /-stuff!!! I misunderstood that. I thought you wanted to fix something else. YES, PLEASE REMOVE, it may break non-tests!
        Hide
        Steve Rowe added a comment -

        Can you please remove the workaround inside the factory (prepending "/")? This will break non-tests (e.g. when those classes are loaded from file system).

        Yes, we should remove this /-stuff!!!

        I misunderstood that. I thought you wanted to fix something else. YES, PLEASE REMOVE, it may break non-tests!

        Yes, I'll fix - patch forthcoming.

        Show
        Steve Rowe added a comment - Can you please remove the workaround inside the factory (prepending "/")? This will break non-tests (e.g. when those classes are loaded from file system). Yes, we should remove this /-stuff!!! I misunderstood that. I thought you wanted to fix something else. YES, PLEASE REMOVE, it may break non-tests! Yes, I'll fix - patch forthcoming.
        Hide
        Steve Rowe added a comment -

        I just retried removing the package from the path for resource that it's in the same package as the test class, and it now works (why did I think it didn't? I thought I tried that...)

        I now understand why this didn't previously appear to be working - I had been experimenting with substituting superclasses of the test class in the ClasspathResourceLoader ctor, and these were not in the same package as the test class itself, so of course resource loading was failing.

        Show
        Steve Rowe added a comment - I just retried removing the package from the path for resource that it's in the same package as the test class, and it now works (why did I think it didn't? I thought I tried that...) I now understand why this didn't previously appear to be working - I had been experimenting with substituting superclasses of the test class in the ClasspathResourceLoader ctor, and these were not in the same package as the test class itself, so of course resource loading was failing.
        Hide
        Steve Rowe added a comment -

        Patch with /-insanity expunged with prejudice.

        Added standard factory docs with example Solr configs.

        Added CHANGES.txt entry.

        I think it's ready to go.

        Show
        Steve Rowe added a comment - Patch with / -insanity expunged with prejudice. Added standard factory docs with example Solr configs. Added CHANGES.txt entry. I think it's ready to go.
        Hide
        Robert Muir added a comment -

        +1

        Should we stick with test-files or just drop the test rbbis in src/test (i think this is what we normally do).

        In any case, thanks a lot for bringing this to completion!

        Show
        Robert Muir added a comment - +1 Should we stick with test-files or just drop the test rbbis in src/test (i think this is what we normally do). In any case, thanks a lot for bringing this to completion!
        Hide
        Steve Rowe added a comment -

        Should we stick with test-files or just drop the test rbbis in src/test (i think this is what we normally do).

        Good idea, I'll move the rbbi files to the standard location - because of the same-package-for-tests requirement, no reason to have a separate location for test resources.

        Show
        Steve Rowe added a comment - Should we stick with test-files or just drop the test rbbis in src/test (i think this is what we normally do). Good idea, I'll move the rbbi files to the standard location - because of the same-package-for-tests requirement, no reason to have a separate location for test resources.
        Hide
        Steve Rowe added a comment -

        Patch with the rbbi files moved from src/test-files/ to src/test/.

        Committing shortly.

        Show
        Steve Rowe added a comment - Patch with the rbbi files moved from src/test-files/ to src/test/ . Committing shortly.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Steven Rowe
        http://svn.apache.org/viewvc?view=revision&revision=1416617

        SOLR-4123: Add per-script customizability to ICUTokenizerFactory via rule files in the ICU RuleBasedBreakIterator format.

        Show
        Commit Tag Bot added a comment - [trunk commit] Steven Rowe http://svn.apache.org/viewvc?view=revision&revision=1416617 SOLR-4123 : Add per-script customizability to ICUTokenizerFactory via rule files in the ICU RuleBasedBreakIterator format.
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Steven Rowe
        http://svn.apache.org/viewvc?view=revision&revision=1416629

        SOLR-4123: Add per-script customizability to ICUTokenizerFactory via rule files in the ICU RuleBasedBreakIterator format. (merged trunk r1416617)

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Steven Rowe http://svn.apache.org/viewvc?view=revision&revision=1416629 SOLR-4123 : Add per-script customizability to ICUTokenizerFactory via rule files in the ICU RuleBasedBreakIterator format. (merged trunk r1416617)
        Hide
        Steve Rowe added a comment -

        Committed to trunk and branch_4x.

        Thanks Shawn and Robert!

        Show
        Steve Rowe added a comment - Committed to trunk and branch_4x. Thanks Shawn and Robert!
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Steven Rowe
        http://svn.apache.org/viewvc?view=revision&revision=1416629

        SOLR-4123: Add per-script customizability to ICUTokenizerFactory via rule files in the ICU RuleBasedBreakIterator format. (merged trunk r1416617)

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Steven Rowe http://svn.apache.org/viewvc?view=revision&revision=1416629 SOLR-4123 : Add per-script customizability to ICUTokenizerFactory via rule files in the ICU RuleBasedBreakIterator format. (merged trunk r1416617)

          People

          • Assignee:
            Steve Rowe
            Reporter:
            Shawn Heisey
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development