Solr / SOLR-1657 convert the rest of solr to use the new tokenstream API / SOLR-1677

Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility with old indexes created using older versions of Lucene. The most important example is StandardTokenizer, which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in 2.9.

      In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9, the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.

      This patch adds basic support for the Lucene Version property to the base factories. Subclasses can then use the decoded luceneMatchVersion enum (in 3.0) / Parameter (in 2.9) for constructing TokenStreams. The code currently contains a helper map to decode the version strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a Java 5 enum there. The default value is Version.LUCENE_24 (as this is the default for the no-version ctors in Lucene).
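      The two decoding strategies mentioned above can be sketched roughly as follows. This is only an illustration, not the code from the patch: MatchVersion is a hypothetical stand-in for o.a.lucene.util.Version (so the sketch compiles without Lucene), and the method names and the case-insensitive matching are assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class VersionParsingSketch {
    // Hypothetical stand-in for org.apache.lucene.util.Version.
    enum MatchVersion { LUCENE_20, LUCENE_24, LUCENE_29, LUCENE_30 }

    // Default mirrors the behaviour of the no-Version constructors described above.
    static final MatchVersion DEFAULT = MatchVersion.LUCENE_24;

    // Helper-map style: an explicit name -> constant lookup table, built once.
    private static final Map<String, MatchVersion> BY_NAME = new HashMap<>();
    static {
        for (MatchVersion v : MatchVersion.values()) {
            BY_NAME.put(v.name(), v);
        }
    }

    // Decode via the lookup table; a missing value falls back to the default.
    static MatchVersion parseWithMap(String s) {
        if (s == null) {
            return DEFAULT;
        }
        MatchVersion v = BY_NAME.get(s.trim().toUpperCase(Locale.ROOT));
        if (v == null) {
            throw new IllegalArgumentException("Unknown version: " + s);
        }
        return v;
    }

    // Enum style: valueOf does the lookup and throws
    // IllegalArgumentException for unknown names.
    static MatchVersion parseWithValueOf(String s) {
        if (s == null) {
            return DEFAULT;
        }
        return MatchVersion.valueOf(s.trim().toUpperCase(Locale.ROOT));
    }
}
```

      Both variants return the same constant for a known name; the map variant only matters where the version type is not yet an enum.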

      This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

      1. SOLR-1677.patch
        6 kB
        Uwe Schindler
      2. SOLR-1677.patch
        7 kB
        Uwe Schindler
      3. SOLR-1677.patch
        6 kB
        Uwe Schindler
      4. SOLR-1677.patch
        17 kB
        Uwe Schindler
      5. SOLR-1677-lucenetrunk-branch.patch
        25 kB
        Uwe Schindler
      6. SOLR-1677-lucenetrunk-branch-2.patch
        9 kB
        Uwe Schindler
      7. SOLR-1677-lucenetrunk-branch-3.patch
        5 kB
        Uwe Schindler

        Activity

        Uwe Schindler added a comment -

        Patch.

        I did not go through all factories, so maybe more need to be upgraded for matchVersion when switching to Lucene 3.0.

        Robert Muir added a comment -

        Hello Uwe, I would like to be able to specify the default, at some global level, for all tokenstreams.

        for example, if i was setting up a new solr configuration, i would want to say 'give me 3.1 support for all tokenstreams by default' ?

        Uwe Schindler added a comment -

        Better patch:

        • more dynamic Version map creation
        • improved warning message copied from Lucene's Javadocs on Version.LUCENE_CURRENT.
        Uwe Schindler added a comment -

        Quoting Robert Muir:
            for example, if i was setting up a new solr configuration, i would want to say 'give me 3.1 support for all tokenstreams by default' ?

        I have no idea how to define global properties in schema.xml that apply to all factories. If this is possible, the LUCENE_24 else clause and the default value can be changed to the global default (which itself defaults to Version.LUCENE_24). In that case the parser map (for Lucene 2.9/Java 1.4) for the version enum should also move to a more central place.

        Uwe Schindler added a comment -

        Fixed a problem in one test: the English stop word set is unmodifiable, so the test now copies it.
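        A minimal sketch of that kind of fix, using plain java.util collections rather than Lucene's CharArraySet. The set contents and all names here are made up for illustration:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class StopWordCopySketch {
    // Hypothetical stand-in for a shared, read-only default stop word set.
    static final Set<String> DEFAULT_STOP_WORDS =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList("a", "an", "the")));

    // Returns a private, mutable copy that a test can extend safely.
    static Set<String> mutableCopy(Set<String> source) {
        return new HashSet<>(source);
    }

    // Reports whether adding to the set is rejected.
    // Note: on a mutable set this really does add the probe word.
    static boolean rejectsMutation(Set<String> set) {
        try {
            set.add("probe");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }
}
```

        Copying keeps the shared default untouched while the test adds its own words to the copy.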

        Uwe Schindler added a comment -

        New patch with some schema and config hacking. Also new test:

        • As a first hack, the solrConfig schema has a new element <luceneMatchVersion> that holds a Solr-wide default luceneMatchVersion value, used by QueryParser and Analyzers unless they specify a different one
        • On the analyzer side, BaseTokenizerFactory and BaseTokenFilterFactory now implement SolrCoreAware (and I also allowed these classes to be SolrCoreAware) and get the SolrConfig.
        • Both classes now use the default if the version is not set locally as a param (as in the last patch), but the default now comes from SolrConfig
        • The parser for config strings was moved to Config
        • Other components like QueryParserFactories can get the default matchVersion in the same way
        • The default is LUCENE_24 as before.

        This is a first idea of how it could work. Open points:

        • should the default be in SolrConfig or in IndexConfig?
        • I did not change the config.xsd file to reflect my change, as this is still open for discussion
        • all other example config files and schemas should use the default Lucene version shipped with the Solr release (currently 2.9). That way users who upgrade keep the last Lucene version their index is compatible with, and new users get the latest config.
        • If users upgrade the default luceneMatchVersion, they may have to reindex (esp. when upgrading to LUCENE_31 soon, because of the new Unicode features in all Tokenizers/Filters)
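        The lookup order described in this comment (local param first, then the config-wide <luceneMatchVersion>, then LUCENE_24) could look roughly like the sketch below. It is not the patch itself: MatchVersion is a hypothetical stand-in for Lucene's Version, and the "luceneMatchVersion" attribute key and the resolve method are assumptions.

```java
import java.util.Map;

class MatchVersionResolutionSketch {
    // Hypothetical stand-in for org.apache.lucene.util.Version.
    enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

    // Last-resort default, as stated in the comment above.
    static final MatchVersion HARD_DEFAULT = MatchVersion.LUCENE_24;

    // Assumed name of the per-factory attribute.
    static final String ARG_KEY = "luceneMatchVersion";

    // factoryArgs:   attributes declared on one tokenizer/filter factory
    // configDefault: text of the config-wide <luceneMatchVersion> element,
    //                or null when the element is absent
    static MatchVersion resolve(Map<String, String> factoryArgs, String configDefault) {
        String local = factoryArgs.get(ARG_KEY);
        if (local != null) {
            return MatchVersion.valueOf(local);          // 1. local override
        }
        if (configDefault != null) {
            return MatchVersion.valueOf(configDefault);  // 2. config-wide default
        }
        return HARD_DEFAULT;                             // 3. backwards-compatible fallback
    }
}
```

        With this order, an upgraded installation whose config lacks the element still resolves to LUCENE_24, while a fresh example config can pin a newer value in one place.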
        Hoss Man added a comment -
        Quoting Uwe Schindler:
            • As a first hack, the solrConfig schema has a new element <luceneMatchVersion> that holds a Solr-wide default luceneMatchVersion value, used by QueryParser and Analyzers unless they specify a different one
            • On the analyzer side, BaseTokenizerFactory and BaseTokenFilterFactory now implement SolrCoreAware (and I also allowed these classes to be SolrCoreAware) and get the SolrConfig.

        I'd really prefer that nothing like this make it into solr.

        One: we've worked pretty hard to make sure that nothing in the analysis code is SolrCoreAware – the goal was to try and keep the schema related code reusable w/o risk of factories adding tendrils that reach deep into the other solr code (it's only a matter of time until someone starts refactoring all of the schema related code out of Solr and into a Lucene contrib).

        If we really want to add a new "global" setting for the default match version, it should be in schema.xml, as it pertains to the index itself and how to read/write to the index "properly" and not to the particularities of how a particular solr installation might be using that data (schema.xml => the nature of the data; solrconfig.xml => the usage of the data).

        Two: I really question the need for a configurable default across all analysis factories. This seems like the type of thing that's going to be changed rarely if ever, and when it is changed each field will need to be considered very carefully to decide whether the "new" behavior is desired over the "old".

        I suspect the only time anyone is going to upgrade all factories at once is when we rev lucene jars and update the example configs – in that case (and in the case of a user who is happy to blow away all of their data and take the newest, regardless of what it is, for every analyzer) a search and replace seems perfectly appropriate.

        Mark Miller added a comment -

        Quoting Hoss Man:
            it should be in schema.xml, as it pertains to the index itself and how to read/write to the index "properly" and not to the particularities of how a particular solr installation might be using that data

        Is that true? Many times so far, but Version is not limited to such things. It can be used for far more than how to read/write the index properly.

        Erik Hatcher added a comment -

        Another comment on this... Solr also supports using an Analyzer directly, but only one with a zero-arg constructor. It would be nice if this Version support also allowed Analyzers (say, SmartChineseAnalyzer) to be used directly. I don't think this patch accounts for this case, does it?

        Uwe Schindler added a comment -

        Thanks for the hint. This means Solr instantiates the analyzer via reflection using the zero-arg ctor, which is no longer available. So with Lucene 3.0 it will no longer work at all. As I do not have much experience with hacking Solr, I did not notice this.

        In my own project I have the same mechanism; there I inspect the loaded class via reflection and use the ctor that takes a Version if one is available, otherwise the empty ctor.
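        A rough sketch of that reflection fallback, under stated assumptions: MatchVersion stands in for Lucene's Version, the two analyzer-like classes are invented for the example, and this is not the code from the project mentioned above.

```java
class AnalyzerReflectionSketch {
    // Hypothetical stand-in for org.apache.lucene.util.Version.
    public enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Invented example class that only offers a Version-taking constructor.
    public static class VersionedAnalyzer {
        public final MatchVersion version;
        public VersionedAnalyzer(MatchVersion version) {
            this.version = version;
        }
    }

    // Invented example class that only offers a zero-arg constructor.
    public static class LegacyAnalyzer {
        public LegacyAnalyzer() {
        }
    }

    // Prefer the public (MatchVersion) constructor; fall back to the public
    // zero-arg constructor when the class does not declare one.
    static <T> T instantiate(Class<T> clazz, MatchVersion version) {
        try {
            try {
                return clazz.getConstructor(MatchVersion.class).newInstance(version);
            } catch (NoSuchMethodException noVersionCtor) {
                return clazz.getConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
        }
    }
}
```

        A class that offers neither constructor ends up in the outer catch and is reported as an IllegalStateException.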

        Hoss Man added a comment -

        Quoting Mark Miller:
            Is that true? Many times so far, but Version is not limited to such things. It can be used for far more than how to read/write the index properly.

        Perhaps, but that would be a very different usage ... even if Lucene-Java uses the same o.a.l.util.Version class for driving Analyzers/Tokenizers/TokenFilters and IndexWriters/MergeScheduler/QueryParser ... but those are very different things in Solr land ... in a replication setup, two different instances might use very different "Version" values for the IndexWriter/MergeScheduler/QueryParser (configured in solrconfig.xml) but they should have identical schema.xml files and identical (versioned) analyzer settings.

        But as i said: i don't see any compelling need for a "schema global" Version anyway (let alone an instance wide global that applies to both solrconfig.xml and schema.xml)

        Uwe Schindler added a comment -

        The problem is the default value. If you leave out the version parameter instance-wise, you will get 2.4. And because of that all solr users will get stuck with that version and will never upgrade (because they leave the default and do not specify a different value). Because of backwards compatibility, we are limited to this version number as default value.

        The schema/config global version is the global default used by all instances that do not specify a different value. That way we can ship the default solrconfig/schema.xml with the latest possible Lucene version, while users who upgrade keep their default value.

        I repeat: with instance-wise config, nobody will ever use it for new analyzers. With a global default, there is only one place that sets the version, which is also valid for user-added tokenizer chains.

        For the SolrCore problem: for analyzers the idea is that the default Version constant is automatically passed to all tokenizers in the param map. Local values overwrite the key in the map. But this would only apply to the analyzers. Usages of Version in other places (QP, IW) still need SolrCore. But we can move the SolrCoreAware to the schema classes and not make every TokenFilter/Tokenizer SolrCoreAware.
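        The param-map idea from the last paragraph might be sketched like this. The "luceneMatchVersion" key and the effectiveArgs helper are assumptions made for illustration; nothing here is taken from the actual patch.

```java
import java.util.HashMap;
import java.util.Map;

class DefaultVersionArgsSketch {
    // Assumed name of the version key in the factory param map.
    static final String KEY = "luceneMatchVersion";

    // Builds the param map handed to one factory: the schema-level default
    // goes in first, then the factory's own attributes overwrite it.
    static Map<String, String> effectiveArgs(String schemaDefault, Map<String, String> localArgs) {
        Map<String, String> args = new HashMap<>();
        if (schemaDefault != null) {
            args.put(KEY, schemaDefault);
        }
        args.putAll(localArgs); // local values win on key collisions
        return args;
    }
}
```

        Because the default travels inside the map, the factories only read their own params and do not need to be SolrCoreAware.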

        Robert Muir added a comment -

        Quoting Hoss Man:
            But as i said: i don't see any compelling need for a "schema global" Version anyway (let alone an instance wide global that applies to both solrconfig.xml and schema.xml)

        Just like Uwe says, this is the problem with having no default.

        If the default Version is going to be 2.4, I would like a global setting so that I get bugfixes and improvements, because a few things have happened to this code since 2.4.

        I also do not want to list it 10,000 times, but it's not enough to make the default Version the latest to fix this problem.

        I want my config to be wired to '2.9' or whatever, so that when upgrading, everything continues to work. Why are you so against a default value?

        Hoss Man added a comment -

        Quoting Uwe Schindler:
            The problem is the default value. If you leave out the version parameter instance-wise, you will get 2.4. And because of that all solr users will get stuck with that version and will never upgrade (because they leave the default and do not specify a different value).

        That feels like a misleading statement ... the "Version" property on these objects is really more about getting the "recommended" behavior as of a particular version of Lucene ... saying that users will be "stuck with that version" is like saying users will be "stuck with StandardAnalyzer" instead of getting "NewHotnessAnalyzer" because they have to edit their config to use the newer/better analyzer – Lucene-Java has opted to use a Version property on existing classes instead of adding new classes, but it's still conceptually the same thing: they get the behavior they've always gotten, unless they change their config to get something different.

        Besides which: 99.9% of Solr users copy the example config when they first start using Solr: we can set a "version" property on every Analyzer/Factory used in the example schema.xml and update them all when we upgrade the Lucene jars just as easily as we can update a single "global" value (it's a search+replaceAll instead of a search+replace)

        Quoting Robert Muir:
            Why are you so against a default value?

        My concern is that it introduces action at a distance – and not in a good way.

        Here's the scenario that seems guaranteed to happen quite a bit if we add some new <luceneAnalyzerVersionDefault/> syntax to schema.xml...

        <luceneAnalyzerVersionDefault>2.9</luceneAnalyzerVersionDefault> is added to the example schema.xml, and users start using it as a result of copying/modifying the example configs. Time passes, new bugs are fixed, and the example configs evolve to contain <luceneAnalyzerVersionDefault>3.4</luceneAnalyzerVersionDefault>

        A little while after that, User Bob emails solr-user with a question like...

        Hey, I'm using FooTokenFilterFactory and i noticed that at query time i see behaviorX when it really seems like i should see BehaviorY

        User Carl helpfully replies...

        That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1, but the default behavior was left as is for backcompatibility. If you change your <luceneAnalyzerVersionDefault/> value to 3.1 (or 3.2) you'll get the newer/better behavior – but if you used FooTokenFilterFactory in an index analyzer you'll need to reindex.

        Bob makes the change to 3.2 that Carl recommended, and is happy to see that his queries now work. He only uses FooTokenFilterFactory at query time, so he doesn't bother to reindex, and everything seems fine.

        What Bob doesn't realize (and what Carl wasn't aware of) is that elsewhere in his schema.xml file, Bob is also using the YakTokenizerFactory on a different field (yakField), and the behavior of the YakTokenizer changed in Lucene 3.0. Now some documents/queries that use yakField are failing – and failing silently.

        Things just get a lot simpler when all of the configuration for an Analyzer, TokenizerFactory, or Tokenizer is explicit in its declaration – indirect initialization is fine, as long as it's obvious. I.e.: <field/> declarations referencing fieldTypes by name – It's easy to fuck up a bunch of fields by making a single change to one fieldType, but at least you can grep for the name of the fieldType to see all the fields you are affecting.

        Even if "Carl" knows/remembers to warn "Bob" that changing <luceneAnalyzerVersionDefault/> might change/break other things in his schema.xml the situation doesn't get much better: unless Bob (or Carl) skims the code for every Analyzer, Tokenizer, and TokenFilter used in Bob's schema, they can't be sure what might get affected by making a small increase to the "global" luceneAnalyzerVersion setting ... which means the only safe thing for Bob to do is to set the property individually in the one place he really wants to make the change.

        So why have the "global" in the first place? It really just seems like more trouble than it's worth.

        Robert Muir added a comment -

        Quoting Hoss Man:
            User Carl helpfully replies...

            That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1, but the default behavior was left as is for backcompatibility. If you change your <luceneAnalyzerVersionDefault/> value to 3.1 (or 3.2) you'll get the newer/better behavior - but if you used FooTokenFilterFactory in an index analyzer you'll need to reindex.

        User Carl isn't helpful, user Carl is an idiot.

        The javadoc of Version in lucene clearly says:

         * <p><b>WARNING</b>: When changing the version parameter
         * that you supply to components in Lucene, do not simply
         * change the version at search-time, but instead also adjust
         * your indexing code to match, and re-index.
        

        User Carl could also tell Bob that it's OK to index with ArabicAnalyzer and query with ChineseAnalyzer. This kind of stupid theoretical situation isn't any kind of valid logical argument against having a default value for this.

        Hoss Man added a comment -

        Quoting Robert Muir:
            User Carl isn't helpful, user Carl is an idiot.

        Oh come on now ... that's not really a fair criticism of the example: there are plenty of legitimate ways to use (some) TokenFilters only at search time and I specifically structured my example to point out potential problems in cases just like that – Carl was very clear that "if you used FooTokenFilterFactory in an index analyzer you'll need to reindex."

        But fine, I'll amend my example to do it your way...

        ...
        Bob Asks his question (see previous example)

        User Carl is on vacation and never sees Bob's email

        User Dwight helpfully replies...

        That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1, but the default behavior was left as is for backcompatibility. If you change your <luceneAnalyzerVersionDefault/> value to 3.1 (or 3.2) you'll get the newer/better behavior - but you must reindex all of your data after you make this change.

        Bob makes the change to 3.2 that Carl recommended, reindexes all of his data, and is happy to see now his queries work and every thing seems fine.

        What Bob doesn't realize (and what Carl wasn't aware of) is that elsewhere in his schema.xml file, Bob is also using the YakTokenizerFactory on a differnet field (yakField), and the behavior of the YakTokenizer changed in Lucene 3.0. This change is generally considered "better" behavior then YakTokenizer had before, but in combination with another TokenFilter Bob is using on the yakField it causes behavior that is not what Bob wants. Now some types of queries that use the yakField are failing, and failing silently.

        You could now argue that User Dwight is an idiot because he didn't warn Bob that other Analyzers/Tokenizers/TokenFilters might be affected. But that just leads us to scenarios that reiterate my point that this type of global value is something that would be dangerous to ever change....

        ...
        Bob Asks his question (see previous examples)

        User Carl has unsubscribed from the solr-user list (because a Bill Murray look-a-like hurt his feelings) and never sees Bob's email.

        User Dwight is on vacation and never sees Bob's email.

        User Ernest helpfully replies...

        That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1, but the default behavior was left as is for backcompatibility. If you change your <luceneAnalyzerVersionDefault/> value to 3.1 (or 3.2) you'll get the newer/better behavior – But this is Very VERY Dangerous: It could potentially affect the behavior of other analyzers you are using. You need to check the javadocs for each and every Analyzer, Tokenizer, and TokenFilter you use to see what their behavior is with various values of the Version property before you make a change like this.

        Personally I never change the value of <luceneAnalyzerVersionDefault/> once i have an existing schema.xml file. Instead i suggest you add luceneVersion="3.2" to your <filter class="solr.FooTokenFilterFactory" /> declaration so that you know you are only changing the behavior you want to change.

        BTW: You must reindex all of your data after doing either of these things in order for it to work.

        Bob follows Ernest's advice, and everything is fine .. but Bob is left wondering what the point is of a config option that's so dangerous to change, and wishes there was an easy way to know which of his Analyzers and Factories are depending on that scary "global" value.

        At the end of the day it just seems like a bigger risk than a feature ... I feel like i must still be misunderstanding the motivation you guys have for adding it, because it really seems like it boils down to "easier than having the property 2.9 set on every analyzer/factory"

        I guess i ultimately have no stringent objection to a global schema.xml setting like this existing as an expert level feature (for people who want really compact config files i guess), I just don't want to see it used in the example schema.xml file(s) where it's likely to screw novice users over.

        Robert Muir added a comment -

        Oh come on now ... that's not really a fair criticism of the example: there are plenty of legitimate ways to use (some) TokenFilters only at search time and I specifically structured my example to point out potential problems in cases just like that - Carl was very clear that "if you used FooTokenFilterFactory in an index analyzer you'll need to reindex."

        I disagree, Version applies to all of lucene (even more than tokenstreams), so for Carl to imply that you don't need to reindex by bumping Version simply because you aren't using X or Y or Z, for that he should be renamed Oscar.

        You could now argue that User Dwight is an idiot because he didn't warn Bob that other Analyzers/Tokenizers/TokenFilters might be affected. But that just leads us to scenarios that reiterate my point that this type of global value is something that would be dangerous to ever change....

        Yeah, I guess I don't think he is an idiot. I just think he is a moron for suggesting such a thing without warning of the consequences.

        Personally I never change the value of <luceneAnalyzerVersionDefault/> once i have an existing schema.xml file. Instead i suggest you add luceneVersion="3.2" to your <filter class="solr.FooTokenFilterFactory" /> declaration so that you know you are only changing the behavior you want to change.

        Good for Ernest, i guess he is probably using Windows 3.1 still too because he doesn't want to upgrade ever. Unless Ernest carefully reads Lucene CHANGES also and reads all the Solr source code and knows which solr features are tied to which lucene features, because it's not obvious at all: i.e. solr's snowball factory doesn't use lucene's snowball, etc etc.

        At the end of the day it just seems like a bigger risk than a feature ... I feel like i must still be misunderstanding the motivation you guys have for adding it, because it really seems like it boils down to "easier than having the property 2.9 set on every analyzer/factory"

        Yes you are right, personally I don't want all users to be stuck with Version.LUCENE_24 forever.

        Uwe Schindler added a comment - - edited

        In my opinion, it should be possible to set the default in solrconfig.xml, because there is currently no requirement to set a version for all TS components. In the shipped solrconfig.xml, this default is the version of the shipped lucene, so new users can use the default config and extend it as taught in all courses and books about solr. They do not need to care about the version.

        If they upgrade their lucene version, their config stays on the previous setting and they are fine. If they want to change some of the components (like query parser, index writer, index reader – flex!!!), they can do it locally. So Bob could change like Ernest proposed.

        If we do not have a default, all users will keep stuck with lucene 2.4, because they do not care about version (it is not required, because it defaults to 2.4 for BW compatibility). So lots of configs will never use the new unicode features of Lucene 3.1. And suddenly Lucene 4.0 comes out and all support for Lucene < 3 is removed, then all users cry. With a default version set to 2.4, they will then get a runtime error in Lucene 4.0, saying that Version.LUCENE_24 is no longer available as enum constant.

        If you really do not want to have a default version in config (not schema, because it applies to all lucene components), then you should go the way like Lucene 3.0: Require a matchVersion for all components. As there may be tokenstream components not from lucene, make this attribute in the schema mandatory only for lucene-streams. This can be done by my initial patch, too: if the matchVersion property is missing then the matchVersion will be NULL and the factory should throw IAE if it requires one. In my original patch, only the parsing code would need to be moved out of the factory into a util class in solr. It may also be possible to parse "x.y"-style versions.
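
The parsing helper described above could look roughly like the following. This is only a sketch under stated assumptions: `MatchVersion` is a hypothetical stand-in for `org.apache.lucene.util.Version` (so the example compiles without Lucene on the classpath), and `MatchVersionParser`, `parse` and `require` are illustrative names, not code from the attached patches.

```java
import java.util.Locale;

/** Hypothetical stand-in for org.apache.lucene.util.Version, so the sketch compiles on its own. */
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

/** Hypothetical util class that keeps version parsing out of the factories. */
final class MatchVersionParser {
    private MatchVersionParser() {}

    /**
     * Accepts either the enum name ("LUCENE_29") or the "x.y" form ("2.9").
     * Returns null when the attribute is absent, leaving the decision to the caller.
     */
    static MatchVersion parse(String raw) {
        if (raw == null) {
            return null;
        }
        String name = raw.trim().toUpperCase(Locale.ROOT);
        // Rewrite "2.9" as "LUCENE_29" before the enum lookup.
        if (name.matches("\\d+\\.\\d+")) {
            name = "LUCENE_" + name.replace(".", "");
        }
        try {
            return MatchVersion.valueOf(name);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown luceneMatchVersion: '" + raw + "'", e);
        }
    }

    /** For factories that cannot work without a version: a missing attribute is an error. */
    static MatchVersion require(String raw, String factoryName) {
        MatchVersion version = parse(raw);
        if (version == null) {
            throw new IllegalArgumentException(factoryName + " requires a luceneMatchVersion attribute");
        }
        return version;
    }
}
```

Factories for non-Lucene token streams would call `parse` and tolerate `null`, while factories that wrap Lucene components would call `require`.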

        The problem here: Users upgrading from solr 1.4 will suddenly get errors, because their configs become invalid. Ahh, and because they are stupid they add LUCENE_29 (how should they know that Solr 1.4 used Lucene 2.4 compatibility?). And then the mailing list gets flooded by questions because suddenly the configs fail to produce results with old indexes.

        Hoss Man added a comment -

        Version applies to all of lucene (even more than tokenstreams), so for Carl to imply that you don't need to reindex by bumping Version simply because you aren't using X or Y or Z, for that he should be renamed Oscar.

        Ok, fair enough ... i was supposing in that example that since i called it <luceneAnalyzerVersionDefault/> it was clearly specific to analysis objects in schema.xml and didn't affect any of the other things Version is used for (which would be specified in solrconfig.xml)

        i guess he is probably using Windows 3.1 still too because he doesn't want to upgrade ever.

        No, he uses an OS where he can upgrade individual things individually with clear implications – he sets luceneMatchVersion="2.9" on each and every <analyzer/>, <tokenizer/> and <filter/> that he declares in his schema so that he knows exactly what behavior is changing when he modifies any of them.

        personally I don't want all users to be stuck with Version.LUCENE_24 forever.

        I still must be missing something? ... why would all users be stuck with Version.LUCENE_24 forever?

        I'm not advocating that we don't allow a way to specify Version, i'm saying that having a global value for it that affects things opaquely sounds dangerous – we should certainly have a way for people to specify the Version they want on each of the objects that care, but it shouldn't be global. The "luceneMatchVersion" property that Uwe added to BaseTokenizerFactory and BaseTokenFilterFactory in his patch seems perfect to me, it's just the SolrCoreAware / core.getSolrConfig().luceneMatchVersion that i think is a bad idea.

        If we modify the <analyzer/> initialization to allow constructor args as Erik suggested (I'm pretty sure there's already code in Solr to do this, we just aren't using it for Analyzers) then we should be good to go for everything in schema.xml

        If anything declared in solrconfig.xml starts caring about Version (QParser, SolrIndexWriter, etc...) then likewise it should get a "luceneMatchVersion" init property as well. No one will ever be "stuck" with LUCENE_24, but they won't be surprised by behavior changes either.

        If we do not have a default, all users will keep stuck with lucene 2.4, because they do not care about version (it is not required, because it defaults to 2.4 for BW compatibility). So lots of configs will never use the new unicode features of Lucene 3.1.

        I don't believe that. Almost every solr user on the planet starts with the example configs. if the example configs start specifying "luceneMatchVersion=2.9" on every analyzer and factory then people will care about Version just as much as they care about the stopwords.txt file that ships with solr – that may be not at all, or it may be a lot, but it will be up to them, and it will be obvious to them, because it's right there in the declaration where they can see it, and easy for them to reference and recognize that changing that value will affect things.

        If you really do not want to have a default version in config (not schema, because it applies to all lucene components), then you should go the way like Lucene 3.0: Require a matchVersion for all components.

        I'm totally on board with that idea in the long run – but there are ways to get there gradually that are back compatible with existing configs. Individual factories that care about luceneMatchVersion should absolutely start warning on startup, when it is unset (or doesn't match the current value of Version.LUCENE_CURRENT), that users should set luceneMatchVersion because newer/better behavior may be available, and provide a URL for a wiki page somewhere where more detail is available. The Analyzer init code can do likewise if it sees an <analyzer class=.../> being inited w/ a constructor that takes in a "Version" which is using an "old" value.
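
A startup warning of the kind suggested here might be sketched as follows. All names are assumptions made for illustration: `MatchVersion` stands in for Lucene's `Version`, `NEWEST` stands in for "the current version", and `VersionWarningSketch` is not an existing Solr class. The wiki URL mentioned above is left out because no such page is named in the discussion.

```java
import java.util.Map;
import java.util.logging.Logger;

/** Hypothetical stand-in for org.apache.lucene.util.Version. */
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

/** Hypothetical init helper that warns instead of failing, so existing configs keep working. */
final class VersionWarningSketch {
    static final MatchVersion LEGACY_DEFAULT = MatchVersion.LUCENE_24;
    static final MatchVersion NEWEST = MatchVersion.LUCENE_31;
    private static final Logger LOG = Logger.getLogger(VersionWarningSketch.class.getName());

    private VersionWarningSketch() {}

    /** Returns the warning text, or null when the config already asks for the newest behavior. */
    static String warningFor(String factoryName, Map<String, String> args) {
        String raw = args.get("luceneMatchVersion");
        if (raw == null) {
            return factoryName + ": luceneMatchVersion is unset, using " + LEGACY_DEFAULT
                    + "; newer behavior may be available";
        }
        MatchVersion requested = MatchVersion.valueOf(raw);
        if (requested != NEWEST) {
            return factoryName + ": luceneMatchVersion=" + requested
                    + " is older than " + NEWEST + "; newer behavior may be available";
        }
        return null;
    }

    /** Resolves the version a factory should use and logs a warning when one applies. */
    static MatchVersion init(String factoryName, Map<String, String> args) {
        String warning = warningFor(factoryName, args);
        if (warning != null) {
            LOG.warning(warning);
        }
        String raw = args.get("luceneMatchVersion");
        return raw == null ? LEGACY_DEFAULT : MatchVersion.valueOf(raw);
    }
}
```

The point of the design is that an old config never breaks: it keeps the legacy behavior and only gains a log line telling the user that a choice exists.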

        Robert Muir added a comment -

        No, he uses an OS where he can upgrade individual things individually with clear implications - he sets luceneMatchVersion="2.9" on each and every <analyzer/>, <tokenizer/> and <filter/> that he declares in his schema so that he knows exactly what behavior is changing when he modifies any of them.

        Yeah, but this isn't how Version works in lucene either, please see below

        I'm not advocating that we don't allow a way to specify Version, i'm saying that having a global value for it that affects things opaquely sounds dangerous - we should certainly have a way for people to specify the Version they want on each of the objects that care, but it shouldn't be global. The "luceneMatchVersion" property that Uwe added to BaseTokenizerFactory and BaseTokenFilterFactory in his patch seems perfect to me, it's just the SolrCoreAware / core.getSolrConfig().luceneMatchVersion that i think is a bad idea.

        And I disagree, I think that the per-tokenfilter matchVersion should be the expert use, with the default global Version being the standard use.

        I don't think Version is intended so you can use X.Y on this part and Y.Z on this part and have any chance of anything working, for example it controls position increments on stopfilter but also in queryparser, if you use wacky combinations, things might not work.

        And I personally don't see anyone putting effort into supporting this either, because it's enough to supply the back compat for previous versions, but not some cross product of all possible versions. this is too much. sometimes things interact in ways we cannot detect automatically (such as the query parser phrasequery / stopfilter thing), it's my understanding that things like this are why Version was created in the first place.

        Hoss Man added a comment -

        I don't think Version is intended so you can use X.Y on this part and Y.Z on this part and have any chance of anything working, for example it controls position increments on stopfilter but also in queryparser, if you use wacky combinations, things might not work.

        How is that any different from letting users pass any Analyzer they want to the QueryParser constructor? There's no guarantee that anything will ever work if you do something crazy (like uppercase all terms when indexing, and lowercase all terms when searching) But lucene exposes that to the developer and lets them make the choice – likewise Solr happily lets you configure a query analyzer that's completely different from your index analyzer – if that's what you want, that's what you get: being able to set different Version params should be no different. If the QueryParser you are using says that version=X.Y will only work with StopFilter if it's version=X.Y as well that's fine – but maybe you've solved that problem a completely different way with a completely alternate implementation of StopFilter (that doesn't care about version). The user should be in control.

        sometimes things interact in ways we cannot detect automatically

        which is why i think it's a bad idea to have a global default for this ... there may be situations where people explicitly want different behavior in different instances (ie: in this field i want the legacy 2.4 StopFilter behavior, but in this field i want the current 2.9 stop filter behavior) and having a default will mask the ability to do this, and make it easy to inadvertently break it.

        it's my understanding that things like this are why Version was created in the first place.

        My understanding is vastly different than yours ... All the discussions i remember about it were along the lines of preventing Class proliferation – that people didn't like the idea of creating StandardAnalyzer2 just because StandardAnalyzer had some behavior that was considered buggy but couldn't be removed - so now there is a constructor arg instead, and static constants that let you pick a fixed behavior, or a constant that lets you pick "current" no matter what it is – so applications that always want the "current recommended behavior" can just upgrade a jar and get it.

        But I don't remember any implication that it was expected that every object would have the same Version settings as every other object – if that was the intention then shouldn't there be a standard interface for "Versionable" or "VersionAware" objects so they can test compatibility with one another (ie: QueryParser and Analyzers that might wrap StopFilter) ? ... or a "public static void setCurrentOperatingVersion(Version)" method in the Version class, instead of letting each constructor take in an independent value?


        FWIW: Even though I'm still convinced that having any sort of "global" default value for luceneMatchVersion is a bad idea – and i'm going to keep trying to convince other people as well – I want to make some comments about how i think it should be implemented if we do wind up doing it (just in case i get hit by a bus)

        Making the Base*Factory analysis classes SolrCoreAware is really overkill for this – there was a real conscious choice not to let things declared in schema.xml be SolrCoreAware, because it pulls back the curtain and exposes a lot of plumbing related APIs in a way that could make it hard to refactor away SolrCore functionality later. The list of plugin types that can be made SolrCoreAware is deliberately small, and confined to plugins that are already exposed to the full SolrCore API at some other time in their life cycle – being SolrCoreAware just gives them access to the core during initialization.

        If there is really going to be one uber-default global "luceneMatchVersion" then i think the place it makes the most sense to declare something like this is in the schema.xml – many different solrconfig.xml files might be used with the same schema.xml, so if we're expecting that the "typical" behavior is to set this once and have it just work it should propagate from the IndexSchema object to the SolrCore and not vice-versa.

        My suggestion for how to implement this would be...

        1. Add a new "luceneMatchVersion" attribute to the existing <schema/> tag.
        2. Add a new getLuceneMatchVersion() to the IndexSchema class ... SolrCore can use this to get the default.
        3. When init()ing new objects, include the key=>value pair of "luceneMatchVersion"=>schema.getLuceneMatchVersion() to the init method of the object if it's not already an init param for that particular instance.

        This would eliminate the need to make any of the Analysis Factories SolrCoreAware (or even ResourceLoaderAware) just to know what the luceneMatchVersion should be – the Base*Factories could still contain a protected Version luceneMatchVersion set by the base init() method that subclasses could use as needed.
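
Step 3 of the list above amounts to merging a schema-wide default into each object's init arguments without overriding a value set on that instance. A minimal sketch, assuming init args are a plain `Map<String, String>`; `InitArgsDefaults` and `withSchemaDefault` are hypothetical names, not part of Solr:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper the schema loader could call before init()ing each factory. */
final class InitArgsDefaults {
    static final String KEY = "luceneMatchVersion";

    private InitArgsDefaults() {}

    /**
     * Returns a copy of the instance's init args with the schema-wide default added,
     * but only when the instance did not declare its own luceneMatchVersion.
     */
    static Map<String, String> withSchemaDefault(Map<String, String> instanceArgs, String schemaDefault) {
        Map<String, String> merged = new HashMap<>(instanceArgs);
        if (schemaDefault != null) {
            merged.putIfAbsent(KEY, schemaDefault);
        }
        return merged;
    }
}
```

Because the default arrives as an ordinary init param, the base factory only has to read `luceneMatchVersion` from its args and never needs a reference to the core or the resource loader.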

        NOTE: This still doesn't solve the "Analyzers must have no-arg constructors" part of the issue – but it doesn't make it worse. We can make IndexSchema pass this.getLuceneMatchVersion() to any Analyzer with a single arg "Version" constructor fairly easily. If/When we provide a more general mechanism for passing constructor args to Analyzers, any Version params could be defaulted just like with the factory init() methods.
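
Passing the schema's version to an Analyzer that has a single-arg "Version" constructor could be done with reflection, falling back to the no-arg constructor otherwise. The sketch below is an assumption-laden illustration: `MatchVersion`, `VersionedAnalyzer`, `PlainAnalyzer` and `AnalyzerInstantiator` are hypothetical stand-ins, not Lucene or Solr classes.

```java
import java.lang.reflect.Constructor;

/** Hypothetical stand-in for org.apache.lucene.util.Version. */
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

/** Hypothetical analyzer whose only constructor takes a version. */
class VersionedAnalyzer {
    final MatchVersion matchVersion;
    public VersionedAnalyzer(MatchVersion matchVersion) { this.matchVersion = matchVersion; }
}

/** Hypothetical analyzer with the traditional no-arg constructor. */
class PlainAnalyzer {
    public PlainAnalyzer() {}
}

final class AnalyzerInstantiator {
    private AnalyzerInstantiator() {}

    /** Prefers a public (MatchVersion) constructor; otherwise uses the public no-arg one. */
    static <T> T newInstance(Class<T> clazz, MatchVersion schemaDefault) {
        try {
            try {
                Constructor<T> versioned = clazz.getConstructor(MatchVersion.class);
                return versioned.newInstance(schemaDefault);
            } catch (NoSuchMethodException noVersionConstructor) {
                return clazz.getConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
        }
    }
}
```

Existing analyzers with only a no-arg constructor keep loading exactly as before, which matches the "doesn't make it worse" goal stated above.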

        Robert Muir added a comment -

        which is why i think it's a bad idea to have a global default for this ... there may be situations where people explicitly want different behavior in different instances (ie: in this field i want the legacy 2.4 StopFilter behavior, but in this field i want the current 2.9 stop filter behavior) and having a default will mask the ability to do this, and make it easy to inadvertently break it.

        but this patch does argue for a global default, which is 2.4, it's just hardcoded inside the java code.

        The user should be in control.

        You argue against yourself when you say this, but prevent the user from changing this hardcoded 2.4 default.

        Hoss Man added a comment -

        You argue against yourself when you say this, but prevent the user from changing this hardcoded 2.4 default.

        WTF?!?! ... now i feel like you are just messing with my head.

        I've never argued that the user shouldn't be allowed to change the behavior of any class away from the (hardcoded) 2.4 behavior – i've tried to be very clear that my objection was only to the new "global" default setting that would have action at a distance for all of these Version dependent classes w/o any obvious indication of what it would affect.

        To be as clear as i possibly know how: I am completely in favor of this new syntax added by Uwe's patch...

        src/test/test-files/solr/conf/schema-luceneMatchVersion.xml
          <fieldtype name="text20" class="solr.TextField">
            <analyzer>
              <tokenizer class="solr.StandardTokenizerFactory" luceneMatchVersion="LUCENE_20"/>
              <filter class="solr.StandardFilterFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.StopFilterFactory" luceneMatchVersion="LUCENE_24"/>
              <filter class="solr.EnglishPorterFilterFactory"/>
            </analyzer>
          </fieldtype>
        

        ...and this is the only new syntax added by Uwe's patch that i am opposed to...

        src/test/test-files/solr/conf/solrconfig.xml
        <luceneMatchVersion>LUCENE_29</luceneMatchVersion>
        
        Robert Muir added a comment -

        WTF?!?! ... now i feel like you are just messing with my head.

        I am really not trying to, i guess we have just put in some recent work that only happens with Version >= <something recent> and it would be a shame if it were never used because we made this too difficult, and it simply falls back on 2.4 and works without this parameter so no one bothers.

        And I also can't see anyone really spending time to aggressively ensure that the example schema etc is all up to date (personally i would try to help, it is difficult though with lucene and solr so out of sync)

        I've never argued that the user shouldn't be allowed to change the behavior of any class away from the (hardcoded) 2.4 behavior - i've tried to be very clear that my objection was only to the new "global" default setting that would have action at a distance for all of these Version dependent classes w/o any obvious indication of what it would affect.

        the hardcoded 2.4 behavior is the action at a distance, because if i do not specify Version in my configuration file, then i get this very old behavior.

        If this is really your concern, then i have an alternative i propose.

        • No default anywhere, not even in the code
        • Version is mandatory if the thing requires it
        Uwe Schindler added a comment -

        My suggestion for how to implement this would be...

        1. Add a new "luceneMatchVersion" attribute to the existing <schema/> tag.
        2. Add a new getLuceneMatchVersion() to the IndexSchema class ... SolrCore can use this to get the default.
        3. When init()ing new objects, include the key=>value pair of "luceneMatchVersion"=>schema.getLuceneMatchVersion() to the init method of the object if it's not already an init param for that particular instance.

        This would eliminate the need to make any of the Analysis Factories SolrCoreAware (or even ResourceLoaderAware) just to know what the luceneMatchVersion should be – the Base*Factories could still contain a protected Version luceneMatchVersion set by the base init() method that subclasses could use as needed.

        NOTE: This still doesn't solve the "Analyzers must have no-arg constructors" part of the issue – but it doesn't make it worse. We can make IndexSchema pass this.getLuceneMatchVersion() to any Analyzer with a single arg "Version" constructor fairly easily. If/When we provide a more general mechanism for passing constructor args to Analyzers, any Version params could be defaulted just like with the factory init() methods.

        That was my proposal a few comments above. But: I still do not want it in schema.xml, as Version is a global Lucene thing! But the behaviour would be the same: The schema code can get the version from somewhere and pass it down to all schema components as you propose.

        The "Analyzers must have no-arg ctor" part is easy: use reflection and look first for a ctor that takes a Version; if one exists, use it and pass the Version from the init/schema/config arg; if it does not exist, use the no-arg ctor. We already have this in Lucene's benchmark contrib since 3.0.
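
A minimal sketch of that reflection fallback, written so it compiles without Lucene on the classpath. The nested `Version` enum and the two analyzer classes are stand-ins invented for the example, and `newInstance` is a hypothetical helper; this is not the code from the benchmark contrib.

```java
import java.lang.reflect.Constructor;

/** Hypothetical helper for illustration only; not Solr or Lucene code. */
final class VersionAwareInstantiator {
    /** Stand-in for org.apache.lucene.util.Version. */
    public enum Version { LUCENE_24, LUCENE_29 }

    /** Stand-in for an analyzer that offers a single-arg Version ctor. */
    public static class VersionedAnalyzer {
        public final Version matchVersion;
        public VersionedAnalyzer(Version matchVersion) { this.matchVersion = matchVersion; }
    }

    /** Stand-in for an analyzer that only has a no-arg ctor. */
    public static class LegacyAnalyzer {
        public LegacyAnalyzer() {}
    }

    private VersionAwareInstantiator() {}

    /** Prefers a public ctor taking Version; otherwise uses the public no-arg ctor. */
    static <T> T newInstance(Class<T> clazz, Version matchVersion) {
        try {
            Constructor<T> ctor;
            Object[] args;
            try {
                ctor = clazz.getConstructor(Version.class);
                args = new Object[] { matchVersion };
            } catch (NoSuchMethodException noVersionCtor) {
                // No Version ctor: fall back to the no-arg one.
                ctor = clazz.getConstructor();
                args = new Object[0];
            }
            return ctor.newInstance(args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
        }
    }
}
```

Calling `newInstance(VersionedAnalyzer.class, Version.LUCENE_29)` goes through the Version ctor, while `newInstance(LegacyAnalyzer.class, Version.LUCENE_29)` ignores the version and uses the no-arg ctor.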

        Hoss Man added a comment -

        And I also can't see anyone really spending time to aggressively ensure that the example schema etc is all up to date

        I think you are vastly underestimating how much work is spent reviewing the example schema.xml prior to releases. It would be trivial to search/replace luceneMatchVersion="X" with luceneMatchVersion="Y" anytime the "current" version of Version was updated in Lucene-Java

        the hardcoded 2.4 behavior is the action at a distance, because if i do not specify Version in my configuration file, then i get this very old behavior.

        I don't follow you at all – you have identified no action, or distance in your example.

        When i say i'm worried about scary action at a distance, i'm talking about editing some thing A in a config file, and having it result in changed behavior (action) in things B, C and D that do not directly refer to A in any way (distance). Further more these changes in behavior are silent (thus scary).

        If I have <fieldType name="A"/> and much later in the config <field name="B" type="A"/> then editing A results in an action on B at a distance – but this should not surprise me at all because B explicitly references A.

        Having a global <luceneMatchVersion/> tag that affects the behavior of a variety of different things when it's modified leads to situations where people might change that value triggering changes in many components w/o a clear idea of what might have changed – so they don't even know what things they should focus on testing for correctness after making that change.

        The existing <schema version="X"/> property also leads to action at a distance type situations – but that is a lot less scary to me because at least with it there is a uniform set of changes to all schema objects between any two versions, so it's easy to document what changes when you go from 1.1 to 1.2, or 1.2 to 1.3 ... but with luceneMatchVersion the potential changes are unique to every individual Class that cares about it.

        If this is really your concern, then i have an alternative i propose.

        • No default anywhere, not even in the code
        • Version is mandatory if the thing requires it

        This is something Uwe and i both discussed in previous comments...

        https://issues.apache.org/jira/browse/SOLR-1677?focusedCommentId=12796872#action_12796872
        https://issues.apache.org/jira/browse/SOLR-1677?focusedCommentId=12796937#action_12796937

        ...as i said: i'm fine with this idea in theory – as a long term plan – but there has to be a gradual migration process for people. ie: it can be required on certain objects in a future release, but for at least the next release it needs to be possible to not specify the luceneMatchVersion on all of these objects, and when people use them w/o specifying, they can log big fat warnings on init that it is defaulting to 2.4, and they should set the property explicitly if that's what they want.
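
The migration step described above (fall back to 2.4, but warn loudly) could look roughly like this. Everything here is an assumption made for illustration: `MatchVersionResolver` is a hypothetical class, the nested `Version` enum stands in for Lucene's, and java.util.logging is used only to keep the sketch dependency-free.

```java
import java.util.Map;
import java.util.logging.Logger;

/** Hypothetical helper for illustration only; not part of Solr. */
final class MatchVersionResolver {
    /** Stand-in for org.apache.lucene.util.Version. */
    public enum Version { LUCENE_20, LUCENE_24, LUCENE_29 }

    static final Version LEGACY_DEFAULT = Version.LUCENE_24;
    private static final Logger LOG = Logger.getLogger(MatchVersionResolver.class.getName());

    private MatchVersionResolver() {}

    /**
     * Reads "luceneMatchVersion" from a component's init args. If it is
     * missing, logs a warning and falls back to the legacy 2.4 behavior.
     * An unrecognized value makes Version.valueOf throw IllegalArgumentException.
     */
    static Version resolve(String componentName, Map<String, String> initArgs) {
        String raw = initArgs.get("luceneMatchVersion");
        if (raw == null) {
            LOG.warning(componentName + ": no luceneMatchVersion specified, defaulting to "
                    + LEGACY_DEFAULT + "; set it explicitly if that is what you want");
            return LEGACY_DEFAULT;
        }
        return Version.valueOf(raw.trim());
    }
}
```

Making the parameter mandatory in a later release would then only mean replacing the warning branch with an exception.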


        I still do not want it in schema.xml, as Version is a global Lucene thing!

        Uwe: I think you are misunderstanding the reason for a distinction between solrconfig.xml and schema.xml in Solr. If (for the sake of argument) luceneMatchVersion really should be a "global Lucene thing" then that is precisely why it should be in schema.xml.

        schema.xml is for configuration that is inherently part of the index, and must be consistent regardless of who/how/why that index is being used. solrconfig.xml is where settings are put that are specific to how a particular instance of an index is being used. If a setting is in solrconfig.xml, then it should be possible for that setting to be completely different on different solr instances that use the exact same schema.xml – even if they use cloned copies of the same index directory. (ie: master/slave distinctions in replication; peer slaves with distinct handler/cache settings to serve distinct use cases; etc...)

        That's the reason why nothing that hangs off of IndexSchema is currently allowed to be SolrCoreAware, or get access to the SolrConfig object (and the SolrResourceLoader abstraction was created) ... nothing about the SolrCore "instance" should be allowed to influence the resulting index, because that index may later be used on a different instance with a different config.

        As i mentioned before: solrconfig.xml can depend on schema.xml, but schema.xml can not depend on solrconfig.xml

        So if a global luceneMatchVersion can affect the behavior of an analyzer or FieldType in a way that is "persisted" as part of the index – and other classes (like QueryParser in Robert's example) need to make sure to use the same luceneMatchVersion to behave correctly with that index, then that setting needs to be in the schema.xml so it is consistent no matter how/where that index and schema.xml file are used.

        Does that make sense?


        I'd still like to clarify this whole issue of whether "Lucene-Java", as a project, has an expectation that client applications will always use a consistent value for Version when constructing objects that interact with an index, as Robert alluded to in a previous comment...

        I don't think Version is intended so you can use X.Y on this part and Y.Z on this part

        This was not my impression when Version was added – but i freely admit I wasn't paying that much attention.

        In Uwe's comment he implied (but didn't actually state) that he concurred with Robert...

        ...Version is a global Lucene thing...

        Iff that expectation really is true in Lucene-Java, and iff there really is an expectation that using multiple Version values within Solr is likely to cause people problems as objects interact, then it seems to me that it would be a very bad idea to offer any sort of out of the box support for per object overriding of luceneMatchVersion in our solrconfig.xml/schema.xml.

        i know, i know ... this is a complete 180 from my previous claim that we should only have per object configuration – a claim that i still stand behind if Lucene-Java "supports" applications using multiple values of Version, but if that is not considered "supported" and if changes are actively being made in Lucene-Java that explicitly assume consistent Version usage, then I'm not convinced it would be a good idea to enable people to tweak things in that way. Anyone who understands the underlying Java code enough to appreciate the nuances of using A.B in one place and X.Y in another place can write their own Factory that looks at a luceneMatchVersion init param – the out of the box ones should stick with the global setting.

        BUT!!!!! ... those are Big "IFFs" ...

        • Uwe: do you concur with Robert?
        • Are there any threads/docs about the expectations of Version homo/hetero-geneity in Lucene-Java?
        Marvin Humphrey added a comment -

        > I'd still like to clarify this whole issue of whether "Lucene-Java", as a project,
        > has an expectation that client applications will always use a consistent value
        > for Version when constructing objects that interact with an index

        Yes. The whole point is to avoid Analyzer mismatches.

        Say a stoplist was modified between Lucene versions. Sure, you can hack it
        and ask for an old match version, so you get a stoplist other than the one that
        was used to build the index... but why would you want to?

        > Are there any threads/docs about the expectations of Version
        > homo/hetero-genousness in Lucene-Java?

        The original thread from last May, I guess... which culminated in LUCENE-1684:

        http://markmail.org/thread/egqe6rm4c4om7swv

        It's very long, though.

        Hoss Man added a comment -

        Yes. The whole point is to avoid Analyzer mismatches.

        Say a stoplist was modified between Lucene versions. Sure, you can hack it
        and ask for an old match version, so you get a stoplist other than the one that
        was used to build the index... but why would you want to?

        ...but that's no different than using StopFilter(someStopWordSet) at indexing and StopFilter(someOtherStopWordSet) at query time – Solr happily lets you do that with its index/query analyzers ... you may have a very good reason for doing that. Likewise you may have an existing field using the "default" stopwords list from Version.LUCENE_24 that you don't want to change because you want clients that search on that field to continue to get the same behavior, but when you add a new field you want it to have the current default stopwords because it's queried by entirely different clients.

        That's no different than saying i want PorterStemmer on fieldA and SnowBall2Stemmer on fieldB.

        The implication i got from Robert was that there was (or would soon be) expectations in Lucene-Java code that if one object was told to use Version.X it would be assumed that every other object in the application was using Version.X.

        To me that's the crux of the whole issue: If that is the expectation Lucene-Java has, then we should have a single global config for luceneMatchVersion and not support per-object configuration. If that is not the expectation, then we should not have a global luceneMatchVersion.

        Robert Muir added a comment -

        The implication i got from Robert was that there was (or would soon be) expectations in Lucene-Java code that if one object was told to use Version.X it would be assumed that every other object in the application was using Version.X.

        Hoss, I didn't mean to imply any such thing, just that i don't see any tests (or the framework for testing such behavior), so even if it's officially supported, in my opinion it does not exist.

        For example, as far as analysis goes, my personal opinion is that in any given package (say one language, or whatever), we will test the entire Analyzer against Version X, and will test that back compat works for Version Y, Z, etc.

        But i personally can't see myself ensuring that all the underlying tokenstreams (maybe this language uses 5, let's say) work across all the permutations of different versions { X, Y, Z } you can apply; it's simply asking too much.
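
        To put a rough number on that: assuming each of the 5 tokenstreams could independently be given any of the 3 versions, the number of mixed configurations grows exponentially, while the consistent case (same version everywhere) stays at 3. A minimal sketch of the arithmetic; the class and method names are made up for illustration:

```java
final class VersionCombinations {
    /** Ways to assign one of `versions` values to each of `components` independently. */
    static long mixed(int components, int versions) {
        long total = 1;
        for (int i = 0; i < components; i++) {
            total *= versions;
        }
        return total;
    }

    public static void main(String[] args) {
        // 5 tokenstreams, 3 versions { X, Y, Z }: 3^5 mixed configurations
        System.out.println("mixed:      " + mixed(5, 3)); // 243
        // One shared version per analyzer: only one configuration per version
        System.out.println("consistent: " + 3);
    }
}
```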

        Mark Miller added a comment -

        In my opinion this should be real simple. Having to specify a Lucene version for each component is not simple - it's beyond most users. I think it's beyond me (laugh as you see fit). Having to accept Lucene 2.4 behavior by default because of Solr back compat issues is also "weak". A new user should get all the bug fixes of the latest Lucene with minimal effort. Hopefully no effort. Older users should be able to get the newest with minimal effort as well - not having to go one by one through each component and upgrading it. I can't imagine juggling all these versions for each component - that's ugly enough in Lucene - it shouldn't infect Solr for the average case.

        Personally, I do think there should be a global default. And I think right next to it, it should say, if you change this, you must reindex. No worries about action at a distance. The action is to get the latest and greatest Lucene has to offer rather than older buggy or back compat behavior. Reindex, get latest greatest. Don't reindex and you're on your own. Solr might rip your head off.

        We should also offer per component for real experts, but I wouldn't be meddling that way myself unless in a bind. Solr should be real simple about this - and the latest Solr should use the latest bug fixes from Lucene, with previous configs out there defaulting to 2.4 compatibility.
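
        The arrangement described here (a global default, an optional per-component override, and 2.4 behavior when an old config says nothing) could be sketched roughly as follows. This is an illustration only: MatchVersion and MatchVersionResolver are hypothetical stand-ins, not Lucene's Version class or any Solr API, and the accepted string formats ("2.4", "LUCENE_29") are an assumption.

```java
import java.util.Locale;

/** Hypothetical stand-in for Lucene's Version constants; not the real class. */
enum MatchVersion {
    LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31;

    /** Parses strings such as "2.4" or "LUCENE_29" (an assumed format). */
    static MatchVersion parse(String s) {
        String v = s.trim().toUpperCase(Locale.ROOT);
        if (!v.startsWith("LUCENE_")) {
            v = "LUCENE_" + v.replace(".", "");
        }
        return MatchVersion.valueOf(v); // throws IllegalArgumentException if unknown
    }
}

final class MatchVersionResolver {
    private final MatchVersion globalDefault;

    MatchVersionResolver(String globalSetting) {
        // Configs with no global setting keep 2.4 behavior for back compat.
        this.globalDefault = globalSetting == null
                ? MatchVersion.LUCENE_24
                : MatchVersion.parse(globalSetting);
    }

    /** A per-component value, when present, wins over the global default. */
    MatchVersion resolve(String perComponentSetting) {
        return perComponentSetting == null
                ? globalDefault
                : MatchVersion.parse(perComponentSetting);
    }

    public static void main(String[] args) {
        MatchVersionResolver oldConfig = new MatchVersionResolver(null);
        MatchVersionResolver newConfig = new MatchVersionResolver("3.1");
        System.out.println(oldConfig.resolve(null));  // LUCENE_24
        System.out.println(newConfig.resolve(null));  // LUCENE_31
        System.out.println(newConfig.resolve("2.9")); // LUCENE_29 (expert override)
    }
}
```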

        I abbreviated the heck out of my arguments and thinking, but damn it, that's what I think.

        Hoss Man added a comment -

        I'm definitely of two minds on this.

        On the one hand...

        Robert's clarification of his concerns convinces me that we don't need a global setting. The issue of multiple related components in an analysis chain (ie: EsperantoTokenizer, EsperantoStopFilter, and EsperantoStemmerFilter) not being well tested in Lucene-Java when those components use different Version properties doesn't seem like a compelling argument because we've never made any claims that any combinations of analysis components will work together. People can easily construct Analyzers in their schema.xml that make no sense, and don't work at all; we'll never be able to solve that problem for everyone. Worrying about people mismatching version numbers doesn't seem any different than worrying about them using inconsistent stopword files between an index analyzer and a query analyzer on the same field: buyer beware.

        On the other hand...

        I view the Version property of all these Lucene-Java classes as an implementation detail of the generalized ideal of providing multiple solutions for a similar problem that have subtly different behavior. To my mind: Adding a version property to StandardTokenizer is just an alternate approach to deprecating StandardTokenizer and providing a new StandardTokenizer2 where the behavior is "improved" based on the subjective opinion of the Lucene community. The Version property approach is easier to maintain in the Lucene source tree, but still requires roughly the same amount of work on the part of client app maintainers when upgrading: consider whether you think the "improved" behavior is better for your application, and modify your code as needed. I've been looking at how this should be supported in Solr with that perspective, putting the schema.xml owner in the role of the client app maintainer.

        But I'm realizing now that I'm clearly in the minority in viewing these multiple versions as "alternate implementations" ... everyone else seems to have a very fixed view that these Version based changes are genuine improvements/bug-fixes, w/o any expectation that clients might/could subjectively decide "i want the old behavior" and that older "Versions" are supported purely for back-compatibility.

        If that's how Version is really going to be used in Lucene-Java moving forward, then I can definitely understand the push for having it globally configured in Solr for simplification.


        I won't fight you guys on this ... if I'm the only one that feels like a global value is bad, then i concede that probably says more about me than about the idea.

        But I'm still really worried about the problem of (opaque) action at a distance, and the difficulties in understanding what effects there will be when changing the luceneMatchVersion property from one value to another.

        This comment from Mark illustrates what scares me the most...

        it should say, if you change this, you must reindex. No worries about action at a distance. The action is to get the latest and greatest Lucene has to offer rather than older buggy or back compat behavior.

        ...that mindset, that as long as you reindex you'll be fine, totally downplays the fact that changes will happen in places the user may not realize. w/o a clear way of knowing what exactly is changing when you modify that (global) value, users will have no idea what to look for when they "upgrade" it. they won't have any visibility into the full set of behavior changes to expect as a result of that update, to know what they should test to make sure it still works the way they need it to.

        If they read in a mailing list thread that they need to switch from <luceneMatchVersion>2.4</luceneMatchVersion> to <luceneMatchVersion>2.9</luceneMatchVersion> and completely reindex in order to get positions to be preserved in StopFilterFactory, that doesn't help them realize that they should do relevancy testing on fieldA and fieldB which use some language specific stemmer whose behavior changed in a small but significant way.

        As a user, that's the nightmare scenario i don't want to have to deal with: grepping through every class in Lucene-Java that has a Version property to see which ones have different behavior between the luceneMatchVersion property i'm currently using and the luceneMatchVersion property i've been told i should upgrade to in order to fix a bug ... just so i know what things i need to test after i make my change.
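
        One way to picture the documentation being asked for is a lookup from match version to the components whose behavior changed at that version. The sketch below is hypothetical: MatchVersionChanges is not part of Lucene or Solr, versions are encoded as plain integers (24 for 2.4, 29 for 2.9), and the sample entries are placeholders rather than a real changelog (the stemmer entry in particular is invented).

```java
import java.util.ArrayList;
import java.util.List;

final class MatchVersionChanges {
    /** One behavior change: the component and the first version that has it. */
    static final class Change {
        final String component;
        final int sinceVersion;
        final String note;

        Change(String component, int sinceVersion, String note) {
            this.component = component;
            this.sinceVersion = sinceVersion;
            this.note = note;
        }
    }

    private final List<Change> changes = new ArrayList<>();

    MatchVersionChanges record(String component, int sinceVersion, String note) {
        changes.add(new Change(component, sinceVersion, note));
        return this;
    }

    /** Components whose behavior differs when moving from `from` up to `to`. */
    List<String> affected(int from, int to) {
        List<String> out = new ArrayList<>();
        for (Change c : changes) {
            boolean inRange = c.sinceVersion > from && c.sinceVersion <= to;
            if (inRange && !out.contains(c.component)) {
                out.add(c.component);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder entries for illustration only.
        MatchVersionChanges log = new MatchVersionChanges()
                .record("StopFilterFactory", 29, "positions preserved")
                .record("StandardTokenizer", 29, "token type fixes")
                .record("EsperantoStemmerFilter", 31, "hypothetical stemmer fix");
        System.out.println(log.affected(24, 29));
        System.out.println(log.affected(24, 31));
    }
}
```

        With something like this published per release, a user moving from 2.4 to 2.9 would at least get a list of components to retest instead of having to read the source.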

        I guess this will just be a documentation problem, but it seems like a pretty fucking big one.

        Robert Muir added a comment -

        Hi Hoss Man,

        I think I am slightly offended with some of your statements about 'subjective opinion of the Lucene Community' and 'they should do relevancy testing which use some language-specific stemmer whose behavior changed in a small but significant way'.

        I've personally restricted my contributions of language support to those I have either personally relevance tested, or developed from published relevance results. These results are all listed on each JIRA ticket (MAP values and such). I can give you a list of all these issues if you want.

        As far as changing stemmers, we have never done this.
        The only "stemmer changing" I have proposed is fixing bugs, where I have taken the snowball test data and found either bugs in snowball or duplicate implementations we have in our own source tree.
        And to "fix the bugs" I have only proposed that we simply use snowball itself rather than some duplicate, buggy hand-coded implementation.

        So I'm a little confused about what you are referring to... some theoretical situation?

        Mark Miller added a comment -

        If you are thinking of VERSION as alternate versions, I can see your point.

        But I can't imagine that's what VERSION is for.

        everyone else seems to have a very fixed view that these Version based changes are genuine improvements/bug-fixes, w/o any expectation that clients might/could subjectively decide "i want the old behavior" and that older "Versions" are supported purely for back-compatibility.

        I don't think Version is meant to be used so that users can choose how things operate - personally I do see it as purely a way to get bad behavior for back compatibility. If that's not the case, we should not use Version in Lucene, we should make a Class2. Then you pick which you want. To me, Version is for fixing bugs or things that are clearly not the right way of doing things. Not a choice list. If more than one choice makes sense that should be done without Version. Personally that's all that makes sense to me. Perhaps it will be abused, but personally I'd push back. Version is not a functionality selector - it's a way to handle back compat for bugs and clear improvements - stuff we plan and hope to drop into a big black hole forever. Not "options" that make sense and we plan to keep around for users to mull over.

        I'm also not that worried that users won't know what changed - they will just know that they are in the same boat as those downloading Lucene latest greatest for the first time. Likely the best boat to be in when it comes to this stuff. If they want to manage things piecemeal, I'm still all for allowing Version per component for experts use. But man, I wouldn't want to be in the boat, managing all my components as they mimic various bugs/bad behavior for various components.

        When I download the latest Solr and do a fresh install, I want it to have all of the latest Lucene bugs fixed (not the case currently). When I have an old install, I want to be able to change one setting and reindex to get all known bugs fixed (currently not the case - heck it's not even possible to run Solr currently with all the known Lucene bugs fixed).

        Hoss Man added a comment -

        I think I am slightly offended with some of your statements about 'subjective opinion of the Lucene Community' and 'they should do relevancy testing which use some language-specific stemmer whose behavior changed in a small but significant way'.

        That was not at all my intention, i'm sorry about that. I was in fact trying to speak entirely in generalities and theoretical examples.

        The point I was trying to make is that the types of bug fixes we make in Lucene are not mathematical absolutes – we're not fixing bugs where 1+1=3. Even if everyone on java-dev, and java-user agrees that behavior A is broken and behavior B is correct, that is still (to me) a subjective opinion – 1000 men's trash may be one man's treasure, and there could be users out there who have come to expect/rely on that behavior A.

        I tried to use a stemmer as an example because it's the type of class where making behavior more correct (ie: making the stemming match the semantics of the language more accurately) doesn't necessarily improve the perceived behavior for all users – someone could be very happy with the "sloppy stemming" in the 3.1 version of a (hypothetical) EsperantoStemmer because it gives him really "loose" matches. And if you (or anyone else) put in a lot of hard work making that stemmer "better" by all conceivable metrics in 3.4, then i've got no problem telling that person "Sorry dude, if you don't want those fixes don't upgrade, or here are some other suggestions for getting 'loose' matching on that field."

        My concern is that there may be people who don't even realize they are depending on behavior like this. Without an easy way for users to understand what objects have improved/fixed behavior between luceneMatchVersion=X and luceneMatchVersion=Y they won't know the full list of things they should be considering/testing when they do change luceneMatchVersion.

        I'm also not that worried that users won't know what changed - they will just know that they are in the same boat as those downloading Lucene latest greatest for the first time.

        But that's not true: a person downloading for the first time won't have any preconceived expectations of how something will behave; that's a very different boat from a person upgrading, who is going to expect things that were working to keep working – those things may have actually been bugs in earlier versions, but if they seemed to be working for their use cases, it's going to feel like it's broken when the behavior changes. For a user who is consciously upgrading i'm ok with that. but when there is no easy way of knowing what behavior will change as a result of setting luceneMatchVersion=X that doesn't feel fair to the user.

        Robert mentioned in an earlier comment that StopFilter's position increment behavior changes depending on the luceneMatchVersion – what if an existing Solr 1.3 user notices a bug in some Tokenizer, and adds <luceneMatchVersion>3.0</luceneMatchVersion> to his schema.xml to fix it. Without clear documentation on everything that is affected when doing that, he may not realize that StopFilter changed at all – and even though the position increment behavior may now be more correct, it might drastically change the results he gets when using dismax with a particular qs or ps value. Hence my point that this becomes a serious documentation concern: finding a way to make it clear to users what they need to consider when modifying luceneMatchVersion.
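
        To make the StopFilter example concrete, here is a small self-contained sketch of how token positions differ depending on whether the gaps left by removed stopwords are preserved. It does not use Lucene; StopPositionsDemo and its positions method are invented for illustration, and the real filter's behavior per Version should be checked against the Lucene source.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class StopPositionsDemo {
    /**
     * Returns the position of each surviving token (tokens assumed distinct).
     * With preserveIncrements, a removed stopword still advances the position.
     */
    static Map<String, Integer> positions(List<String> tokens,
                                          Set<String> stopWords,
                                          boolean preserveIncrements) {
        Map<String, Integer> out = new LinkedHashMap<>();
        int position = -1;
        int skipped = 0;
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                skipped++;
                continue;
            }
            position += 1 + (preserveIncrements ? skipped : 0);
            skipped = 0;
            out.put(token, position);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("fox", "in", "the", "hat");
        Set<String> stop = new HashSet<>(Arrays.asList("in", "the"));
        System.out.println(positions(tokens, stop, false)); // {fox=0, hat=1}
        System.out.println(positions(tokens, stop, true));  // {fox=0, hat=3}
    }
}
```

        In the first case "fox" and "hat" are adjacent, so an exact phrase match on the two succeeds; in the second they are two positions further apart, so the same phrase needs extra slop to match. That is the kind of change that would show up through qs or ps without the user having touched those settings.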

        I'm still all for allowing Version per component for experts use. But man, I wouldn't want to be in the boat, managing all my components as they mimic various bugs/bad behavior for various components.

        But if the example configs only show a global setting that isn't directly "linked" to any of the individual object configurations, then normal users won't have any idea what could have/use individual luceneMatchVersion settings anyway (even if they wanted to manage it piecemeal).

        Like i said: i've come around to the idea of having/advocating a global value. Once i got past my mistaken thinking of "Version" as controlling "alternate versions" (as Miller very clearly put it) I started to understand what you are all saying and i agree with you: a single global value is a good idea.

        My concern is just how to document things so that people don't get confused when they do need to change it.

        Robert Muir added a comment -

        The point I was trying to make is that the types of bug fixes we make in Lucene are not mathematical absolutes - we're not fixing bugs where 1+1=3.

        You are wrong, they are absolutes.
        And here are the JIRA issues for stemming bugs, since you didn't take my hint to go and actually read them.

        LUCENE-2055: I used the snowball tests against these stemmers which claim to implement 'snowball algorithm', and they fail. This is an absolute, and the fix is to instead use snowball.
        LUCENE-2203: I used the snowball tests against these stemmers and they failed. Here is Martin Porter's confirmation that these are bugs: http://article.gmane.org/gmane.comp.search.snowball/1139

        Perhaps you should come up with a better example than stemming, as you don't know what you are talking about.

        Hoss Man added a comment -

        And here are the JIRA issues for stemming bugs, since you didn't take my hint to go and actually read them.

        sigh. I read both those issues when you filed them, and I agreed with your assessment that they are bugs we should fix – if i had thought you were wrong i would have said so in the issue comments.

        But that doesn't change the fact that sometimes people depend on buggy behavior – and sometimes those people depend on the buggy behavior without even realizing it. Bug fixes in a stemmer might make it more correct according to the stemmer algorithm specification, or the language semantics, but in some peculiar use cases an application might find the "correct" implementation less useful than the previous buggy version.

        This is one reason why things like CHANGES.txt are important: to draw attention to what has changed between two versions of a piece of software, so people can make informed opinions about what they should test in their own applications when they upgrade things under the covers. luceneMatchVersion should be no different. We should try to find a simple way to inform people "when you switch from luceneMatchVersion=X to luceneMatchVersion=Y here are the bug fixes you will get" so they know what to test to determine if they are adversely affected by that bug fix in some way (and find their own work around)

        Perhaps you should come up with a better example than stemming, as you don't know what you are talking about.

        1) It's true, I frequently don't know what i'm talking about ... this issue was a prime example, and i thank you, Uwe, and Miller for helping me realize that i was completely wrong in my understanding about the intended purpose of o.a.l.Version, and that a global setting for it in Solr makes total sense – But that doesn't make my concerns about documenting the effects of that global setting any less valid.

        2) Perhaps you should read the StopFilter example i already posted in my last comment...

        Robert mentioned in an earlier comment that StopFilter's position increment behavior changes depending on the luceneMatchVersion – what if an existing Solr 1.3 user notices a bug in some Tokenizer, and adds <luceneMatchVersion>3.0</luceneMatchVersion> to his schema.xml to fix it. Without clear documentation on everything that is affected when doing that, he may not realize that StopFilter changed at all – and even though the position increment behavior may now be more correct, it might drastically change the results he gets when using dismax with a particular qs or ps value. Hence my point that this becomes a serious documentation concern: finding a way to make it clear to users what they need to consider when modifying luceneMatchVersion.

        Robert Muir added a comment -

        2) Perhaps you should read the StopFilter example i already posted in my last comment...

        https://issues.apache.org/jira/browse/LUCENE-2094?focusedCommentId=12783932&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12783932

        as far as this one goes, i specifically commented before on this not being 'hidden' by Version (with Solr users in mind) but instead its own option that every user should consider, regardless of defaults.

        For the stopfilter posInc the user should think it through; it's pretty strange, like i mention in my comment, that a definite article like 'the' gets a posInc bump in one language but not another, simply because it happens to be separated by a space.

        I guess I could care less what the default is, if you care about such things you shouldn't be using the defaults and instead specifying this yourself in the schema, and Version has no effect. I can't really defend the whole stopfilter posInc thing, as again i think it doesn't make a whole lot of sense, maybe it works well for english I guess, I won't argue about it.

        Hoss Man added a comment -

        I guess I could care less what the default is, if you care about such things you shouldn't be using the defaults and instead specifying this yourself in the schema, and Version has no effect.

        ...which is all well and good, but it just re-iterates the need for really good documentation about what is impacted by changing a global Version setting – otherwise users might be depending on a default behavior that is going to change when Version is bumped, and they may not even realize it.

        Bear in mind: these are just the nuances that people need to worry about when considering a switch from 2.4 to 2.9 to 3.0 ... there will likely be a lot more of these over time.

        And just to be as crystal clear as i possibly can:

        • my concern is purely about how to document this stuff.
        • i do in fact agree that a global luceneMatchVersion option is a good idea
        Uwe Schindler added a comment -

        This patch was committed to the Lucene-trunk upgrade branch. It is changed to not make the factories CoreAware.

        Uwe Schindler added a comment -

        I also added support for instantiating Lucene Analyzers directly, which broke with the 3.0-upgrade. The new code now prefers a one-arg-Version-ctor and falls back to the no-arg one. The only thing that is not working at the moment is the -Aware stuff, as SolrResourceLoader.newInstance() was not usable.
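For readers following along, the "prefer the one-arg Version ctor, fall back to the no-arg one" idea can be sketched with plain reflection. This is only an illustration, not the committed Solr code: the `Version` enum below is a stand-in for o.a.lucene.util.Version, and `VersionedAnalyzer`, `LegacyAnalyzer` and `newInstance` are hypothetical names invented for the sketch.

```java
import java.lang.reflect.Constructor;

final class AnalyzerInstantiationSketch {

    // Stand-in for org.apache.lucene.util.Version (assumption, not the real class).
    public enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Hypothetical analyzer that declares a one-arg Version constructor.
    public static class VersionedAnalyzer {
        public final Version matchVersion;
        public VersionedAnalyzer(Version matchVersion) {
            this.matchVersion = matchVersion;
        }
    }

    // Hypothetical analyzer that only has a no-arg constructor.
    public static class LegacyAnalyzer {
        public LegacyAnalyzer() {}
    }

    // Try the public (Version) constructor first; if the class does not
    // declare one, fall back to the public no-arg constructor.
    static <T> T newInstance(Class<T> clazz, Version matchVersion)
            throws ReflectiveOperationException {
        try {
            Constructor<T> ctor = clazz.getConstructor(Version.class);
            return ctor.newInstance(matchVersion);
        } catch (NoSuchMethodException e) {
            return clazz.getConstructor().newInstance();
        }
    }

    public static void main(String[] args) throws Exception {
        VersionedAnalyzer a = newInstance(VersionedAnalyzer.class, Version.LUCENE_30);
        LegacyAnalyzer b = newInstance(LegacyAnalyzer.class, Version.LUCENE_30);
        System.out.println(a.matchVersion + " / " + b.getClass().getSimpleName());
    }
}
```

Note that in this sketch a class with neither constructor still fails with NoSuchMethodException from the fallback lookup, so the caller sees an error instead of a silently missing analyzer.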

        Uwe Schindler added a comment -

        Just for documentation:
        Here are the patches with improvements to the version support for the Lucene-trunk upgrade branch.

        • More lenient matchVersion support ("V.V")
        • Default matchVersion for tests
        • Remove code duplication and some additional checks for analysis plugins that need version support to enforce the version
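A rough sketch of what lenient "V.V" handling could look like, for illustration only. It is not taken from the patches: the `Version` enum is a stand-in for o.a.lucene.util.Version, `parse` is a hypothetical helper, and accepting lower-case constant names is an extra assumption of this sketch.

```java
import java.util.Locale;

final class MatchVersionParserSketch {

    // Stand-in for org.apache.lucene.util.Version (assumption, not the real class).
    public enum Version { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

    // Accepts either the enum constant name ("LUCENE_30") or the short
    // "V.V" form ("3.0") and maps both to the same constant.
    static Version parse(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("luceneMatchVersion must not be null");
        }
        String s = raw.trim().toUpperCase(Locale.ROOT);
        // Rewrite "3.0" to "LUCENE_30" before the enum lookup.
        if (s.matches("\\d\\.\\d")) {
            s = "LUCENE_" + s.charAt(0) + s.charAt(2);
        }
        try {
            return Version.valueOf(s);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                "Unknown luceneMatchVersion '" + raw + "'", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("3.0"));
        System.out.println(parse("LUCENE_29"));
    }
}
```

Failing loudly on an unknown value, rather than falling back to a default, keeps a typo in the config from silently selecting old behaviour.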
        Hoss Man added a comment -

        Correcting Fix Version based on CHANGES.txt, see this thread for more details...

        http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

        Robert Muir added a comment -

        I think this issue has been resolved for some time.

        Grant Ingersoll added a comment -

        Bulk close for 3.1.0 release


          People

          • Assignee:
            Unassigned
            Reporter:
            Uwe Schindler