Lucene - Core
  1. Lucene - Core
  2. LUCENE-2510

migrate solr analysis factories to analyzers module

    Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-BETA, 5.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In LUCENE-2413 all TokenStreams were consolidated into the analyzers module.

      This is a good step, but I think the next step is to put the Solr factories into the analyzers module, too.

      This would make analyzers artifacts plugins to both lucene and solr, with benefits such as:

      • users could use the old analyzers module with solr, too. This is a good step to use real library versions instead of Version for backwards compat.
      • analyzers modules such as smartcn and icu, that aren't currently available to solr users due to large file sizes or dependencies, would be simple optional plugins to solr and easily available to users that want them.

      Rough sketch in this thread: http://www.lucidimagination.com/search/document/3465a0e55ba94d58/solr_and_analyzers_module

      Practically, I havent looked much and don't really have a plan for how this will work yet, so ideas are very welcome.

      1. LUCENE-2510-movefactories.sh
        30 kB
        Chris Male
      2. LUCENE-2510-simplify-tests.patch
        129 kB
        Chris Male
      3. LUCENE-2510-movefactories.sh
        17 kB
        Chris Male
      4. LUCENE-2510-multitermcomponent.patch
        23 kB
        Chris Male
      5. LUCENE-2510-multitermcomponent.patch
        17 kB
        Chris Male
      6. LUCENE-2510-parent-classes.patch
        162 kB
        Chris Male
      7. LUCENE-2510-parent-classes.patch
        123 kB
        Chris Male
      8. LUCENE-2510-resourceloader-bw.patch
        6 kB
        Chris Male
      9. LUCENE-2510-parent-classes.patch
        107 kB
        Chris Male
      10. LUCENE-2510.patch
        44 kB
        Chris Male
      11. LUCENE-2510.patch
        44 kB
        Chris Male
      12. LUCENE-2510.patch
        45 kB
        Chris Male

        Activity

        Hide
        Chris Male added a comment -

        +1 to this idea.

        From my perspective the only issue is loading resources used in the factories, which in Solr is currently handled through the SolrResourceLoader. My suggestion for this is to migrate Solr's ResourceLoader interface, and maybe provide simple FileResourceLoader and ClassPathResourceLoaders that Lucene users can use.

        Show
        Chris Male added a comment - +1 to this idea. From my perspective the only issue is loading resources used in the factories, which in Solr is currently handled through the SolrResourceLoader. My suggestion for this is to migrate Solr's ResourceLoader interface, and maybe provide simple FileResourceLoader and ClassPathResourceLoaders that Lucene users can use.
        Hide
        Robert Muir added a comment -

        Chris, yes I think at a glance this is where I got stuck

        Related to this, there is duplication in resource loading code already that would be nice to clean up.

        For example, Lucene and Solr have their own separate stopword-loading code etc. But I don't really like some of the things Lucene's WordListLoader does:

        • The lucene WordListLoader builds HashMaps and HashSets but this is wasteful since these are always then copied to CharArraySet/Maps... Solr's just builds CharArraySet/Map up front.
        • the Solr file loading code has some features like trying to guess the size of the set/map up front for faster loading.
        • the Solr stopword loading code is more user-friendly as it ignores BOM markers etc.

        I think it would be good to only have one piece of code for this functionality and for it to be optimal.

        Show
        Robert Muir added a comment - Chris, yes I think at a glance this is where I got stuck Related to this, there is duplication in resource loading code already that would be nice to clean up. For example, Lucene and Solr have their own separate stopword-loading code etc. But I don't really like some of the things Lucene's WordListLoader does: The lucene WordListLoader builds HashMaps and HashSets but this is wasteful since these are always then copied to CharArraySet/Maps... Solr's just builds CharArraySet/Map up front. the Solr file loading code has some features like trying to guess the size of the set/map up front for faster loading. the Solr stopword loading code is more user-friendly as it ignores BOM markers etc. I think it would be good to only have one piece of code for this functionality and for it to be optimal.
        Hide
        Chris Male added a comment -

        I think if we consolidate how to load resources, then we can easily reduce the loading of word lists to a single optimal way.

        Show
        Chris Male added a comment - I think if we consolidate how to load resources, then we can easily reduce the loading of word lists to a single optimal way.
        Hide
        Chris Male added a comment -

        Heres a few things I think we should do as part of this work:

        • Convert the factories to beans. This will make the properties that a Factory has a great deal clearer as it won't be necessary to initialize the Factory using a Map. It'll also increase their testability. Part of this will require changing how Solr initializes the Factories from its XML configuration. The simplist way is to change the attribute names in the configuration to match the bean property.
        • Either move the ResourceLoader interface to Lucene, or do some fancy property injection just in Solr and remove ResourceLoading from factories. The former option means Lucene, Solr and any future modules can all use a single consist interface for resource loading. But it may also introduce something which isn't Lucene's responsibility. The latter would involve the factories having List<String> properties for example, and Solr detecting this and injecting the list from its own ResourceLoader. It introduces more complexity into Solr, but presents a very clean interface in the Factory which increases its readability, testability, and makes them more programmatically friendly.
        Show
        Chris Male added a comment - Heres a few things I think we should do as part of this work: Convert the factories to beans. This will make the properties that a Factory has a great deal clearer as it won't be necessary to initialize the Factory using a Map. It'll also increase their testability. Part of this will require changing how Solr initializes the Factories from its XML configuration. The simplist way is to change the attribute names in the configuration to match the bean property. Either move the ResourceLoader interface to Lucene, or do some fancy property injection just in Solr and remove ResourceLoading from factories. The former option means Lucene, Solr and any future modules can all use a single consist interface for resource loading. But it may also introduce something which isn't Lucene's responsibility. The latter would involve the factories having List<String> properties for example, and Solr detecting this and injecting the list from its own ResourceLoader. It introduces more complexity into Solr, but presents a very clean interface in the Factory which increases its readability, testability, and makes them more programmatically friendly.
        Hide
        Chris Male added a comment -

        I'm getting this issue back online.

        First step, moving ResourceLoader/Aware into the analysis module. I can imagine later on different implementations of ResourceLoader which maybe just uses the Classpath or something.

        Command for patch:

        svn mv solr/solrj/src/java/org/apache/solr/common/ResourceLoader.java modules/analysis/common/src/java/org/apache/lucene/analysis/util/ResourceLoader.java
        svn mv solr/core/src/java/org/apache/solr/util/plugin/ResourceLoaderAware.java modules/analysis/common/src/java/org/apache/lucene/analysis/util/ResourceLoaderAware.java
        
        Show
        Chris Male added a comment - I'm getting this issue back online. First step, moving ResourceLoader/Aware into the analysis module. I can imagine later on different implementations of ResourceLoader which maybe just uses the Classpath or something. Command for patch: svn mv solr/solrj/src/java/org/apache/solr/common/ResourceLoader.java modules/analysis/common/src/java/org/apache/lucene/analysis/util/ResourceLoader.java svn mv solr/core/src/java/org/apache/solr/util/plugin/ResourceLoaderAware.java modules/analysis/common/src/java/org/apache/lucene/analysis/util/ResourceLoaderAware.java
        Hide
        Robert Muir added a comment -

        Hmm looking at this ResourceLoader, what is the purpose of the newInstance method?

        It seems to be unrelated to the 'resource loading', i think it should be something else?

        Separately, do we have any vague idea of a plan of how WordListLoader can implement this interface?
        I don't think we have to do that immediately to proceed, but long term I think it would make sense,
        since currently we have some duplicate code between lucene and solr here:

        • take a look at wordlistloader
        • take a look at the protected methods in BaseTokenStreamFactory.
        Show
        Robert Muir added a comment - Hmm looking at this ResourceLoader, what is the purpose of the newInstance method? It seems to be unrelated to the 'resource loading', i think it should be something else? Separately, do we have any vague idea of a plan of how WordListLoader can implement this interface? I don't think we have to do that immediately to proceed, but long term I think it would make sense, since currently we have some duplicate code between lucene and solr here: take a look at wordlistloader take a look at the protected methods in BaseTokenStreamFactory.
        Hide
        Chris Male added a comment -

        what is the purpose of the newInstance method?

        If you take a look at org.apache.solr.analysis.DelimitedPayloadTokenFilterFactory you'll see an example of how it's used.

        Looking at the implementation in SolrResourceLoader, it seems to facilitate two things:

        • The use of simplified solr.* package names
        • In FSTSynonymFilterFactory for example, newInstance is used to load other components. Consequently SolrResourceLoader adds the instantiated classes to its tracking of SolrCoreAware, ResourceLoaderAware, etc.

        With all that said, its only used in 3 Factories (but a lot of other Solr code). Perhaps we can break it out somehow.

        Separately, do we have any vague idea of a plan of how WordListLoader can implement this interface?

        I don't at this stage, but you're right, there is duplication. Off the top of my head I think we'd want to move everything over to using ResourceLoader, but somehow incorporate the WordlistLoader logic somewhere.

        Show
        Chris Male added a comment - what is the purpose of the newInstance method? If you take a look at org.apache.solr.analysis.DelimitedPayloadTokenFilterFactory you'll see an example of how it's used. Looking at the implementation in SolrResourceLoader, it seems to facilitate two things: The use of simplified solr.* package names In FSTSynonymFilterFactory for example, newInstance is used to load other components. Consequently SolrResourceLoader adds the instantiated classes to its tracking of SolrCoreAware, ResourceLoaderAware, etc. With all that said, its only used in 3 Factories (but a lot of other Solr code). Perhaps we can break it out somehow. Separately, do we have any vague idea of a plan of how WordListLoader can implement this interface? I don't at this stage, but you're right, there is duplication. Off the top of my head I think we'd want to move everything over to using ResourceLoader, but somehow incorporate the WordlistLoader logic somewhere.
        Hide
        Robert Muir added a comment -

        With all that said, its only used in 3 Factories (but a lot of other Solr code). Perhaps we can break it out somehow.

        I guess my main problem with it is the generics (it returns Object).
        Seems like the generics could be fixed so its parameterized to return ? extends X.
        If we add generics violations to the analyzers module, Uwe will not be happy

        I don't at this stage, but you're right, there is duplication. Off the top of my head I think we'd want to move everything over to using ResourceLoader, but somehow incorporate the WordlistLoader logic somewhere.

        Right I was just thinking really this stuff should be mostly in one place. I think
        its a little better now but there is some stuff in both places. I guess I can let
        that go, but it would be cool to have some sort of plan here, and if we don't tackle
        it, at least open up a followup issue since we are talking about an interface here:
        we won't be able to easy fix it without hard API breaks if we need.

        Don't get me wrong: when interfaces are the right choice, we should use them without fear!
        I think we just need to be extra careful up-front since we really should not break
        them across minor releases.

        Show
        Robert Muir added a comment - With all that said, its only used in 3 Factories (but a lot of other Solr code). Perhaps we can break it out somehow. I guess my main problem with it is the generics (it returns Object). Seems like the generics could be fixed so its parameterized to return ? extends X. If we add generics violations to the analyzers module, Uwe will not be happy I don't at this stage, but you're right, there is duplication. Off the top of my head I think we'd want to move everything over to using ResourceLoader, but somehow incorporate the WordlistLoader logic somewhere. Right I was just thinking really this stuff should be mostly in one place. I think its a little better now but there is some stuff in both places. I guess I can let that go, but it would be cool to have some sort of plan here, and if we don't tackle it, at least open up a followup issue since we are talking about an interface here: we won't be able to easy fix it without hard API breaks if we need. Don't get me wrong: when interfaces are the right choice, we should use them without fear! I think we just need to be extra careful up-front since we really should not break them across minor releases.
        Hide
        Chris Male added a comment - - edited

        I guess my main problem with it is the generics (it returns Object).
        Seems like the generics could be fixed so its parameterized to return ? extends X.
        If we add generics violations to the analyzers module, Uwe will not be happy

        +1

        I thought along the same lines so we can definitely clean it up. Wouldn't want to get a ticket from the policeman.

        Right I was just thinking really this stuff should be mostly in one place. I think
        its a little better now but there is some stuff in both places. I guess I can let
        that go, but it would be cool to have some sort of plan here, and if we don't tackle
        it, at least open up a followup issue since we are talking about an interface here:
        we won't be able to easy fix it without hard API breaks if we need.

        I'll think on it a bit and see if anybody else has any opinions. I agree that we need to be extra careful here.

        Show
        Chris Male added a comment - - edited I guess my main problem with it is the generics (it returns Object). Seems like the generics could be fixed so its parameterized to return ? extends X. If we add generics violations to the analyzers module, Uwe will not be happy +1 I thought along the same lines so we can definitely clean it up. Wouldn't want to get a ticket from the policeman. Right I was just thinking really this stuff should be mostly in one place. I think its a little better now but there is some stuff in both places. I guess I can let that go, but it would be cool to have some sort of plan here, and if we don't tackle it, at least open up a followup issue since we are talking about an interface here: we won't be able to easy fix it without hard API breaks if we need. I'll think on it a bit and see if anybody else has any opinions. I agree that we need to be extra careful here.
        Hide
        Chris Male added a comment -

        Patch updated to trunk.

        I'm going to commit this shortly and then tackle the factories themselves.

        Show
        Chris Male added a comment - Patch updated to trunk. I'm going to commit this shortly and then tackle the factories themselves.
        Hide
        Chris Male added a comment -

        Err... patch is broke, will fix.

        Show
        Chris Male added a comment - Err... patch is broke, will fix.
        Hide
        Chris Male added a comment -

        Actually it works, just needed to rebuild analysis/common.

        Show
        Chris Male added a comment - Actually it works, just needed to rebuild analysis/common.
        Hide
        Chris Male added a comment -

        Updated patch for moving ResourceLoader/Aware with a new command:

        svn mv solr/solrj/src/java/org/apache/solr/common/ResourceLoader.java lucene/analysis/common/src/java/org/apache/lucene/analysis/util/ResourceLoader.java
        svn mv solr/core/src/java/org/apache/solr/util/plugin/ResourceLoaderAware.java lucene/analysis/common/src/java/org/apache/lucene/analysis/util/ResourceLoaderAware.java
        

        I'm going to commit this and then start moving the Factory classes incrementally.

        Show
        Chris Male added a comment - Updated patch for moving ResourceLoader/Aware with a new command: svn mv solr/solrj/src/java/org/apache/solr/common/ResourceLoader.java lucene/analysis/common/src/java/org/apache/lucene/analysis/util/ResourceLoader.java svn mv solr/core/src/java/org/apache/solr/util/plugin/ResourceLoaderAware.java lucene/analysis/common/src/java/org/apache/lucene/analysis/util/ResourceLoaderAware.java I'm going to commit this and then start moving the Factory classes incrementally.
        Hide
        Chris Male added a comment -

        Patch which moves the Factory interfaces and base implementations (and some associated classes).

        Command:

        svn mkdir lucene/analysis/common/src/java/org/apache/lucene/analysis/factory
        svn mv solr/core/src/java/org/apache/solr/analysis/BaseCharFilterFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/BaseCharFilterFactory.java
        svn mv solr/core/src/java/org/apache/solr/analysis/BaseTokenFilterFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/BaseTokenFilterFactory.java
        svn mv solr/core/src/java/org/apache/solr/analysis/BaseTokenizerFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/BaseTokenizerFactory.java
        svn mv solr/core/src/java/org/apache/solr/analysis/BaseTokenStreamFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/BaseTokenStreamFactory.java
        svn mv solr/core/src/java/org/apache/solr/analysis/CharFilterFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/CharFilterFactory.java
        svn mv solr/core/src/java/org/apache/solr/analysis/MultiTermAwareComponent.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/MultiTermAwareComponent.java
        svn mv solr/core/src/java/org/apache/solr/analysis/TokenFilterFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/TokenFilterFactory.java
        svn mv solr/core/src/java/org/apache/solr/analysis/TokenizerFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/TokenizerFactory.java
        svn mv solr/core/src/java/org/apache/solr/analysis/InitializationException.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/InitializationException.java
        

        A big TODO after all this moving is to collapse the base implementations into a single BaseAnalysisFactory class and get rid of FactoryUtils.

        Tests pass.

        Show
        Chris Male added a comment - Patch which moves the Factory interfaces and base implementations (and some associated classes). Command: svn mkdir lucene/analysis/common/src/java/org/apache/lucene/analysis/factory svn mv solr/core/src/java/org/apache/solr/analysis/BaseCharFilterFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/BaseCharFilterFactory.java svn mv solr/core/src/java/org/apache/solr/analysis/BaseTokenFilterFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/BaseTokenFilterFactory.java svn mv solr/core/src/java/org/apache/solr/analysis/BaseTokenizerFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/BaseTokenizerFactory.java svn mv solr/core/src/java/org/apache/solr/analysis/BaseTokenStreamFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/BaseTokenStreamFactory.java svn mv solr/core/src/java/org/apache/solr/analysis/CharFilterFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/CharFilterFactory.java svn mv solr/core/src/java/org/apache/solr/analysis/MultiTermAwareComponent.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/MultiTermAwareComponent.java svn mv solr/core/src/java/org/apache/solr/analysis/TokenFilterFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/TokenFilterFactory.java svn mv solr/core/src/java/org/apache/solr/analysis/TokenizerFactory.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/TokenizerFactory.java svn mv solr/core/src/java/org/apache/solr/analysis/InitializationException.java lucene/analysis/common/src/java/org/apache/lucene/analysis/factory/InitializationException.java A big TODO after all this moving is to collapse the base implementations into a single BaseAnalysisFactory class and get rid of FactoryUtils. Tests pass.
        Hide
        Chris Male added a comment -

        Need to also fix the relationship between the Solr uima contrib and analyzers-common in appropriate configuration files too.

        Show
        Chris Male added a comment - Need to also fix the relationship between the Solr uima contrib and analyzers-common in appropriate configuration files too.
        Hide
        Chris Male added a comment -

        Actually I think I've been a little naive with backwards compatibility here. I think I need to ensure that any user created Factory implementations continue to work and existing schemas continue to load. Otherwise upgrading to Solr 4 is going to be an epic hassle.

        Show
        Chris Male added a comment - Actually I think I've been a little naive with backwards compatibility here. I think I need to ensure that any user created Factory implementations continue to work and existing schemas continue to load. Otherwise upgrading to Solr 4 is going to be an epic hassle.
        Hide
        Chris Male added a comment -

        Patch which does some backwards compat work for ResourceLoader/Aware.

        Deprecated versions of both are added in their old locations and they extend from those in the new locations. SolrResourceLoader implements the deprecated ResourceLoader so it can continue to be used with Factories expecting the old ResourceLoader. It also includes a check for plugins using the old ResourceLoaderAware and logs a warning recommending that they are upgraded.

        Show
        Chris Male added a comment - Patch which does some backwards compat work for ResourceLoader/Aware. Deprecated versions of both are added in their old locations and they extend from those in the new locations. SolrResourceLoader implements the deprecated ResourceLoader so it can continue to be used with Factories expecting the old ResourceLoader. It also includes a check for plugins using the old ResourceLoaderAware and logs a warning recommending that they are upgraded.
        Hide
        Steve Rowe added a comment -

        Chris, before the changes made here, solrj did not have a dependency on lucene-core, but now it does. Is it possible to move the dependency to solr-core?

        The Maven build failed today because of this <https://builds.apache.org/job/Lucene-Solr-Maven-trunk/476/consoleText>:

        [INFO] Error for project: Apache Solr Solrj (during install)
        [INFO] ------------------------------------------------------------------------
        [INFO] Compilation failure
        /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Maven-trunk/checkout/solr/solrj/src/java/org/apache/solr/common/ResourceLoader.java:[25,71] package org.apache.lucene.analysis.util does not exist

        The short-term fix for the Maven build would be just adding a lucene-core dependency to the solrj POM.

        Show
        Steve Rowe added a comment - Chris, before the changes made here, solrj did not have a dependency on lucene-core, but now it does. Is it possible to move the dependency to solr-core? The Maven build failed today because of this < https://builds.apache.org/job/Lucene-Solr-Maven-trunk/476/consoleText >: [INFO] Error for project: Apache Solr Solrj (during install) [INFO] ------------------------------------------------------------------------ [INFO] Compilation failure /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Maven-trunk/checkout/solr/solrj/src/java/org/apache/solr/common/ResourceLoader.java: [25,71] package org.apache.lucene.analysis.util does not exist The short-term fix for the Maven build would be just adding a lucene-core dependency to the solrj POM.
        Hide
        Yonik Seeley added a comment -

        before the changes made here, solrj did not have a dependency on lucene-core, but now it does

        Ugh... can we avoid this extra dependency? SolrJ is just a client and is not supposed to depend on lucene or solr core.

        Show
        Yonik Seeley added a comment - before the changes made here, solrj did not have a dependency on lucene-core, but now it does Ugh... can we avoid this extra dependency? SolrJ is just a client and is not supposed to depend on lucene or solr core.
        Hide
        Ryan McKinley added a comment -

        The only reason solrj has that dependency is for the deprecated interface:

        public interface ResourceLoader extends org.apache.lucene.analysis.util.ResourceLoader 
        

        I vote we just drop ResourceLoader from the solrj client API at 4.0 rather then 5.0

        alternatively, we could put the deprecated interface in solr-core, but that makes a mess of OSGI bundles (i think)

        Show
        Ryan McKinley added a comment - The only reason solrj has that dependency is for the deprecated interface: public interface ResourceLoader extends org.apache.lucene.analysis.util.ResourceLoader I vote we just drop ResourceLoader from the solrj client API at 4.0 rather then 5.0 alternatively, we could put the deprecated interface in solr-core, but that makes a mess of OSGI bundles (i think)
        Hide
        Chris Male added a comment -

        Ah sorry guys, I went to a lot of effort to avoid this dependency, then went and added it myself.

        I still think just dropping the interface at this stage is going to make upgrading to Solr 4 a hassle so I think the best option is to move the interface into solr-core under the same package name. I don't see why that would hurt OSGI?

        Show
        Chris Male added a comment - Ah sorry guys, I went to a lot of effort to avoid this dependency, then went and added it myself. I still think just dropping the interface at this stage is going to make upgrading to Solr 4 a hassle so I think the best option is to move the interface into solr-core under the same package name. I don't see why that would hurt OSGI?
        Hide
        Chris Male added a comment - - edited

        Tests pass so I've moved it into solr-core to remove this dependency issue. If there is some problem related to OSGI, we can then decide if we really do want to drop the interface at this stage.

        Show
        Chris Male added a comment - - edited Tests pass so I've moved it into solr-core to remove this dependency issue. If there is some problem related to OSGI, we can then decide if we really do want to drop the interface at this stage.
        Hide
        Ryan McKinley added a comment -

        I don't see why that would hurt OSGI

        I think OSGI gets upset when you have different .jar files have classes in the same package – I don't really know or care though.

        ------

        Shouldn't SolrResourceLoader depend on org.apache.lucene.analysis.util.ResourceLoader rather then the deprecated one?

        Show
        Ryan McKinley added a comment - I don't see why that would hurt OSGI I think OSGI gets upset when you have different .jar files have classes in the same package – I don't really know or care though. ------ Shouldn't SolrResourceLoader depend on org.apache.lucene.analysis.util.ResourceLoader rather then the deprecated one?
        Hide
        Chris Male added a comment -

        Shouldn't SolrResourceLoader depend on org.apache.lucene.analysis.util.ResourceLoader rather then the deprecated one?

        No, that would prevent SolrResourceLoader from being able to be used with classes that still use the deprecated ResourceLoader. This way an analysis Factory which relies on the deprecated ResourceLoader can be loaded into Solr 4 without error.

        Show
        Chris Male added a comment - Shouldn't SolrResourceLoader depend on org.apache.lucene.analysis.util.ResourceLoader rather then the deprecated one? No, that would prevent SolrResourceLoader from being able to be used with classes that still use the deprecated ResourceLoader. This way an analysis Factory which relies on the deprecated ResourceLoader can be loaded into Solr 4 without error.
        Hide
        Chris Male added a comment -

        New massive patch for handling the Base parent classes and interfaces.

        Originally I wanted to remove the Base parent classes but I think they'll be useful for validation and any future backwards compat logic.

        Focus is on bw compat here, so nothing is removed (except InitializationException which I only added recently), only deprecated.

        Show
        Chris Male added a comment - New massive patch for handling the Base parent classes and interfaces. Originally I wanted to remove the Base parent classes but I think they'll be useful for validation and any future backwards compat logic. Focus is on bw compat here, so nothing is removed (except InitializationException which I only added recently), only deprecated.
        Hide
        Robert Muir added a comment -

        I dont think back compat is really important?

        No ones code is going to work without changes for 4.0 anyway...
        I dont think we should add a bunch of dead weight classes for no reason.

        If we remove these, then do we need a .factories package? Cant we just
        put these couple classes in .util?

        Show
        Robert Muir added a comment - I dont think back compat is really important? No ones code is going to work without changes for 4.0 anyway... I dont think we should add a bunch of dead weight classes for no reason. If we remove these, then do we need a .factories package? Cant we just put these couple classes in .util?
        Hide
        Robert Muir added a comment -

        I think the interfaces are also unnecessary. Why not something like:

        • AnalysisFactory (has init(Args), luceneMatchVersion, etc etc)
          • TokenizerFactory
          • TokenFilterFactory
          • CharFilterFactory
        Show
        Robert Muir added a comment - I think the interfaces are also unnecessary. Why not something like: AnalysisFactory (has init(Args), luceneMatchVersion, etc etc) TokenizerFactory TokenFilterFactory CharFilterFactory
        Hide
        Chris Male added a comment -

        No ones code is going to work without changes for 4.0 anyway...

        Fair point

        If we remove these, then do we need a .factories package? Cant we just
        put these couple classes in .util?

        Yeah I can make that change. I'll update the patch.

        Show
        Chris Male added a comment - No ones code is going to work without changes for 4.0 anyway... Fair point If we remove these, then do we need a .factories package? Cant we just put these couple classes in .util? Yeah I can make that change. I'll update the patch.
        Hide
        Chris Male added a comment -

        TokenizerFactory
        TokenFilterFactory
        CharFilterFactory

        As classes instead of interfaces?

        Show
        Chris Male added a comment - TokenizerFactory TokenFilterFactory CharFilterFactory As classes instead of interfaces?
        Hide
        Robert Muir added a comment -

        Yes. practically, all of the impls 'extend UselessBaseXXXFactory' today anyway

        Show
        Robert Muir added a comment - Yes. practically, all of the impls 'extend UselessBaseXXXFactory' today anyway
        Hide
        Chris Male added a comment -

        Okay, good idea.

        Show
        Chris Male added a comment - Okay, good idea.
        Hide
        Robert Muir added a comment -

        re: what is the purpose of the newInstance method?

        If you take a look at org.apache.solr.analysis.DelimitedPayloadTokenFilterFactory you'll see an example of how it's used.

        Looking at the implementation in SolrResourceLoader, it seems to facilitate two things:

        The use of simplified solr.* package names
        In FSTSynonymFilterFactory for example, newInstance is used to load other components. Consequently bq. bq. SolrResourceLoader adds the instantiated classes to its tracking of SolrCoreAware, ResourceLoaderAware, bq. etc.
        With all that said, its only used in 3 Factories (but a lot of other Solr code). Perhaps we can break it bq. out somehow.

        I think we should revisit this. I don't like placing this into the analyzers module when not many factories actually use it, instead a lot of unrelated code in solr actually uses it. I think this could cause a mess.

        On the other hand, both the things this provides can be achieved in other ways. For example, if we use NamedSPILoader instead to allow components such as factories to be found by name, then we can support "solr.WhitespaceTokenizerFactory" because TokenizerFactory.forName("WhitespaceTokenizerFactory") works. Using the SPI mechanism would allow for us to have completely pluggable analysis modules, also operations like listAll() work in case you want to enumerate a list (imagine someone that doesnt want a xml configuration but configured by a GUI or something like that instead). We also keep sane packaging within the analysis modules and keep type safety, and solr still keeps its solr.XXX syntax without reflecting a zillion packages or other crazy things.

        Show
        Robert Muir added a comment - re: what is the purpose of the newInstance method? If you take a look at org.apache.solr.analysis.DelimitedPayloadTokenFilterFactory you'll see an example of how it's used. Looking at the implementation in SolrResourceLoader, it seems to facilitate two things: The use of simplified solr.* package names In FSTSynonymFilterFactory for example, newInstance is used to load other components. Consequently bq. bq. SolrResourceLoader adds the instantiated classes to its tracking of SolrCoreAware, ResourceLoaderAware, bq. etc. With all that said, its only used in 3 Factories (but a lot of other Solr code). Perhaps we can break it bq. out somehow. I think we should revisit this. I don't like placing this into the analyzers module when not many factories actually use it, instead a lot of unrelated code in solr actually uses it. I think this could cause a mess. On the other hand, both the things this provides can be achieved in other ways. For example, if we use NamedSPILoader instead to allow components such as factories to be found by name, then we can support "solr.WhitespaceTokenizerFactory" because TokenizerFactory.forName("WhitespaceTokenizerFactory") works. Using the SPI mechanism would allow for us to have completely pluggable analysis modules, also operations like listAll() work in case you want to enumerate a list (imagine someone that doesnt want a xml configuration but configured by a GUI or something like that instead). We also keep sane packaging within the analysis modules and keep type safety, and solr still keeps its solr.XXX syntax without reflecting a zillion packages or other crazy things.
        Hide
        Chris Male added a comment -

        New patch which follows Robert's suggestions.

        All parent classes are distilled down to 4 classes: AbstractAnalysisFactory, TokenizerFactory, TokenFilterFactory and CharFilterFactory.

        All implementations, Solr code and Javadocs have been updated. Tests pass.

        Show
        Chris Male added a comment - New patch which follows Robert's suggestions. All parent classes are distilled down to 4 classes: AbstractAnalysisFactory, TokenizerFactory, TokenFilterFactory and CharFilterFactory. All implementations, Solr code and Javadocs have been updated. Tests pass.
        Hide
        Chris Male added a comment -

        I think we should revisit this. I don't like placing this into the analyzers module when not many factories actually use it, instead a lot of unrelated code in solr actually uses it. I think this could cause a mess.

        I agree. It feels messy where it is currently.

        For example, if we use NamedSPILoader instead to allow components such as factories to be found by name, then we can support "solr.WhitespaceTokenizerFactory" because TokenizerFactory.forName("WhitespaceTokenizerFactory") works.

        I don't really know much about NamedSPILoader but I think what you're suggesting. How would we support Factories loading unrelated classes like they can through ResourceLoader now? Assume they're on the classpath and use Class.forName?

        Show
        Chris Male added a comment - I think we should revisit this. I don't like placing this into the analyzers module when not many factories actually use it, instead a lot of unrelated code in solr actually uses it. I think this could cause a mess. I agree. It feels messy where it is currently. For example, if we use NamedSPILoader instead to allow components such as factories to be found by name, then we can support "solr.WhitespaceTokenizerFactory" because TokenizerFactory.forName("WhitespaceTokenizerFactory") works. I don't really know much about NamedSPILoader but I think what you're suggesting. How would we support Factories loading unrelated classes like they can through ResourceLoader now? Assume they're on the classpath and use Class.forName?
        Hide
        Robert Muir added a comment -

        I don't really know much about NamedSPILoader but I think what you're suggesting. How would we support Factories loading unrelated classes like they can through ResourceLoader now? Assume they're on the classpath and use Class.forName?

        It needs more discussion (and input from Uwe would help!), but it works like Charset.forName("ASCII") etc. We use this already for codecs and postingsformats (Codec.forName, Codec.listAllCodecs, ...).

        Have a look at lucene/core/src/resources/META-INF/services for the idea. Basically you "register" your classes in
        your jar file this way: additional jar files (e.g. look at lucene/test-framework/src/resources/META-INF) can load more classes.

        So this could support some idea like TokenizerFactory.forName("Whitespace") or something simple like that. So someone would not need to use org.apache.solr.analysis.xxx namespace to be able to load their analyzer stuff easily, they use whatever package they want and register in their META_INF. And added jar files (other analysis jars), are automatically available this way.

        I think Uwe mentioned this idea before, though I think he had Analyzers in mind (e.g. provide language code and get back analyzer or something). Anyway thats for another issue

        Just something worth consideration if we want to make these modules really pluggable. On the other hand we shouldn't use anything overkill if its not the right fit...

        Show
        Robert Muir added a comment - I don't really know much about NamedSPILoader but I think what you're suggesting. How would we support Factories loading unrelated classes like they can through ResourceLoader now? Assume they're on the classpath and use Class.forName? It needs more discussion (and input from Uwe would help!), but it works like Charset.forName("ASCII") etc. We use this already for codecs and postingsformats (Codec.forName, Codec.listAllCodecs, ...). Have a look at lucene/core/src/resources/META-INF/services for the idea. Basically you "register" your classes in your jar file this way: additional jar files (e.g. look at lucene/test-framework/src/resources/META-INF) can load more classes. So this could support some idea like TokenizerFactory.forName("Whitespace") or something simple like that. So someone would not need to use org.apache.solr.analysis.xxx namespace to be able to load their analyzer stuff easily, they use whatever package they want and register in their META_INF. And added jar files (other analysis jars), are automatically available this way. I think Uwe mentioned this idea before, though I think he had Analyzers in mind (e.g. provide language code and get back analyzer or something). Anyway thats for another issue Just something worth consideration if we want to make these modules really pluggable. On the other hand we shouldn't use anything overkill if its not the right fit...
        Hide
        Robert Muir added a comment -

        How would we support Factories loading unrelated classes like they can through ResourceLoader now? Assume they're on the classpath and use Class.forName?

        I think there are only a few situations of this? Like your payload example? If PayloadEncoder really needs to be
        pluggable by class then you always also put it under SPI too (PayloadEncoder.forName).

        In general if we decide on the SPI approach, I think it would be useful to think of improving the solr config too,
        because the current configuration is so verbose and redundant.
        e.g. for backwards compat we could support:

        <charFilter class="solr.HtmlStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
        

        but going forward this would be cleaner IMO, just use the SPI name directly:

        <charFilter name="HtmlStrip"/>
        <tokenizer name="Standard"/>
        <filter name="LowerCase"/>
        <filter name="PorterStem"/>
        
        Show
        Robert Muir added a comment - How would we support Factories loading unrelated classes like they can through ResourceLoader now? Assume they're on the classpath and use Class.forName? I think there are only a few situations of this? Like your payload example? If PayloadEncoder really needs to be pluggable by class then you always also put it under SPI too (PayloadEncoder.forName). In general if we decide on the SPI approach, I think it would be useful to think of improving the solr config too, because the current configuration is so verbose and redundant. e.g. for backwards compat we could support: <charFilter class="solr.HtmlStripCharFilterFactory"/> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.PorterStemFilterFactory"/> but going forward this would be cleaner IMO, just use the SPI name directly: <charFilter name="HtmlStrip"/> <tokenizer name="Standard"/> <filter name="LowerCase"/> <filter name="PorterStem"/>
        Hide
        Chris Male added a comment -

        That would be of a huge benefit, I definitely think it's something we should pursue.

        Show
        Chris Male added a comment - That would be of a huge benefit, I definitely think it's something we should pursue.
        Hide
        Chris Male added a comment -

        We're actually going to need this sooner rather than later, in order to continue to support the

        {solr.*}

        suffix for analysis factories. Currently SolrResourceLoader just has a list a directories it tries when presented with the suffix. When we move all the factories into their directories, it won't be able to do that any more.

        I'll experiment with this once I've committed my latest patch.

        Show
        Chris Male added a comment - We're actually going to need this sooner rather than later, in order to continue to support the {solr.*} suffix for analysis factories. Currently SolrResourceLoader just has a list a directories it tries when presented with the suffix. When we move all the factories into their directories, it won't be able to do that any more. I'll experiment with this once I've committed my latest patch.
        Hide
        Robert Muir added a comment -

        I will try to convince Uwe to look at the issue as well. He is the SPI expert, he
        will probably have some good ideas about the best approach.

        Worst case I see him tomorrow or something and hassle him about it.

        Show
        Robert Muir added a comment - I will try to convince Uwe to look at the issue as well. He is the SPI expert, he will probably have some good ideas about the best approach. Worst case I see him tomorrow or something and hassle him about it.
        Hide
        Ryan McKinley added a comment -

        On trunk (1335132) I am getting this error with windows Java(TM) SE Runtime Environment (build 1.6.0_31-b05)

        
        common.compile-core:
            [javac] Compiling 1 source file to C:\workspace\apache\lucene\solr\build\contrib\solr-analysis-extras\classes\java
            [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
            [javac] C:\workspace\apache\lucene\solr\contrib\analysis-extras\src\java\org\apache\solr\schema\ICUCollationField.java:89: error: method setup in class ICUCollationField cannot be applied to given types;
            [javac]     setup(schema.getResourceLoader(), args);
            [javac]     ^
            [javac]   required: ResourceLoader,Map<String,String>
            [javac]   found: SolrResourceLoader,Map<String,String>
            [javac]   reason: actual argument SolrResourceLoader cannot be converted to ResourceLoader by method invocation conversion
            [javac] 1 error
            [javac] 1 warning
        

        switching SolrResourceLoader to depend on org.apache.lucene.analysis.util.ResourceLoader fixes it...

        I don't know why I see this, but not hudson

        Show
        Ryan McKinley added a comment - On trunk (1335132) I am getting this error with windows Java(TM) SE Runtime Environment (build 1.6.0_31-b05) common.compile-core: [javac] Compiling 1 source file to C:\workspace\apache\lucene\solr\build\contrib\solr-analysis-extras\classes\java [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6 [javac] C:\workspace\apache\lucene\solr\contrib\analysis-extras\src\java\org\apache\solr\schema\ICUCollationField.java:89: error: method setup in class ICUCollationField cannot be applied to given types; [javac] setup(schema.getResourceLoader(), args); [javac] ^ [javac] required: ResourceLoader,Map< String , String > [javac] found: SolrResourceLoader,Map< String , String > [javac] reason: actual argument SolrResourceLoader cannot be converted to ResourceLoader by method invocation conversion [javac] 1 error [javac] 1 warning switching SolrResourceLoader to depend on org.apache.lucene.analysis.util.ResourceLoader fixes it... I don't know why I see this, but not hudson
        Hide
        Steve Rowe added a comment -

        Ryan, compilation succeeds for me (at svn r1335131) on Win7+Cygwin, w/ Oracle JDK 1.6.0_21 - I don't see the same failure as you.

        Maybe 'ant clean' will help?

        Show
        Steve Rowe added a comment - Ryan, compilation succeeds for me (at svn r1335131) on Win7+Cygwin, w/ Oracle JDK 1.6.0_21 - I don't see the same failure as you. Maybe 'ant clean' will help?
        Hide
        Ryan McKinley added a comment -

        my bad – i was running ant compile from the solr folder and needed to run clean from up a level.

        Show
        Ryan McKinley added a comment - my bad – i was running ant compile from the solr folder and needed to run clean from up a level.
        Hide
        Chris Male added a comment -

        Patch which moves MultiTermComponent to the analysis module. Hopefully this is the last thing to move.

        Tests pass.

        Show
        Chris Male added a comment - Patch which moves MultiTermComponent to the analysis module. Hopefully this is the last thing to move. Tests pass.
        Hide
        Robert Muir added a comment -

        I dont like the type-unsafety here... shouldnt it be at least AbstractAnalysisFactory (instead of Object)?

        Show
        Robert Muir added a comment - I dont like the type-unsafety here... shouldnt it be at least AbstractAnalysisFactory (instead of Object)?
        Hide
        Chris Male added a comment -

        I agree but I don't know anything about this functionality so I'll take your word for it.

        Show
        Chris Male added a comment - I agree but I don't know anything about this functionality so I'll take your word for it.
        Hide
        Chris Male added a comment -

        Looking at the usage of this functionality in FieldTypePluginLoader, yes it can definitely be changed to AbstractAnalysisFactory since the code just checks whether it's a TokenFilterFactory, TokenizerFactory or CharFilterFactory (and throws an error on any other type).

        I'll update the patch.

        Show
        Chris Male added a comment - Looking at the usage of this functionality in FieldTypePluginLoader, yes it can definitely be changed to AbstractAnalysisFactory since the code just checks whether it's a TokenFilterFactory, TokenizerFactory or CharFilterFactory (and throws an error on any other type). I'll update the patch.
        Hide
        Chris Male added a comment -

        New patch changing MultiTermComponent over to returning an AbstractAnalysisFactory.

        Show
        Chris Male added a comment - New patch changing MultiTermComponent over to returning an AbstractAnalysisFactory.
        Hide
        Chris Male added a comment -

        Bash file for moving all the non-deprecated non-solr specific factories.

        TODO:

        • Make patch correcting package names
        • Collapse functionality from BaseTokenTestCase into BaseTokenStreamTestCase
        • Add test classes to those being moved
        Show
        Chris Male added a comment - Bash file for moving all the non-deprecated non-solr specific factories. TODO: Make patch correcting package names Collapse functionality from BaseTokenTestCase into BaseTokenStreamTestCase Add test classes to those being moved
        Hide
        Chris Male added a comment -

        Patch which simplifies the factory tests. It removes BaseTokenTestCase, moves the tests over to using TEST_CURRENT_VERSION, and cleans up some of the init(args) code.

        Show
        Chris Male added a comment - Patch which simplifies the factory tests. It removes BaseTokenTestCase, moves the tests over to using TEST_CURRENT_VERSION, and cleans up some of the init(args) code.
        Hide
        Chris Male added a comment -

        Updated move script to include tests.

        Show
        Chris Male added a comment - Updated move script to include tests.
        Hide
        Hoss Man added a comment -

        bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment

        Show
        Hoss Man added a comment - bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment
        Hide
        Uwe Schindler added a comment - - edited

        Fixed with merge of sub-task.

        We should open new issues for:

        • Fix the Factory API to be more useable for Lucene programmers from Hibernate, ES, and other projects needing configureable analyzers
        • Clean up solr/contrib/analysis-extras (this contrib contains no factory code anymore just some JARS + The ICUCollationKeyField). We can nuke the whole contrib now, users can simply extract the analysis/morfologic,icu,smarcn,... JARS and put into their Solr plugins folder. I would move the ICUCollationFieldType to core and add icu.jar there. Its already needed by other contribs, so deduplicating it by adding to core makes much sense
        • We should maybe add a script to Solr that downloads the JAR files of the former extra analysis from Maven/Ivy and install in plugin folder. No need to ship this seldom used stuff with the main WAR.
        • Automatism to create META-INF/services (also affects Codecs&PostingsFormat). 2 possibilities (using APT and do it in JavaC using custom annotation while compiling, ASM-based ANT task like forbidden APIS to find all subclasses/implementations of an interface after compiling).
        Show
        Uwe Schindler added a comment - - edited Fixed with merge of sub-task. We should open new issues for: Fix the Factory API to be more useable for Lucene programmers from Hibernate, ES, and other projects needing configureable analyzers Clean up solr/contrib/analysis-extras (this contrib contains no factory code anymore just some JARS + The ICUCollationKeyField). We can nuke the whole contrib now, users can simply extract the analysis/morfologic,icu,smarcn,... JARS and put into their Solr plugins folder. I would move the ICUCollationFieldType to core and add icu.jar there. Its already needed by other contribs, so deduplicating it by adding to core makes much sense We should maybe add a script to Solr that downloads the JAR files of the former extra analysis from Maven/Ivy and install in plugin folder. No need to ship this seldom used stuff with the main WAR. Automatism to create META-INF/services (also affects Codecs&PostingsFormat). 2 possibilities (using APT and do it in JavaC using custom annotation while compiling, ASM-based ANT task like forbidden APIS to find all subclasses/implementations of an interface after compiling).
        Hide
        Steve Rowe added a comment -

        Solr tests have been failing under Maven on ASF Jenkins since the LUCENE-4044 commits on 7/25, because the POMs for two analysis modules (morfologik and phonetic) didn't include ${project.build.resources} definitions for src/resources/, the location of the SPI configuration files META-INF/services/o.a.l.analysis.util.*Factory.

        I've added src/resources/ to these two modules' POMs:

        • r1369961: trunk
        • r1369980: branch_4x
        Show
        Steve Rowe added a comment - Solr tests have been failing under Maven on ASF Jenkins since the LUCENE-4044 commits on 7/25, because the POMs for two analysis modules (morfologik and phonetic) didn't include ${project.build.resources } definitions for src/resources/ , the location of the SPI configuration files META-INF/services/o.a.l.analysis.util.*Factory . I've added src/resources/ to these two modules' POMs: r1369961: trunk r1369980: branch_4x

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Robert Muir
          • Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development