SOLR-3105

Add analysis configurations for different languages to the example

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      I think we should have good baseline configurations for our supported analyzers
      so that it's easy for people to get started.

      1. SOLR-3105.patch
        88 kB
        Robert Muir

        Activity

        Robert Muir created issue -
        Robert Muir added a comment -

        Attached is a patch.

        First you must run 'ant sync-analyzers' (I would do this before committing), which syncs resources such as stoplists from the lucene analyzers into the conf/lang folder.

        While reviewing the configurations (testing is not done; this is just a preliminary patch), I found some issues and opportunities for improvement in some of the Lucene Analyzers too (this patch uses those same definitions), so those are folded into the patch.
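
        For illustration, a per-language field type of the kind described here might look roughly like the sketch below. The field type name and the file names under lang/ are assumptions made for this example and are not copied from the patch; the factory classes are standard Solr analysis factories, and format="snowball" assumes the synced stoplist is in Snowball format.

        ```xml
        <!-- Sketch only: a French field type reading resources from conf/lang. -->
        <!-- The name "text_fr" and both lang/ file names are hypothetical. -->
        <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <!-- strips elided articles such as l' and d' before further processing -->
            <filter class="solr.ElisionFilterFactory" articles="lang/contractions_fr.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <!-- stoplist synced from the Lucene analyzers by 'ant sync-analyzers' -->
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball"/>
            <filter class="solr.FrenchLightStemFilterFactory"/>
          </analyzer>
        </fieldType>
        ```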

        Robert Muir made changes -
        Field Original Value New Value
        Attachment SOLR-3105.patch [ 12513613 ]
        Michael McCandless added a comment -

        Wow, this is AWESOME: 28 added languages. +1!

        Steve Rowe added a comment -

        +1, nice!

        Question: you add new contraction lists for three languages to Solr's example, but shouldn't they go into the common analyzer's resources directory and be copied over by ant sync-analyzers?

        One other thing (probably a separate issue): ElisionFilter is in package o.a.l.analysis.fr, but you've added example uses with Italian and Catalan - shouldn't this class move up to package o.a.l.analysis?
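
        To make the point concrete, here is a rough sketch of an Italian field type that reuses the elision filter through its Solr factory, even though the underlying Lucene class lives in the French package. The field type name and the lang/ file names are assumptions for this example, not taken from the patch.

        ```xml
        <!-- Sketch only: the elision filter applied to Italian text. -->
        <!-- "text_it" and the lang/ file names are hypothetical. -->
        <fieldType name="text_it" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <!-- the contraction list holds Italian articles such as l, dell, un -->
            <filter class="solr.ElisionFilterFactory" articles="lang/contractions_it.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_it.txt" format="snowball"/>
            <filter class="solr.ItalianLightStemFilterFactory"/>
          </analyzer>
        </fieldType>
        ```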

        Robert Muir added a comment -

        Question: you add new contraction lists for three languages to Solr's example, but shouldn't they go into the common analyzer's resources directory and be copied over by ant sync-analyzers?

        Maybe. I put a TODO in those lists for that reason (it's the first line in each one).
        The problem is that in most cases they are tiny, so a text file is awkward. But maybe we should just do this anyway.

        One other thing (probably a separate issue): ElisionFilter is in package o.a.l.analysis.fr, but you've added example uses with Italian and Catalan - shouldn't this class move up to package o.a.l.analysis?

        Yeah, it's a little awkward: I think maybe it belongs in the .util package?

        Also, we don't sync the English stopwords (though the list does match Lucene's). So that's another improvement we could make: move those into a text file under the .en package
        instead of keeping a hardwired set in StopAnalyzer.

        I think maybe we could open issues for all of these? I don't like it either, but I decided to go with the TODO approach because
        I'm not sure it should really block this issue (to the user, it will all be the same; these are implementation details).

        Steve Rowe added a comment -

        I think maybe we could open issues for all of these? I don't like it either, but I decided to go with the TODO approach because I'm not sure it should really block this issue (to the user, it will all be the same; these are implementation details).

        +1

        Christian Moen added a comment -

        This looks very good and makes it a whole lot easier for users to get started using the inherent language capabilities. Great work, Robert.

        Jan Høydahl added a comment -

        +1

        This is very welcome

        Jan Høydahl added a comment -

        It would perhaps be cleaner to put all of these into a separate file and include it via XInclude, to keep the example schema.xml small(er); however, it seems that XInclude for the schema is broken (SOLR-3087?).
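
        A minimal sketch of what that could look like, assuming XInclude resolution works for the schema (which SOLR-3087 calls into question). The included file name is hypothetical, and the included file would need a single fieldType element as its root.

        ```xml
        <!-- Sketch only: pulling one language's field type into schema.xml. -->
        <!-- "fieldtype_text_fr.xml" is a hypothetical file in the same conf directory. -->
        <types xmlns:xi="http://www.w3.org/2001/XInclude">
          <xi:include href="fieldtype_text_fr.xml"/>
        </types>
        ```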

        Robert Muir added a comment - edited

        Jan: maybe, though I don't want the functionality to depend upon more obscure features of Solr or XML.

        And of course, it's useful to look at everything in context; these files are already huge:

        • the existing 'english' fieldTypes are 109 lines long, for a single language.
        • this is 317 lines long, for 28 languages.
        • other config files are also huge (solrconfig.xml is 1,669 lines long)

        Long term I would really prefer the field types in the schema.xml, where they will work and people will find them,
        and where we can build off of them for future things: e.g. things like better language detection integration or examples.

        Yonik Seeley added a comment -

        It would perhaps be cleaner to put all of these into a separate file and include via XInclude, to keep example schema.xml small(er)

        Yeah, that might be nice.

        Robert Muir added a comment -

        Sure, but then the English fieldtypes go with it!

        Hoss Man added a comment -

        I'm with Robert ... this issue is about coming up with good example configs for as many languages as we can. At the moment we have one big fat kitchen-sink set of example configs, so let's use what we've got.

        If people care strongly, we can track cleaning up and re-organizing the examples (to use XInclude, or add multiple more specifically targeted sets of example configs, etc.) in a different issue.

        Christian Moen added a comment -

        Hoss, +1.

        Yonik Seeley added a comment -

        at the moment we have one big fat kitchen-sink set of example configs

        They aren't necessarily supposed to be. It's case-by-case (and in the past I've routinely tried to clean out less useful or less widely applicable stuff).
        I'm not against this issue, but we shouldn't use logic like "if it's already bad, it doesn't matter if we make it worse".
        We should probably take a look at reducing some of the logging we spew at startup too (yes, in a separate issue).

        Robert Muir added a comment -

        I don't think this is really adding to a kitchen sink or 'making the example worse'.

        Not trying to complain about the time here, but it's not like I just quickly slapped a bunch of XML on a JIRA issue to bloat the config.

        Take a look at the patch.

        Mark Miller added a comment -

        Jan: maybe, though I don't want the functionality to depend upon more obscure features of Solr or XML.

        The other issue is that XInclude is not supported when loading config from ZooKeeper, so we would probably have to create another example set without it. I'm not that familiar with XInclude, so perhaps support for ZooKeeper could be added, but offhand I would assume that is not the case.

        rmuir committed 1241878 (68 files)
        Reviews: none

        SOLR-3097, SOLR-3105: add fieldtypes for different languages to the example

        Lucene trunk
        rmuir committed 1241884 (72 files)
        Reviews: none

        SOLR-3097, SOLR-3105: add fieldtypes for different languages to the example

        Lucene branch_3x
        Robert Muir made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            1
            Watchers:
            1