SOLR-9163: Confusing solrconfig.xml in the downloaded solr*.zip

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.2
    • Component/s: None
    • Labels: None

      Description

      Here are the solrconfig.xml files I find when I download and unzip Solr:

      find . -name solrconfig.xml
      ./solr-5.5.1/example/example-DIH/solr/db/conf/solrconfig.xml
      ./solr-5.5.1/example/example-DIH/solr/mail/conf/solrconfig.xml
      ./solr-5.5.1/example/example-DIH/solr/rss/conf/solrconfig.xml
      ./solr-5.5.1/example/example-DIH/solr/solr/conf/solrconfig.xml
      ./solr-5.5.1/example/example-DIH/solr/tika/conf/solrconfig.xml
      ./solr-5.5.1/example/files/conf/solrconfig.xml
      ./solr-5.5.1/server/solr/configsets/basic_configs/conf/solrconfig.xml
      ./solr-5.5.1/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml
      ./solr-5.5.1/server/solr/configsets/sample_techproducts_configs/conf/solrconfig.xml
      

      Most likely, the ones I want to use are in server/solr/configsets, I assume.
      But then which ones among those three?
      Searching online does not provide much detailed information.
      And diff-ing among them yields even more confusing results.
      Example: when I diff basic_configs/conf/solrconfig.xml against data_driven_schema_configs/conf/solrconfig.xml, I am not sure why the latter has these extra constructs:

      1. solr.LimitTokenCountFilterFactory and all the comments around it.
      2. deletionPolicy class="solr.SolrDeletionPolicy"
      3. Commented out infoStream file="INFOSTREAM.txt"
      4. Extra comments for "Update Related Event Listeners"
      5. indexReaderFactory
      6. And so on, for lots of other constructs and comments.

      The point is that it is difficult to find out exactly what extra features in the latter are making it data-driven. Hence it is difficult to know what features I am losing by not taking the data-driven-schema.

      It would be good to sync the above three files (each file should have the same comments and differ only in the configuration that makes it different). Also, some good documentation about them should be put online; otherwise it is very confusing for non-committers and vanilla users.
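
One way to see past the comment noise when comparing two of these files is to diff only the elements. The following is a minimal sketch, not part of Solr: the helper names `outline` and `config_diff` are made up for illustration, and it assumes the files parse as plain XML. It relies on Python's standard `xml.etree.ElementTree` parser, which discards comments by default.

```python
import difflib
import xml.etree.ElementTree as ET


def outline(xml_text):
    """Return one line per element: indentation, tag, sorted attributes, text."""
    lines = []

    def walk(elem, depth):
        attrs = " ".join(f'{k}="{v}"' for k, v in sorted(elem.attrib.items()))
        text = (elem.text or "").strip()
        parts = [elem.tag] + ([attrs] if attrs else []) + ([text] if text else [])
        lines.append("  " * depth + " ".join(parts))
        for child in elem:
            walk(child, depth + 1)

    # The default parser drops XML comments, so only real configuration remains.
    walk(ET.fromstring(xml_text), 0)
    return lines


def config_diff(xml_a, xml_b):
    """Return the added/removed element lines between two configs, ignoring comments."""
    diff = difflib.unified_diff(outline(xml_a), outline(xml_b), lineterm="", n=0)
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
```

Reading the basic_configs and data_driven_schema_configs solrconfig.xml files (as bytes or text) and passing both to `config_diff` would list only the elements that differ, which is closer to the question of what actually makes one of them data-driven.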

      1. SOLR-9163.patch
        101 kB
        Yonik Seeley
      2. SOLR-9163.patch
        214 kB
        Yonik Seeley

        Activity

        varunthacker Varun Thacker added a comment -

        Indeed!

        I think the main problem here is that we have too many example configs. So over time when new features get added, there is no "rule" as to which configsets should be updated with an example of the feature/setting.

        Let's take the "techproducts" vs. "data_driven" config sets. I think they should differ from each other in only 3 things:

        • "techproducts" should come with pre-defined fields which are part of the sample document set.
        • "techproducts" comes with a pre-defined "/browse" request handler
        • "data_driven" comes with a custom "add-unknown-fields-to-the-schema" update processor which makes the example configset schemaless.

        But like you said, it's pretty different currently, and confusing.

        I feel we should just copy the techproducts solrconfig over to data_driven, remove "/browse", and add "add-unknown-fields-to-the-schema".

        Using the start scripts to call APIs that add the extra configuration seems trappy as well (in case we want one base config). The config would then be tied to the start scripts.

        Our tests use inclusion: <xi:include href="solrconfig.snippet.randomindexconfig.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>. So maybe we could do something like this here to be able to share them better?
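
To illustrate what XInclude-style sharing means, here is a small sketch using Python's standard `xml.etree.ElementInclude` module. It only demonstrates the XInclude mechanism and is not how Solr loads its configs. The snippet file name `solrconfig.snippet.shared.xml`, its contents, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical shared snippet; the name and contents are made up for illustration.
SNIPPETS = {
    "solrconfig.snippet.shared.xml":
        '<requestHandler name="/select" class="solr.SearchHandler"/>',
}

# A config that pulls in the shared part and adds only what is specific to it.
CONFIG = """
<config xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="solrconfig.snippet.shared.xml"/>
  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema"/>
</config>
"""


def load_snippet(href, parse, encoding=None):
    # Serve includes from the in-memory dict instead of the file system.
    return ET.fromstring(SNIPPETS[href])


def expand(config_text):
    """Parse a config and replace each xi:include with the referenced snippet."""
    root = ET.fromstring(config_text)
    ElementInclude.include(root, loader=load_snippet)
    return root
```

With this layout, each configset would carry only its own differences, and the common parts would live in one shared file.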

        varunthacker Varun Thacker added a comment -

        Another way they are different: the "text" field in techproducts is used in request handlers. There would be no "text" field in data_driven, though, so we'll need to think of something for that as well.

        arafalov Alexandre Rafalovitch added a comment -

        I wrote about where all the different examples hide as well: http://blog.outerthoughts.com/2015/11/oh-solr-home-where-art-thou/. It could be quite confusing.

        I guess the main issue is that nobody remembers which particular set of features is demonstrated in which example. So the distinction is not super clean and grows fuzzier with each addition.

        And then, of course, there is the question of which configuration to go into production with, including issues like enableRemoteStreaming being true in all the configurations provided.

        yseeley@gmail.com Yonik Seeley added a comment -

        I just ran into some of this craziness myself...
        I would have expected the differences between basic_configs and data_driven_schema_configs to only be what is necessary for "schemaless".

        It seems like to the degree possible, those configs should be identical.

        • the only difference in the schema should perhaps be the "copyField *" that's in the schemaless one? I don't like that copyField myself, but at least it's limited to the schemaless config.
        • the only difference in the solrconfig should be if add-unknown-fields-to-the-schema update processor is enabled or not (i.e. it should be defined in both).

        Everything else should be the same?
        Is there a way to use params.json or anything else to further confine the differences?
        Once we have sync'd these configsets they should be kept in sync.

        dsmiley David Smiley added a comment -

        > Is there a way to use params.json or anything else to further confine the differences?
        >
        > Once we have sync'd these configsets they should be kept in sync.

        +1 if there is; I was just thinking params.json (config overlay). Then we could have a precommit check to ensure the files are the same?
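
Such a check could be sketched roughly as follows. This is a hypothetical illustration, not an existing precommit task: the function name and the `ALLOWED` markers are assumptions, with the markers chosen from the differences discussed in this issue (the schema name, the `copyField` for `*`, and the add-unknown-fields-to-the-schema chain).

```python
import difflib

# Substrings marking lines that are allowed to differ between the two configsets.
# Assumed from the differences discussed in this issue; adjust as needed.
ALLOWED = (
    "<schema name=",
    '<copyField source="*"',
    "add-unknown-fields-to-the-schema",
)


def unexpected_differences(basic_lines, data_driven_lines, allowed=ALLOWED):
    """Return diff lines that are not covered by any allowed marker."""
    unexpected = []
    for line in difflib.ndiff(basic_lines, data_driven_lines):
        # ndiff prefixes removed lines with "- " and added lines with "+ ".
        if line.startswith(("- ", "+ ")) and not any(m in line for m in allowed):
            unexpected.append(line)
    return unexpected
```

A build step could read both files with `splitlines()` and fail if the returned list is non-empty, so that any new drift between the configsets has to be either reverted or explicitly added to the allowed list.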

        yseeley@gmail.com Yonik Seeley added a comment -

        Draft patch attached.

        Here are the schema/solrconfig differences between the two configsets after the patch:

        -<schema name="example-basic" version="1.6">
        +<schema name="example-data-driven-schema" version="1.6">
             <!-- attribute "name" is the name of this schema and is only used for display purposes.
                version="x.y" is Solr's version number for the schema syntax and 
                semantics.  It should not normally be changed by applications.
        @@ -124,7 +124,7 @@
         
             <!-- Only enabled in the "schemaless" data-driven example (assuming the client
                  does not know what fields may be searched) because it's very expensive to index everything twice. -->
        -    <!-- <copyField source="*" dest="_text_"/> -->
        +    <copyField source="*" dest="_text_"/>
        
         
        -  <!-- This enabled schemaless mode 
           <initParams path="/update/**">
             <lst name="defaults">
               <str name="update.chain">add-unknown-fields-to-the-schema</str>
             </lst>
           </initParams>
        -  -->
        
        yseeley@gmail.com Yonik Seeley added a comment -

        OK, full patch attached, essentially syncing the two configsets.
        I'll commit tomorrow as there haven't been any concerns/objections over this issue before.

        dsmiley David Smiley added a comment -

        Yonik is this just the first step in addition to some mechanism for us to ensure they stay in sync? Without some mechanism, they will fall out of sync again.

        yseeley@gmail.com Yonik Seeley added a comment -

        > Yonik is this just the first step in addition to some mechanism for us to ensure they stay in sync?

        Maybe... I don't know what the next step is, and I think manual syncing is the first step regardless.
        Hopefully at least committers will see that they are 99% the same and help keep them that way.
        I was just looking into params.json, but it didn't quite work the way I thought. I had to specify params.json in the request to get it to work.

        dsmiley David Smiley added a comment -

        Noble Paul, would the "config overlay" thing work here? I think I mistakenly suggested params.json.

        yseeley@gmail.com Yonik Seeley added a comment -

        Is everyone OK with committing this as a first (possibly only) step?
        I don't have further time to work on this right now.

        jira-bot ASF subversion and git services added a comment -

        Commit 67b638880d81fbb11abfbfc1ec93a5f3d86c3d3b in lucene-solr's branch refs/heads/master from Yonik Seeley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=67b6388 ]

        SOLR-9163: sync basic_configs w/ data_driven_schema_configs

        jira-bot ASF subversion and git services added a comment -

        Commit 1a53346c0e33956d0b568a78e8a3753bc58789c5 in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1a53346 ]

        SOLR-9163: sync basic_configs w/ data_driven_schema_configs

        yseeley@gmail.com Yonik Seeley added a comment -

        Committed.

        The duplication is a shame though...
        Longer term, it feels like we should further collapse the two config-sets into one and have some sort of simple runtime switch for "schemaless".

        hossman Hoss Man added a comment -

        I didn't notice this Jira until after Yonik's commits.

        FWIW i think making "basic_configs" bigger – particularly with so much commented out example stuff, most of which refers to fields that don't even exist in the basic_configs schema – is a bad idea.

        The intent behind basic_configs was to be just that: a very basic set of configs. Now, instead of 2 large, kitchen-sink-esque configsets (sample_techproducts and data_driven), we have 3... that doesn't feel like progress.

        yseeley@gmail.com Yonik Seeley added a comment - edited

        > commented out example stuff, most of which refers to fields that don't even exist in the basic_configs schema

        The schemas should be exactly the same now (except for the copyField).

        > FWIW i think making "basic_configs" bigger [...] is a bad idea.

        I sort of had the same thought when syncing these up... but I modeled the basic after the schemaless (instead of vice-versa) because schemaless is what you get by default when you create a core, and I didn't want to go breaking examples in documentation.

        > The intent behind basic_configs was to be just that: a very basic set of configs.

        Shouldn't schemaless just be about enabling that one feature?

        yseeley@gmail.com Yonik Seeley added a comment -

        FWIW, I'd be +1 on removing a lot of the cruft from both of the configs (and like I said, ideally just merging them and having a simple switch to turn on/off schemaless).

        hossman Hoss Man added a comment -

        > The schemas should be exactly the same now (except for the copyField).

        Except that one of them (data_driven_schema) supports adding fields automatically, while the other (basic) does not. So a bunch of commented-out hunks of solrconfig.xml that give examples of how to do something with a "price" field are viable in a data_driven_schema config set, but nonsensical in the basic configs set.

        > Shouldn't schemaless just be about enabling that one feature?

        yes, but:

        1. there is a lot of configuration involved in supporting a data_driven_schema collection (the various update processors and whatnot) that is now cluttering up the "basic" configs
        2. that sounds like a reason to delete commented-out sample cruft from data_driven, not add it to basic_configs...

        > FWIW, I'd be +1 on removing a lot of the cruft from both of the configs ...

        +1

        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.


          People

          • Assignee: Unassigned
          • Reporter: sachingoyal Sachin Goyal
          • Votes: 0
          • Watchers: 8