SOLR-6952: Re-using data-driven configsets by default is not helpful

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0
    • Fix Version/s: 5.0, 6.0
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      When creating collections (I'm using the bin/solr scripts), I think we should automatically copy configsets, especially when running in "getting started mode" or data driven mode.

      I did the following:

      bin/solr create_collection -n foo
      bin/post foo some_data.csv
      

      I then created a second collection, intending to send in the same data, but this time run through a Python script that changed one value from an int to a string (since it was an enumerated type). I was surprised to get:

      Caused by: java.lang.NumberFormatException: For input string: "NA"
      at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
      at java.lang.Long.parseLong(Long.java:441)

      This happened for my new version of the data, which passes in a string instead of an int, even though the new collection had only ever seen strings for that field.
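The failure mode described above can be sketched with a toy model. This is only an illustration, not Solr's implementation: the `GuessingSchema` class and the `rating` field are hypothetical, and the sketch assumes that a data-driven schema fixes a field's type from the first value it sees.

```python
class GuessingSchema:
    """Toy stand-in for a data-driven schema: the first value seen fixes a field's type."""

    def __init__(self):
        self.field_types = {}  # field name -> int or str

    def index(self, doc):
        for field, raw in doc.items():
            if field not in self.field_types:
                # Guess the type from the first value ever seen for this field.
                try:
                    int(raw)
                    self.field_types[field] = int
                except ValueError:
                    self.field_types[field] = str
            # Later values must parse as the already-chosen type.
            self.field_types[field](raw)


shared = GuessingSchema()         # one schema object used by both collections
foo, foo2 = shared, shared

foo.index({"rating": "3"})        # "rating" is guessed as an integer type
try:
    foo2.index({"rating": "NA"})  # foo2 never saw "rating" itself, yet parsing fails
    error = None
except ValueError as exc:
    error = str(exc)
```

Because both collections point at the same schema object, the type chosen while indexing into `foo` also governs `foo2`, which is the analogue of the NumberFormatException in the report.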

      1. SOLR-6952.patch
        26 kB
        Timothy Potter
      2. SOLR-6952.patch
        14 kB
        Timothy Potter

          Activity

          noble.paul Noble Paul added a comment -

          Should it be a feature of the scripts, or should it be an option in collection creation?
          Now that we have made configsets mutable, it makes sense to make this a more accessible feature.

          gsingers Grant Ingersoll added a comment -

          To work around this, I tried this from a clean install:

          1. bin/solr -cloud
          2. bin/solr create_collection foo
          3. bin/solr create_collection foo2

          I then indexed the data into foo using the ints, followed up by indexing into foo2 using the strings, and, much to my dismay, got the same error. It turns out that the configset is being shared. This is bad, IMO. At a minimum, data-driven configsets should be copied from the default template, and we should never modify the base template for a specific instance. I am not sure about the other configsets, but my gut says we should copy, not modify.
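The difference between copying and sharing the template can be sketched as follows. This is a hedged illustration only: the `configs` dictionary is a hypothetical in-memory stand-in for the /configs tree in ZooKeeper, and `create_collection` and its `share_template` parameter are made-up names, not part of bin/solr.

```python
import copy

# Hypothetical in-memory stand-in for the /configs tree in ZooKeeper.
template = {"fields": {}}
configs = {"data_driven_schema_configs": template}


def create_collection(name, share_template=False):
    """Return the config a new collection would use."""
    if share_template:
        # Sharing: every collection points at the same mutable config.
        return configs["data_driven_schema_configs"]
    # Copying: the template is duplicated into a config named after the collection.
    configs[name] = copy.deepcopy(template)
    return configs[name]


foo = create_collection("foo")
foo2 = create_collection("foo2")
foo["fields"]["rating"] = "long"       # schema change caused by indexing into foo

shared_a = create_collection("a", share_template=True)
shared_b = create_collection("b", share_template=True)
shared_a["fields"]["rating"] = "long"  # leaks into b and into the template itself
```

With copies, the change made through `foo` stays invisible to `foo2`; with sharing, the change made through `shared_a` shows up in `shared_b` and in the base template, which is the behavior objected to above.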

          noble.paul Noble Paul added a comment -

          With ease of use in mind, the script should create a copy by default unless specified otherwise.

          thelabdude Timothy Potter added a comment - - edited

          How should the user specify they want to reuse a config that already exists in ZooKeeper instead of creating a new config in ZK by copying the template? The default behavior will copy the template and name the config the same name as the collection in ZK. Maybe something like a "-sharedConfig" option?

          bin/solr create_collection -n foo -sharedConfig data_driven_schema_configs
          

          This means to use the data_driven_schema_configs as-is in ZooKeeper and not copy it to a new config directory. I like making the "shared" concept explicit in the param / help for the command but open to other approaches too.

          Alternatively, we can change the interface to create_collection / create_core to use a -t parameter (t for template) and then make the -c optional, giving us:

          Example 1:

          bin/solr create_collection -n foo -t data_driven_schema_configs
          

          Result will be to copy the data_driven_schema_configs directory to ZooKeeper as /configs/foo

          Example 2:

          bin/solr create_collection -n foo -t data_driven_schema_configs -c shared
          

          Result will be to copy the data_driven_schema_configs directory to ZooKeeper as /configs/shared

          Of course, if /configs/shared already exists, then it will be used without uploading anything new ...
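The two examples in this proposal, plus the reuse case, can be sketched as a small decision function. This is a rough model of the proposal as worded here, not the script's actual logic: `plan_create` and the `existing` set of config names already in ZooKeeper are hypothetical.

```python
def plan_create(collection, template, config_name=None, existing=frozenset()):
    """Return (zk_config_path, upload_from) for the proposed -n / -t / -c options.

    upload_from is the template directory to upload, or None when a config
    with that name already exists and is reused as-is.
    """
    name = config_name or collection  # -c is optional; default to the collection name
    path = "/configs/" + name
    upload_from = None if name in existing else template
    return path, upload_from


example_1 = plan_create("foo", "data_driven_schema_configs")
example_2 = plan_create("foo", "data_driven_schema_configs", config_name="shared")
reuse = plan_create(
    "foo", "data_driven_schema_configs", config_name="shared", existing={"shared"}
)
```

The third call models the last sentence of the proposal: when /configs/shared already exists, nothing new is uploaded.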

          noble.paul Noble Paul added a comment -

          I would say we should first add support for this in the Collection API with an extra request param. The Collection API should copy a config to a new dir if that param is passed.

          The script should turn that param ON by default. The reason is that, going forward, config is editable through configoverlay.json and params.json. So shared configs are dangerous, and unsuspecting users will not know why things are screwed up.

          The example I would prefer:

          bin/solr create_collection -n foo -t data_driven_schema_configs -c -shareconfig
          
          thelabdude Timothy Potter added a comment -

          Collection API has nothing to do with loading a configuration into ZooKeeper. Currently, you use zkCli.sh/bat to load a configuration directory into ZooKeeper and when doing so, you can assign any name you want to the configuration directory that is uploaded. Since bin/solr is being fixed to handle copying vs. sharing by default, I don't think there are any changes needed to the Collection API.

          noble.paul Noble Paul added a comment - - edited

          "Collection API has nothing to do with loading a configuration into ZooKeeper"

          I know that. I meant that anyone who creates a collection through the HTTP API instead of the script misses the fun.

          thelabdude Timothy Potter added a comment -

          Here's a patch that implements the desired behavior. Easiest way to understand is to look at a few examples:

          Example 1

          bin/solr create -n foo
          

          Will upload the data_driven_schema_configs directory (the default) into ZooKeeper as /configs/foo, i.e. the data_driven_schema_configs "template" is copied to a unique config directory in ZooKeeper using the name of the collection you are creating.

          Example 2

          bin/solr create -n foo2 -t basic_configs -c SharedBasicSchema
          

          Will upload the basic_configs directory into ZooKeeper as /configs/SharedBasicSchema. If one wants to reuse the SharedBasicSchema configuration directory when creating another collection, they can just do: bin/solr create -n foo3 -c SharedBasicSchema

          If we're happy with this approach, I'll port over the changes to solr.cmd (for Windows)

          thelabdude Timothy Potter added a comment -

          Actually, since I'm tweaking the arg names of bin/solr create options, I think I'll just line them up with what was already being done in zkcli.sh. Specifically, I'm going to change the options to be:

          -c = name of collection or core to create (was -n)
          -d = configuration directory to copy (was -c)
          -n = configuration name (didn't exist)
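For readers comparing commands from earlier in this thread, the renaming listed above can be expressed as a simple lookup. This is only a sketch: `RENAMED` and `translate` are hypothetical helpers, and the mapping covers just the two "was" entries from the list, not the interim -t option from the first patch.

```python
# Old flag -> new flag, taken from the "was" notes in the list above.
RENAMED = {"-n": "-c", "-c": "-d"}


def translate(old_args):
    """Rewrite an old-style argument list using the renamed flags."""
    # Each argument is looked up once, so "-n" -> "-c" is not re-mapped to "-d".
    return [RENAMED.get(arg, arg) for arg in old_args]


new_args = translate(["-n", "foo", "-c", "basic_configs"])
```

The new -n option (configuration name) has no old equivalent, so it never appears as a result of this translation.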
          
          noble.paul Noble Paul added a comment -

          What are the long names?

          thelabdude Timothy Potter added a comment -

          same as zkcli.sh

          thelabdude Timothy Potter added a comment -

          Here's an updated patch that changes around some of the parameter names to be consistent with the zkcli.sh script. I also tackled the "create" alias (SOLR-6933) in this patch since it was easier to address both issues with one patch.

          Example 1

          bin/solr create -c foo
          

          This is equivalent to doing:

          bin/solr create -c foo -d data_driven_schema_configs
          

          or

          bin/solr create -c foo -d data_driven_schema_configs -n foo
          

          The create action will upload the data_driven_schema_configs directory (the default) into ZooKeeper as /configs/foo, i.e. the data_driven_schema_configs "template" is copied to a unique config directory in ZooKeeper using the name of the collection you are creating.

          Example 2

          bin/solr create -c foo2 -d basic_configs -n SharedBasicSchema
          

          This will upload the basic_configs directory into ZooKeeper as /configs/SharedBasicSchema. If one wants to reuse the SharedBasicSchema configuration directory when creating another collection, they can just do:

          bin/solr create -c foo3 -n SharedBasicSchema
          

          Going to start porting these changes to the Windows solr.cmd, so please speak up now or this is what we'll have for 5.0
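The defaulting rules in these examples can be summarized in a short sketch. It models only the defaults stated in this comment (-d falls back to data_driven_schema_configs, -n falls back to the collection name); the `resolve` function is hypothetical, and the sketch does not model whether an existing config in ZooKeeper is reused or uploaded again.

```python
DEFAULT_CONFDIR = "data_driven_schema_configs"


def resolve(collection, confdir=None, confname=None):
    """Fill in the defaults described for the -c, -d and -n options."""
    name = confname or collection  # -n defaults to the collection name
    return {
        "collection": collection,
        "confdir": confdir or DEFAULT_CONFDIR,  # -d defaults to the data-driven template
        "zk_path": "/configs/" + name,
    }


# The three equivalent forms from Example 1.
form_a = resolve("foo")
form_b = resolve("foo", confdir="data_driven_schema_configs")
form_c = resolve("foo", confdir="data_driven_schema_configs", confname="foo")

# Example 2 and the follow-up collection that reuses its config name.
foo2 = resolve("foo2", confdir="basic_configs", confname="SharedBasicSchema")
foo3 = resolve("foo3", confname="SharedBasicSchema")
```

Under these assumptions all three forms of Example 1 resolve to /configs/foo, while foo2 and foo3 both resolve to /configs/SharedBasicSchema.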

          jira-bot ASF subversion and git services added a comment -

          Commit 1651231 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1651231 ]

          SOLR-6952: bin/solr create action should copy configset directory instead of reusing an existing configset in ZooKeeper by default; commit also includes fix for SOLR-6933 - create alias

          jira-bot ASF subversion and git services added a comment -

          Commit 1651233 from Timothy Potter in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1651233 ]

          SOLR-6952: bin/solr create action should copy configset directory instead of reusing an existing configset in ZooKeeper by default; commit also includes fix for SOLR-6933 - create alias

          noble.paul Noble Paul added a comment -

          This has broken the blob store API.

          The schema and config are automatically created by the system for the .system collection.

          There should be a way to create a collection without creating a configset:

           bin/solr create -c .system -n .system
          
          thelabdude Timothy Potter added a comment -

          "There should be a way to create a collection without creating a configset"

          I disagree with that requirement. If something special is needed for .system I think we shouldn't expose that at the user interface level (which bin/solr create is).

          noble.paul Noble Paul added a comment -

          This has been opened as a new ticket: SOLR-7502


            People

            • Assignee:
              thelabdude Timothy Potter
              Reporter:
              gsingers Grant Ingersoll