Solr
  1. Solr
  2. SOLR-8642

SOLR allows creation of collections with invalid names

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 6.0
    • Fix Version/s: 5.5, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      Some of my colleagues and I recently noticed that the CREATECOLLECTION API will create a collection even when invalid characters are present in the name.

      For example, consider the following reproduction case, which involves creating a collection with a space in its name:

      $ <clean checkout of SOLR master/trunk>
      $ bin/solr start -e cloud -noprompt
          ...
      $ curl -i -l -k -X GET "http://localhost:8983/solr/admin/collections?action=CREATE&name=getting+started&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=gettingstarted"
      HTTP/1.1 200 OK
      Content-Type: application/xml; charset=UTF-8
      Transfer-Encoding: chunked
      
      <?xml version="1.0" encoding="UTF-8"?>
      <response>
      <lst name="responseHeader"><int name="status">0</int><int name="QTime">299</int></lst><lst name="failure"><str>org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://127.0.1.1:8983/solr: Error CREATEing SolrCore 'getting started_shard2_replica2': Unable to create core [getting started_shard2_replica2] Caused by: Invalid core name: 'getting started_shard2_replica2' Names must consist entirely of periods, underscores and alphanumerics</str><str>org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://127.0.1.1:7574/solr: Error CREATEing SolrCore 'getting started_shard2_replica1': Unable to create core [getting started_shard2_replica1] Caused by: Invalid core name: 'getting started_shard2_replica1' Names must consist entirely of periods, underscores and alphanumerics</str><str>org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://127.0.1.1:7574/solr: Error CREATEing SolrCore 'getting started_shard1_replica1': Unable to create core [getting started_shard1_replica1] Caused by: Invalid core name: 'getting started_shard1_replica1' Names must consist entirely of periods, underscores and alphanumerics</str><str>org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://127.0.1.1:8983/solr: Error CREATEing SolrCore 'getting started_shard1_replica2': Unable to create core [getting started_shard1_replica2] Caused by: Invalid core name: 'getting started_shard1_replica2' Names must consist entirely of periods, underscores and alphanumerics</str></lst>
      </response>
      $ 
      $ curl -i -l -k -X GET "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true"
      HTTP/1.1 200 OK
      Content-Type: application/json; charset=UTF-8
      Transfer-Encoding: chunked
      
      {
        "responseHeader":{
          "status":0,
          "QTime":6},
        "cluster":{
          "collections":{
           ...
            "getting started":{
              "replicationFactor":"2",
              "shards":{
                "shard1":{
                  "range":"80000000-ffffffff",
                  "state":"active",
                  "replicas":{}},
                "shard2":{
                  "range":"0-7fffffff",
                  "state":"active",
                  "replicas":{}}},
              "router":{"name":"compositeId"},
              "maxShardsPerNode":"2",
              "autoAddReplicas":"false",
              "znodeVersion":1,
              "configName":"gettingstarted"},
          "live_nodes":["127.0.1.1:8983_solr",
            "127.0.1.1:7574_solr"]}}
      

      The commands/responses above suggest that Solr creates the collection without checking the name. It then goes on to create the cores for the collection, which fails and returns the error seen above.

      I verified this by doing a curl -i -l -k "http://localhost:8983/solr/admin/cores; as expected the cores were not actually created. (This is probably thanks to Erick's work on SOLR-8308).

      This bug is a problem because it will create collections which can never be backed up with actual cores.

      Seems like the same name-verification that 8308 added to cores should also be applied to collections.

      1. SOLR-8642.patch
        12 kB
        Erick Erickson
      2. SOLR-8642.patch
        12 kB
        Jason Gerlowski
      3. SOLR-8642.patch
        12 kB
        Jason Gerlowski
      4. SOLR-8642.patch
        12 kB
        Jason Gerlowski

        Issue Links

          Activity

          Hide
          Jason Gerlowski added a comment -

          I plan on adding a patch to fix this issue later on today.

          Show
          Jason Gerlowski added a comment - I plan on adding a patch to fix this issue later on today.
          Hide
          Shawn Heisey added a comment -

          Technically speaking, we do not have any name enforcement (things like collections, cores, fields). We have a list of recommendations, but I do not know how well these are relayed in the documentation.

          I believe that there should be strict enforcement of valid characters in identifiers. See SOLR-8110.

          Show
          Shawn Heisey added a comment - Technically speaking, we do not have any name enforcement (things like collections, cores, fields). We have a list of recommendations , but I do not know how well these are relayed in the documentation. I believe that there should be strict enforcement of valid characters in identifiers. See SOLR-8110 .
          Hide
          Jason Gerlowski added a comment - - edited

          Oh hmm. That's a little surprising, given the error message that gets spat out by the create-core failure:

          "Invalid core name: 'getting started_shard2_replica2' Names must consist entirely of periods, underscores and alphanumerics"

          (text taken from description above, bolding added for emphasis)

          The message above seems to imply that this is a hard requirement, and not just a recommendation (for cores at least). This might be a recent development though. (edit: it appears this check/message is recent. Was added 2 days ago in SOLR-8308). Regardless of the current behavior, I share your opinion that these should be requirements.

          Show
          Jason Gerlowski added a comment - - edited Oh hmm. That's a little surprising, given the error message that gets spat out by the create-core failure: " Invalid core name : 'getting started_shard2_replica2' Names must consist entirely of periods, underscores and alphanumerics" (text taken from description above, bolding added for emphasis) The message above seems to imply that this is a hard requirement, and not just a recommendation (for cores at least). This might be a recent development though. (edit: it appears this check/message is recent. Was added 2 days ago in SOLR-8308 ). Regardless of the current behavior, I share your opinion that these should be requirements.
          Hide
          Shawn Heisey added a comment -

          Interesting. I was not aware that any enforcement had been added. Solr is updated constantly by people all over the world, though, so sometimes things slip in that I do not notice.

          Show
          Shawn Heisey added a comment - Interesting. I was not aware that any enforcement had been added. Solr is updated constantly by people all over the world, though, so sometimes things slip in that I do not notice.
          Hide
          Erick Erickson added a comment -

          Getting something half-done then failing when it's preventable is silly. re: the discussion in 8110 I don't even really see a reason to wait for collection names. We aren't breaking existing installs, just erroring out early on create statements.

          So in the Collections API, we have
          CREATE
          CREATEALIAS.

          I don't see any problem with taking the regex defined in SOLR-8308 and using it for name validation in the above two methods at least.

          Probably want to move it into some utility class.

          Show
          Erick Erickson added a comment - Getting something half-done then failing when it's preventable is silly. re: the discussion in 8110 I don't even really see a reason to wait for collection names. We aren't breaking existing installs, just erroring out early on create statements. So in the Collections API, we have CREATE CREATEALIAS. I don't see any problem with taking the regex defined in SOLR-8308 and using it for name validation in the above two methods at least. Probably want to move it into some utility class.
          Hide
          Jason Gerlowski added a comment -

          Agreed; it seems trappy to let this behavior stay as is.

          I planned on doing exactly what Erick suggested above, on master, and branch_5x. Was aiming to get to it when I get off work later.

          Do we also want to make sure these are requirements for the v2/ APIs as well? To be honest I'm not familiar enough with them to say how identifiers are treated there (they might already have strict-enforcement of identifiers). Or maybe the code path is shared, and there's really no distinction here. Shouldn't be tough to look up, but I can't get to it right now. Just thought I'd raise the question in the meantime.

          Show
          Jason Gerlowski added a comment - Agreed; it seems trappy to let this behavior stay as is. I planned on doing exactly what Erick suggested above, on master, and branch_5x. Was aiming to get to it when I get off work later. Do we also want to make sure these are requirements for the v2/ APIs as well? To be honest I'm not familiar enough with them to say how identifiers are treated there (they might already have strict-enforcement of identifiers). Or maybe the code path is shared, and there's really no distinction here. Shouldn't be tough to look up, but I can't get to it right now. Just thought I'd raise the question in the meantime.
          Hide
          Jason Gerlowski added a comment -

          First attempt at a fix.

          A few comments:

          • pulled the identifier-validation logic into a separate class (had to tweak the message a bit to make it more generic)
          • added tests to TestCollectionAPI.java. Not sure this is the best place, but it does the job. Still getting used to figuring out where tests belong.
          • haven't yet looked into whether this has any interaction with the v2 APIs.
          Show
          Jason Gerlowski added a comment - First attempt at a fix. A few comments: pulled the identifier-validation logic into a separate class (had to tweak the message a bit to make it more generic) added tests to TestCollectionAPI.java. Not sure this is the best place, but it does the job. Still getting used to figuring out where tests belong. haven't yet looked into whether this has any interaction with the v2 APIs.
          Hide
          Anshum Gupta added a comment -

          Thanks for the patch Jason.

          Here's some feedback:

          • Why don't we just throw SolrException from validateCoreName instead?
          • In CollectionsHandler, we don't need another wrapper method if we do what I've suggested in the point above.
          • Just a minor suggestion, can you log the core/collection name in square brackets?

          The rest looks fine to me.
          I am not really familiar with the v2 APIs at this point and haven't looked at what needs to change but I can take a look at that as well unless someone who's familiar with that confirms that it wouldn't impact v2.

          Show
          Anshum Gupta added a comment - Thanks for the patch Jason. Here's some feedback: Why don't we just throw SolrException from validateCoreName instead? In CollectionsHandler, we don't need another wrapper method if we do what I've suggested in the point above. Just a minor suggestion, can you log the core/collection name in square brackets? The rest looks fine to me. I am not really familiar with the v2 APIs at this point and haven't looked at what needs to change but I can take a look at that as well unless someone who's familiar with that confirms that it wouldn't impact v2.
          Hide
          Jason Gerlowski added a comment - - edited

          Thanks for the quick feedback Anshum.

          A few thoughts/responses:

          • I initially avoided throwing SolrException in SolrIdentifierValidator because I was wary of tying that method call to the HTTP response a user gets. Just wanted to avoid it from a separation of concerns perspective. But looking at the final product, it is kindof distasteful to need to wrap the method call everywhere to catch the IAE and rethrow a SolrException. I've changed this behavior per your suggestion.
          • Done
          • Done

          Also, I forgot to mention in my last comment that this is a trunk/master patch. That was probably assumed, but just wanted to clarify.

          Show
          Jason Gerlowski added a comment - - edited Thanks for the quick feedback Anshum. A few thoughts/responses: I initially avoided throwing SolrException in SolrIdentifierValidator because I was wary of tying that method call to the HTTP response a user gets. Just wanted to avoid it from a separation of concerns perspective. But looking at the final product, it is kindof distasteful to need to wrap the method call everywhere to catch the IAE and rethrow a SolrException. I've changed this behavior per your suggestion. Done Done Also, I forgot to mention in my last comment that this is a trunk/master patch. That was probably assumed, but just wanted to clarify.
          Hide
          Jason Gerlowski added a comment -

          Fixed a small mistake caught by the tests. All tests pass locally for me now.

          Show
          Jason Gerlowski added a comment - Fixed a small mistake caught by the tests. All tests pass locally for me now.
          Hide
          ASF subversion and git services added a comment -

          Commit f77feb718a4c93516ea65f4418514202206f2703 in lucene-solr's branch refs/heads/master from Erick Erickson
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f77feb7 ]

          SOLR-8642: SOLR allows creation of collections with invalid names

          Show
          ASF subversion and git services added a comment - Commit f77feb718a4c93516ea65f4418514202206f2703 in lucene-solr's branch refs/heads/master from Erick Erickson [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f77feb7 ] SOLR-8642 : SOLR allows creation of collections with invalid names
          Hide
          Erick Erickson added a comment -

          Final patch with CHANGES.txt

          Show
          Erick Erickson added a comment - Final patch with CHANGES.txt
          Hide
          ASF subversion and git services added a comment -

          Commit 43309a4e76a4ff63dc7ac1b31ace2ec7da777734 in lucene-solr's branch refs/heads/branch_5x from Erick Erickson
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=43309a4 ]

          SOLR-8642: SOLR allows creation of collections with invalid names

          Show
          ASF subversion and git services added a comment - Commit 43309a4e76a4ff63dc7ac1b31ace2ec7da777734 in lucene-solr's branch refs/heads/branch_5x from Erick Erickson [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=43309a4 ] SOLR-8642 : SOLR allows creation of collections with invalid names
          Hide
          Erick Erickson added a comment -

          Thanks Jason!

          Show
          Erick Erickson added a comment - Thanks Jason!
          Hide
          Jason Gerlowski added a comment - - edited

          So, two questions now that I've gotten a little distance from this:

          • This JIRA added verification to CREATECOLLECTION and CREATEALIAS. Should we also be adding this sort of verification to CREATESHARD? It takes a "shard" parameter that is probably subject to the same pitfalls that collection names are. Maybe this is a different issue, and it shouldn't be lumped in with collection names. Just wanted to bring it up. Happy to spin that discussion off into a separate JIRA if needed.
          • When I initially looked at the regex, I was surprised that hyphens ("-") aren't allowed. Seems like a common character to disallow. Does anyone know of a JIRA where I can read more about where the recommendations came from? Just curious to see how the recommendations arose.
          Show
          Jason Gerlowski added a comment - - edited So, two questions now that I've gotten a little distance from this: This JIRA added verification to CREATECOLLECTION and CREATEALIAS. Should we also be adding this sort of verification to CREATESHARD? It takes a "shard" parameter that is probably subject to the same pitfalls that collection names are. Maybe this is a different issue, and it shouldn't be lumped in with collection names. Just wanted to bring it up. Happy to spin that discussion off into a separate JIRA if needed. When I initially looked at the regex, I was surprised that hyphens ("-") aren't allowed. Seems like a common character to disallow. Does anyone know of a JIRA where I can read more about where the recommendations came from? Just curious to see how the recommendations arose.
          Hide
          Shawn Heisey added a comment -

          I was surprised that hyphens ("-") aren't allowed.

          I use hyphens in some of my core names, but only for the shards on one of my three indexes.

          If it makes sense to exclude them, then please don't let my mistake change that plan. I can change my names.

          Show
          Shawn Heisey added a comment - I was surprised that hyphens ("-") aren't allowed. I use hyphens in some of my core names, but only for the shards on one of my three indexes. If it makes sense to exclude them, then please don't let my mistake change that plan. I can change my names.
          Hide
          Jason Gerlowski added a comment - - edited

          I guess that's the question I'm actually asking: does rejecting names with hyphens make sense? I'm sure there was a reason that the Solr recommendations warned against using hyphens when they were initially written. Does anyone know what that rationale was, whether it's still valid, or where I could go to read up on it?

          I don't have anything for or against them in names personally. Just wanted to double-check (if I can) that we're not being unnecessarily restrictive.

          (Sorry for bringing this up after the fact by the way; probably should've looked into this before uploading my patch.)

          Show
          Jason Gerlowski added a comment - - edited I guess that's the question I'm actually asking: does rejecting names with hyphens make sense? I'm sure there was a reason that the Solr recommendations warned against using hyphens when they were initially written. Does anyone know what that rationale was, whether it's still valid, or where I could go to read up on it? I don't have anything for or against them in names personally. Just wanted to double-check (if I can) that we're not being unnecessarily restrictive. (Sorry for bringing this up after the fact by the way; probably should've looked into this before uploading my patch.)
          Hide
          Erick Erickson added a comment -

          bq: I guess that's the question I'm actually asking: does rejecting names with hyphens make sense?

          IMO, yes. It's just too easy to have the hyphens confused with the MUST_NOT operation. Sure, it may not be a problem with core specifications, but why risk it?

          bq: Should we also be adding this sort of verification to CREATESHARD?

          I should think so, we should do it under a separate JIRA though IMO.

          Show
          Erick Erickson added a comment - bq: I guess that's the question I'm actually asking: does rejecting names with hyphens make sense? IMO, yes. It's just too easy to have the hyphens confused with the MUST_NOT operation. Sure, it may not be a problem with core specifications, but why risk it? bq: Should we also be adding this sort of verification to CREATESHARD? I should think so, we should do it under a separate JIRA though IMO.
          Hide
          Jason Gerlowski added a comment -

          Thanks for the clarification Erick. I'll file a separate JIRA and upload a patch when I get off work today.

          Show
          Jason Gerlowski added a comment - Thanks for the clarification Erick. I'll file a separate JIRA and upload a patch when I get off work today.
          Hide
          Mark Miller added a comment -

          This type of change is a little scary in terms of back compat. There are lot of systems that could create collection automatically, for example, via a script and use collection aliases to juggle collections (eg create and remove them). They could easily be using things like hyphens (which I don't agree should ever have been decided as invalid anyway, but unrelated). A change like this in a point release should probably have just stuck to chars we know are an issue, especially with 6 so close to release.

          Show
          Mark Miller added a comment - This type of change is a little scary in terms of back compat. There are lot of systems that could create collection automatically, for example, via a script and use collection aliases to juggle collections (eg create and remove them). They could easily be using things like hyphens (which I don't agree should ever have been decided as invalid anyway, but unrelated). A change like this in a point release should probably have just stuck to chars we know are an issue, especially with 6 so close to release.
          Hide
          Yago Riveiro added a comment - - edited

          Hi,

          I can't believe that I can't use a hyphen to create my collections ... I have thousand of collection with hyphens and basically I have a automatic system that creates the collections on the fly, and codebase that relay in collection names.

          Sorry but this change can't be done without a API that allow rename a collection.

          I can't upgrade to 5.5 because I can't create collections. This kind of changes can't go in the middle of a major release. This enforcing should be optional.

          In 4.x someone decides that DocValues in disk doesn't make sense and deprecated it in the middle of a major release, 10T of data to optimize to wipe the Disk format to use de "default" and 3 month to do it without downtime. Now I can create collections because someone "decides" that hyphens are not allowed. (I use Solr since 3.x, no problems with hyphens).

          Sorry but this is annoying level 9999.

          Show
          Yago Riveiro added a comment - - edited Hi, I can't believe that I can't use a hyphen to create my collections ... I have thousand of collection with hyphens and basically I have a automatic system that creates the collections on the fly, and codebase that relay in collection names. Sorry but this change can't be done without a API that allow rename a collection. I can't upgrade to 5.5 because I can't create collections. This kind of changes can't go in the middle of a major release. This enforcing should be optional. In 4.x someone decides that DocValues in disk doesn't make sense and deprecated it in the middle of a major release, 10T of data to optimize to wipe the Disk format to use de "default" and 3 month to do it without downtime. Now I can create collections because someone "decides" that hyphens are not allowed. (I use Solr since 3.x, no problems with hyphens). Sorry but this is annoying level 9999.
          Hide
          Jason Gerlowski added a comment -

          Sorry to hear that this is causing so much trouble for you Yago. There's been a general consensus that this JIRA was a bit overzealous, and that hyphens should be allowed in collection/core/shard names. There was already a discussion of this on the related SOLR-8725. I'd recommend reading the discussion there.

          To summarize, that JIRA culminated in a patch which allows hyphens in names. Anshum Gupta merged this on master, and branch_6x. There was also a consensus that the change should be back-ported to the next 5x point release. But there hasn't been a 5x point release since then, so there's no immediate relief for your concern.

          I'm keeping tabs on this to make sure it gets in if/when the community decides on doing another 5x release. Or at least so I can bug a committer about looking at it.

          Hope that addresses some of your concerns, or at least brings this down to annoying level 9998 for you : ) Sorry again for the trouble.

          Show
          Jason Gerlowski added a comment - Sorry to hear that this is causing so much trouble for you Yago. There's been a general consensus that this JIRA was a bit overzealous, and that hyphens should be allowed in collection/core/shard names. There was already a discussion of this on the related SOLR-8725 . I'd recommend reading the discussion there. To summarize, that JIRA culminated in a patch which allows hyphens in names. Anshum Gupta merged this on master, and branch_6x. There was also a consensus that the change should be back-ported to the next 5x point release. But there hasn't been a 5x point release since then, so there's no immediate relief for your concern. I'm keeping tabs on this to make sure it gets in if/when the community decides on doing another 5x release. Or at least so I can bug a committer about looking at it. Hope that addresses some of your concerns, or at least brings this down to annoying level 9998 for you : ) Sorry again for the trouble.
          Hide
          Yago Riveiro added a comment - - edited

          This can be used to learn something that I did with linux some time ago. When we releases and API, we release legacy, because people will develop a codebase using it (this include the wrong behaviours).

          If the API is broken, people like me will be in troubles. This is the reason to see system calls with the same name and a number in the end and are deprecated like 10 years later.

          Improvements are good, And I believe that this is doing for a good reason, but without tools that allow people to migrate from older behaviours are not useful.

          Solr should have an LTS version, or at least don't introduce BC in a major release. It's not the first time that I pass for this situation, and every time that I need to explain to my boss that something is broken in our current version but we can't upgrade because other thing is broken in next version, I feel his assassin instinct

          Annoying level to 9997

          Show
          Yago Riveiro added a comment - - edited This can be used to learn something that I did with linux some time ago. When we releases and API, we release legacy, because people will develop a codebase using it (this include the wrong behaviours). If the API is broken, people like me will be in troubles. This is the reason to see system calls with the same name and a number in the end and are deprecated like 10 years later. Improvements are good, And I believe that this is doing for a good reason, but without tools that allow people to migrate from older behaviours are not useful. Solr should have an LTS version, or at least don't introduce BC in a major release. It's not the first time that I pass for this situation, and every time that I need to explain to my boss that something is broken in our current version but we can't upgrade because other thing is broken in next version, I feel his assassin instinct Annoying level to 9997
          Hide
          harcor added a comment -

          I agree with Yago, you can not just change the pattern for the collection name at will. I am sure the "space" in the name is a good idea to enforce but a "word delimiter" like a dash or underscore should be valid.

          Show
          harcor added a comment - I agree with Yago, you can not just change the pattern for the collection name at will. I am sure the "space" in the name is a good idea to enforce but a "word delimiter" like a dash or underscore should be valid.

            People

            • Assignee:
              Erick Erickson
              Reporter:
              Jason Gerlowski
            • Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development