Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, 5.0
    • Component/s: None
    • Labels: None

      Description

      We should be able to create a collection where sharding is done based on the value of a given field.

      Collections can be created with shardField=fieldName, which will be persisted in DocCollection in ZK.

      The implicit DocRouter would look at this field instead of the _shard_ field.

      CompositeIdDocRouter can also use this field instead of looking at the id field.

          Activity

          Jack Krupansky added a comment -

          Some clarification is needed:

          1. Is this simply telling SolrCloud to use a different field for the key to be sharded? With no additional semantics?

          2. Or, is this saying that all documents with a particular value in that field will be guaranteed to be in the same shard (e.g., so that grouping works properly)?

          I'm hoping it is the latter.

          Thanks.

          Noble Paul added a comment -

          Jack, I think I partially understand you.

          Yes, docs with the same value in a field WILL go to the same shard.

          In the case of the 'implicit' router there is a 1:1 mapping between the field value and the shard.

          In the case of the compositeId router there will be an n:1 mapping between the field value and the shard.
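
The two mappings described above can be sketched roughly as follows. This is a minimal illustration, not Solr's DocRouter API: the class and method names are hypothetical, and String.hashCode() with a modulo stands in for whatever hash function and hash ranges the real routers use.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only; names are hypothetical and this is not Solr code.
class FieldRoutingSketch {

    // 'implicit'-style routing: the field value is taken as the shard name (1:1).
    static String routeImplicit(String fieldValue, List<String> shardNames) {
        if (!shardNames.contains(fieldValue)) {
            throw new IllegalArgumentException("No shard named " + fieldValue);
        }
        return fieldValue;
    }

    // Hash-style routing: many field values land on each shard (n:1).
    // hashCode() and modulo are stand-ins for the real hash and hash ranges.
    static String routeHashed(String fieldValue, List<String> shardNames) {
        int slot = Math.floorMod(fieldValue.hashCode(), shardNames.size());
        return shardNames.get(slot);
    }

    public static void main(String[] args) {
        List<String> shards = Arrays.asList("shard1", "shard2", "shard3");
        System.out.println(routeImplicit("shard2", shards));
        // The same field value always yields the same shard.
        System.out.println(routeHashed("electronics", shards));
        System.out.println(routeHashed("electronics", shards));
    }
}
```

In both cases, documents that share a field value end up on the same shard, which is the guarantee Jack asked about.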

          Jack Krupansky added a comment -

          Does this proposal eliminate the need to do explicit routing in the key values?

          So, instead of having to say "my-value!key-value" for the key value when some other field already has "my-value" in it, I can just leave my key as "key-value" and with this proposal Solr would read that other field to get "my-value" and use it for sharding?

          Jack Krupansky added a comment -

          Will SplitShard preserve the grouping by field value? I imagine it would, but...

          In other words, if an app uses a field to preserve grouping of similar documents on the same shard, SplitShard should preserve that grouping on a split, right?

          As long as the SplitShard code knows that it is supposed to use the specified alternative sharding field, things should be okay.

          Yonik Seeley added a comment -

          CompositeIdDocRouter can also use this field instead of looking at the id field.

          Agree - I could see by default, the compositeId router also paying attention to the _shard_ parameter (as the implicit router does).
          Even if the implicit router is configured to pay attention to a field other than _shard_ in the document, it should still use _shard_ when looking at query parameters.

          This has some downsides too, though - related to splits and how to calculate the hash (store the shard param when explicitly specified as a column? store the calculated hash as a column?)

          Does this proposal eliminate the need to do explicit routing in the key values?

          Not sure what you mean by "explicit routing" but if you mean the compositeId stuff, no. That has a lot of benefits and will remain the default.

          Noble Paul added a comment - - edited

          I could see by default, the compositeId router also paying attention to the _shard_ parameter

          The _shard_ parameter is the actual name of the shard. In the case of the compositeId router, the client is agnostic of the shard name and all that it cares about is shard.keys. What I mean to say is, the name _shard_ can be a bit confusing.

          As of now we don't have a plan on how to do shard splitting for the 'implicit' router. Let's keep it as TBD.

          In the case of the compositeId router, I would like the part before the (!) to be read from the 'shardField'. The semantics will be exactly the same as they are now. Reading the value from a request parameter would mean we will need to persist it along with the document in some field.

          Jack Krupansky added a comment -

          Not sure what you mean by "explicit routing"

          I mean where the user has placed a prefix and "!" in front of a key value. Granted, it isn't explicitly stating the shard, and is really simply a "surrogate" key value to use for sharding. Is there better terminology for the fact that they used the "!" notation?

          Question for Noble: If a shard field is specified and there is a "!" on a document key, which takes precedence?

          Noble Paul added a comment -

          If a collection is created with the shardField value, it is a required param for all docs. If the field is null, the document addition fails. There is no more lookup for "!".
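
The rule stated above could look something like the following sketch. It assumes a document is just a map of field names to values; the class and method names are hypothetical, and the exception type is an arbitrary choice for illustration rather than what Solr throws.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only; not Solr's update path.
class ShardFieldCheckSketch {

    // Returns the routing value from the configured shard field.
    // The "!" in the id is never consulted; a missing field rejects the document.
    static String routeValue(Map<String, Object> doc, String shardField) {
        Object value = doc.get(shardField);
        if (value == null) {
            throw new IllegalArgumentException(
                "Document is missing required shard field '" + shardField + "'");
        }
        return value.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", "my-value!key-value");
        doc.put("department", "sales");
        // Routes on "sales", not on the "my-value" prefix of the id.
        System.out.println(routeValue(doc, "department"));
    }
}
```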

          Yonik Seeley added a comment -

          the _shard_ parameter is the actual name of the shard.

          For the implicit router. For a hash based router, it should be the value that is hashed to then lookup the shard based on ranges.

          In case of compositeId router , I would like to read the part before the (!) to be read from the 'shardField'.

          I think it should work more simply... _shard_ is used as the whole value to hash on for any hash based router.
          It's simple - if you want to have doc B have the exact same hash as doc A, then you give _shard_=A when adding doc B.
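
A minimal sketch of that idea, assuming the routing value simply replaces the id as the input to the hash. The names are hypothetical and String.hashCode() is only a stand-in for the router's real hash function.

```java
// Illustrative only; not Solr's hash-based router.
class RouteKeySketch {

    // When a routing value is supplied it is hashed as a whole instead of the id.
    static int hashFor(String id, String routeKey) {
        String input = (routeKey != null) ? routeKey : id;
        return input.hashCode(); // stand-in hash
    }

    public static void main(String[] args) {
        int hashA = hashFor("A", null); // doc A, routed by its own id
        int hashB = hashFor("B", "A");  // doc B, added with routing value "A"
        System.out.println(hashA == hashB);
    }
}
```

Because the whole routing value is hashed, two documents given the same value get the same hash and therefore fall in the same hash range.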

          I would like to read the part before the (!) to be read from the 'shardField'.

          Perhaps that should be a different router... compositeField rather than compositeId.

          Noble Paul added a comment -

          I think it should work simpler... _shard_ is used as the whole value to hash on for any hash based router.

          Should the field based sharding be any less powerful than compositeId? Or do we want to configure multiple fields like shardField=primaryShardField,secondaryShardField instead of separating the values with a '/'?

          Perhaps that should be a different router... compositeField rather than compositeId.

          Too many routers can be confusing to users. Essentially it is a hash router. The only difference is where the value is obtained for hashing. It could be from an 'id' (which is the default) or it can be from a separate field. We probably should rename the CompositeIdRouter to HashRouter instead of having multiple routers doing slightly different things. In reality, it is not a CompositeFieldRouter, it is just a FieldHashRouter.

          For the implicit router. For a hash based router, it should be the value that is hashed to then lookup the shard based on ranges.

          I understand that. I'm worried about the name. Should we rather not use the other parameter 'shard.keys' across router names, query and update requests? It is very confusing to have these names behaving differently in different routers.

          I'm all for changing the param from _shard_ to 'shard.keys' and keeping it consistent between all routers.

          Erick Erickson added a comment -

          What can be accomplished by this that cannot be accomplished with the current syntax?

          Weighing in late, but scanning the comments, there's no case made for why this is a better thing than using the current ! syntax. From what I can see, simplistically it looks like putting what's on the left of the ! in its own field (not a nuanced statement....).

          And I'm neutral-to-negative on it without a compelling use-case that couldn't be handled by the current syntax, mostly from the
          perspective that I'd rather see "one true way" of accomplishing something than two that can get out of synch. And
          they will. I can imagine getting shard splitting, routing and all that stuff right in one but not the other.

          One place where it'll be easy to get wrong: Joel is working on routing from the client so updates go to the right leader. We'll
          have to put this logic in that code too.

          I'm not sure the functionality is worth the complication, but maybe that's just because routing gives me a headache.

          All of the complexifications I imagine can be addressed, but is it worth the effort? Without a compelling use-case, I don't think so.

          FWIW,
          Erick

          Noble Paul added a comment -

          What can be accomplished by this that cannot be accomplished with the current syntax?

          • If I have an already working system where ids cannot be changed, I have no option with the current scheme of things.
          • What if I want to have a clean 'id' value which is devoid of extra information? Should I do id.substring(id.indexOf("!")) every time I use it elsewhere?
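
To make the second point concrete, here is a rough sketch of what a client would have to do to separate the routing prefix from the bare id under the "!" convention. The helper names are hypothetical, and the sketch only handles a single "!" separator, ignoring any other composite-id syntax.

```java
// Illustrative only; hypothetical helpers, not part of Solr or SolrJ.
class CompositeIdSketch {

    // Part before the first "!", or null when the id carries no routing prefix.
    static String routePrefix(String id) {
        int bang = id.indexOf('!');
        return bang < 0 ? null : id.substring(0, bang);
    }

    // Part after the first "!"; the whole id when there is no "!".
    // indexOf returns -1 in that case, so substring(0) yields the id unchanged.
    static String bareId(String id) {
        return id.substring(id.indexOf('!') + 1);
    }

    public static void main(String[] args) {
        System.out.println(routePrefix("my-value!key-value"));
        System.out.println(bareId("my-value!key-value"));
        System.out.println(bareId("key-value"));
    }
}
```

With a separate shard field, the id stays as "key-value" everywhere and none of this splitting is needed on the client side.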

          One place where it'll be easy to get wrong....

          AFAIK everyone relies on the DocRouter to identify the right shard. If your code is using that API then your code should continue to work correctly.

          Jack Krupansky added a comment -

          Hmmm... what happens when a document is updated and the value of this field changes? The update request would need to go to both the "new" shard to add the document, and the "old" shard to delete it, right?

          And for atomic update when the shard field is updated to a value that hashes to a different shard? The existing field values need to be read from the "old" shard and then all values written to the new shard?
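
The scenario in this question can be sketched with in-memory maps standing in for shards. This only illustrates the behaviour being asked about (delete from the old shard, add to the new one); it is not a description of what Solr does. All names are hypothetical, and hashCode() with a modulo stands in for the real hash ranges.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only; each map plays the role of one shard (id -> routing value).
class ShardMoveSketch {
    final List<Map<String, String>> shards = new ArrayList<>();

    ShardMoveSketch(int numShards) {
        for (int i = 0; i < numShards; i++) {
            shards.add(new HashMap<>());
        }
    }

    int slot(String routeValue) {
        return Math.floorMod(routeValue.hashCode(), shards.size());
    }

    // Re-index a document. If the new routing value hashes to a different shard,
    // the stale copy on the old shard has to be removed as well.
    void upsert(String id, String oldRouteValue, String newRouteValue) {
        int target = slot(newRouteValue);
        if (oldRouteValue != null) {
            int previous = slot(oldRouteValue);
            if (previous != target) {
                shards.get(previous).remove(id);
            }
        }
        shards.get(target).put(id, newRouteValue);
    }

    // Number of shards currently holding a copy of the document.
    int copies(String id) {
        int count = 0;
        for (Map<String, String> shard : shards) {
            if (shard.containsKey(id)) {
                count++;
            }
        }
        return count;
    }
}
```

Note that the caller has to know the old routing value to find the stale copy, which is part of what makes this edge case awkward.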

          Jack Krupansky added a comment -

          there's no case made for why this is a better thing than using the current ! syntax

          Logically, I think it makes perfect sense to be able to declare what field should be used for "grouping" of documents, and that some apps want more of a functional grouping (e.g., by department or product category.) Having to manually (and forever) muck up the ID field values for routing always seemed rather odd to me. Maybe the latter has some utility on its own, but the former seems more sensible to me.

          And, then there is the issue of how to change the shard of an existing document that was, in terms I use, "explicitly routed", using the "!" notation. I mean, if the ID of that document is referenced in other documents, all of those other documents would need to be manually updated as well. Before the introduction of the "!" notation, key values were completely application controlled, but with "!", suddenly Solr interjects itself into the ID generation process. Some day... even Data Import Handler users are going to start flooding the Solr-user email list with questions about how to set and change routing and why key values containing "!" seem to be causing SolrCloud to be distributing documents to shards in an unexpected manner (because they didn't know about the "!" notation.)

          Erick Erickson added a comment -

          bq: If I have a already working system where ids cannot be changed, I have no option with the current scheme of things .

          Do you have such a system? Theoretically I agree. But it also seems like this change has enough edge cases that it might be better to wait and see whether there's enough pressure to move this forward before trying to anticipate problems. Premature optimization?

          bq: If your code is using that API then your code should continue to work right...

          Don't really know, I've been meaning to dive into that patch but haven't. It's on the SolrJ side, mostly I'm using it as an example of a place things can get out of synch. I'm sure there are others.

          bq: What if I want to have a clean 'id' value which is devoid of extra information? Should I do id.substring(id.indexOf("!")) every time I use it elsewhere?

          Yeah, that's a pain. But perhaps not as much as trying to maintain two schemes to route documents and deal with the issues that are sure to come up. Frankly I don't have a firm sense of which is better/worse, my antennae are just quivering based on introducing a feature that'll have repercussions before there's a demonstrated need. I've gotten myself into trouble too often doing that...

          bq: what happens when a document is updated and the value of this field changes?

          This is exactly what I'm talking about, I'm afraid the edge cases will go on forever (or nearly). An N+1 kind of thing.

          All that said, I'm not totally against the idea. In fact I kind of wish a separate "routing field" was the way it was implemented in the first place. But did I think to suggest it when it first started to be implemented? Nooooooo.....

          But I fear at this point that having two ways of routing things around without a compelling existing use case will generate a lot of work, lots of ongoing maintenance and the effort could well be spent elsewhere in the near term.

          But since I'm not volunteering to do the work, I really don't have all that much to say.

          Yonik Seeley added a comment - - edited

          What if I want to have a clean 'id' value which is devoid of extra information? Should I do id.substring(id.indexOf("!")) every time I use it elsewhere?

          Why would you have to do that? If "!" appears in the ID field by accident sometimes, everything still works as expected with the compositeId router - that's why it's the default.

          edit: Oh, I think I see what you mean... you want to use the id unchanged as a foreign key. You could always store that as a separate field too. Anyway, I'm not arguing against using another field, but I do think it's the less common and more complex solution (given that you now need to provide that extra value everywhere).

          Noble Paul added a comment -

          Do you have such a system? .....

          Yes, I had. The entire Aol mail system already has billions of documents where the id is immutable and referenced in code. While I was there I hacked Solr into a field based sharding scheme. A lot of users will not have that expertise or patience.

          Don't really know, I've been meaning to dive into that patch but haven't.

          IIRC, SolrJ consults the DocRouter to identify the target slice/leader. If future patches need it, they should too.

          But I fear at this point that having two ways of routing things around

          We already have multiple ways of routing things after SOLR-4221 is in place (the next release will have it). Custom Sharding does not have a 'mangled id' concept as of now. It is not going to impact anyone who is already using the current scheme with compositeId. You will need to create your cluster explicitly with that option (which will be new users). We will solve any problems as we go along.

          Premature optimization?

          This is not optimization. I'm just trying to be intuitive and user-friendly. AFAIK almost all NoSQL systems do grouping on the basis of some field value.

          what happens when a document is updated and the value of this field changes?

          Good question. It should be dealt with in exactly the same way 'id' updates are handled today.

          Yonik Seeley added a comment -

          Having to manually (and forever) muck up the ID field values for routing always seemed rather odd to me.

          Having to specify extra information is what seems odd to me, and greatly complicates clients.
          Say I have a basic client that wants to do a simple get by id, or a simple delete by id. If the id no longer contains enough information to tell what shard it's on, we need to start broadcasting gets and deletes or something.
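
The concern above can be sketched as follows, under the assumption that shards are numbered slots and that hashCode() with a modulo stands in for the real hash ranges. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only; not how SolrJ or Solr actually selects shards.
class GetByIdSketch {

    static int slot(String key, int numShards) {
        return Math.floorMod(key.hashCode(), numShards);
    }

    // Routing on the id: the id alone identifies the single shard to ask.
    static List<Integer> shardsToAskById(String id, int numShards) {
        return Collections.singletonList(slot(id, numShards));
    }

    // Routing on another field: a client that only knows the id (routeValue == null)
    // has no way to pick a shard and must ask all of them.
    static List<Integer> shardsToAskByField(String routeValue, int numShards) {
        if (routeValue != null) {
            return Collections.singletonList(slot(routeValue, numShards));
        }
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            all.add(i);
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(shardsToAskById("doc1", 4).size());
        System.out.println(shardsToAskByField(null, 4).size());
        System.out.println(shardsToAskByField("sales", 4).size());
    }
}
```

The broadcast case only arises when the client cannot supply the routing value along with the id.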

          Yonik Seeley added a comment -

          It is very confusing to have these names behaving differently in different routers.

          Not sure I understand... we should definitely have the same parameters behaving in the same way across all the routers.
          _shard_ should work across all routers. I understand the naming issue though... (the fact that _shard_ is just the input to the router, not the actual shard name unless you're using the implicit router). I don't think _shard_ has even really been documented yet... it's possible we could change it to _routing_ or _route_

          Should we rather not use the other parameter 'shard.keys' across router names, query and update requests?

          I think we should use the same parameter name for query requests too (i.e. deprecate "shard.keys")

          Noble Paul added a comment -

          > I think we should use the same parameter name for query requests too (i.e. deprecate "shard.keys")

          That's it. I just wanted one parameter for routing, either _shard_ or something else. Let's use _route_ for all routers and deprecate shard.keys.

          Noble Paul added a comment -

          > Having to specify extra information is what seems odd to me, and greatly complicates clients.

          We already pass extra info if the lookup is not by id. Lookup by id is a small feature of Solr.

          Yonik Seeley added a comment -

          >> Perhaps that should be a different router... compositeField rather than compositeId.

          > Too many routers can be confusing to users.

          Heh - my favorite argument. "confusing to users" can be trotted out in any context.
          Too many options can be just as confusing... 3 routers with 5 options each vs 5 routers with 3 or whatever. Let's talk about the best option. If we have a good default and good documentation, confusion shouldn't enter the equation.

          As far as compositeId router goes, I'm not sure I care if we create a new compositeField router or if we add more parameters / functionality to compositeId. Giving the exact same _shard_ parameter should give the exact same hash code though - it shouldn't just be the first part of a composite id.

          Jack Krupansky added a comment -

          > If the id no longer contains enough information to tell what shard it's on...

          Great point. Automatic routing needs to be able to work when presented with just the ID field. An atomic update is a great example - the shard field may not be available on the client.

          Better to just forever say that automatic routing needs to be based solely on the ID key value, and that if the app needs to use the value of another field for routing, they absolutely do need to use a "composite key" with the routing key prepended to the nominal key value.

          OTOH, maybe they might want to use some other subset of the key value for routing, such as a product category that is part of a SKU used as the ID key. I think the idea there is that this would be custom sharding that uses most of the logic of CompositeId routing, but with different logic for how to extract the routing key from the full ID key value.

          Manual or custom routing is another story. There, the user can use whatever contrived "rules" they want.

          Noble Paul added a comment -

          Speaking of the best option

          my 2 cents

          2 routers

          1) A HashDocRouter
          2) An ImplicitDocRouter (or is it ExplicitRouter)

          Both honor the shardField (or routeField) param. One uses the value verbatim whereas the other uses the hash of the field value.

          HashDocRouter honors the special id format with "!".

          The _route_ param can be used and will be honored by all routers, always, in add/update/query/getbyid et al. HashDocRouter uses the hash of the value whereas ImplicitDocRouter uses the value verbatim.
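
A minimal sketch of the two-router idea described above, written in Python for brevity (Solr itself is Java). The names `route_key`, `hash_router` and `implicit_router` are hypothetical, and CRC32 with a modulo is only a stand-in for whatever hash function and hash-range mapping the real routers use.

```python
import zlib


def route_key(doc_id, route_value=None):
    """Return the string that routing is based on.

    An explicit route value (from a route field or the _route_ param) wins.
    Otherwise the part of the id before "!" is used, or the whole id when
    the id contains no "!".
    """
    if route_value is not None:
        return route_value
    prefix, sep, _rest = doc_id.partition("!")
    return prefix if sep else doc_id


def hash_router(shards, doc_id, route_value=None):
    """Hash-based routing: the shard is derived from a hash of the key."""
    key = route_key(doc_id, route_value)
    # Stand-in hash for illustration only; not the hash Solr uses.
    return shards[zlib.crc32(key.encode("utf-8")) % len(shards)]


def implicit_router(shards, route_value):
    """Verbatim routing: the value is itself the name of the target shard."""
    if route_value not in shards:
        raise ValueError("unknown shard: %r" % (route_value,))
    return route_value


shards = ["shard1", "shard2", "shard3"]
# Same route value, same shard, regardless of the document id.
assert hash_router(shards, "doc1", "acme") == hash_router(shards, "doc2", "acme")
# A "!" prefix in the id behaves like the same explicit route value.
assert hash_router(shards, "acme!doc3") == hash_router(shards, "doc4", "acme")
assert implicit_router(shards, "shard2") == "shard2"
```

In this sketch the same route value always yields the same hash, whether it arrives as a parameter or as the prefix of a composite id, which is the property asked for earlier in the thread.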

          Yonik Seeley added a comment -

          > 2) An ImplicitDocRouter (or is it ExplicitRouter)

          It's implicit if the target shard is implicitly defined by what shard received the update.
          It's explicit if you give it an explicit value (which makes the name "implicit" kind of not-so-good at that point). We could change the name of that too if we want (and make it so that "implicit" still works as an alias for back compat).

          Jack Krupansky added a comment -

          > ImplicitDocRouter

          I started referring to this as "manual routing", meaning that Solr cannot automatically figure out which shard a document is in unless the user manually/explicitly specifies the shard.

          Overall, I would say that we have this menu of routing techniques:

          1. Manual URL, specifying the shard URL or directing the request to the shard URL.
          2. Manual shard ID, specifying the shard ID/name as a parameter. SolrJ or the receiving node can look up the shard URL in clusterstate.
          3. Fully automatic, hashing the full, raw ID key value.
          4. Directed automatic or key-directed automatic, hashing the "!" prefix of the composite key value. (I called this "explicit routing" at one point.)
          5. Field-directed automatic, the proposal for using a non-ID field's value for the surrogate key to hash.

          As far as the atomic update issue for field-directed routing, there are three choices:

          1. Update request includes the specified alternative (non-ID) routing field.
          2. If not present, a "shard" parameter would be required, specifying either the shard ID or the surrogate key value to be hashed.
          3. If neither is present, an error.

          That still leaves the update issue of changing the field-directed key value. This is not just an atomic update issue - replacing the full document also has this problem, when the specified routing field value changes, which may mean that the updated document now belongs in another shard.
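
The three choices listed above, and the changed-key problem, can be sketched roughly as follows. This is an illustration only: `resolve_route` and `route_changed` are hypothetical helpers, documents are plain dicts, and `shard_param` stands for whatever request parameter ends up carrying the shard or surrogate key.

```python
def resolve_route(doc, route_field, shard_param=None):
    """Pick the routing value for an update to a field-routed collection."""
    value = doc.get(route_field)
    if value is not None:
        # Choice 1: the update itself carries the routing field.
        return value
    if shard_param is not None:
        # Choice 2: fall back to an explicit request parameter.
        return shard_param
    # Choice 3: nothing to route on.
    raise ValueError("no value for %r and no shard parameter" % (route_field,))


def route_changed(old_doc, new_doc, route_field):
    """True when an update changes the routing field's value.

    In that case the updated document may belong on a different shard.
    """
    return old_doc.get(route_field) != new_doc.get(route_field)


old = {"id": "1", "category": "books"}
new = {"id": "1", "category": "music"}
assert resolve_route(new, "category") == "music"
assert resolve_route({"id": "1"}, "category", shard_param="books") == "books"
assert route_changed(old, new, "category")
```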

          ASF subversion and git services added a comment -

          Commit 1508968 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1508968 ]

          SOLR-4221 SOLR-4808 SOLR-5006 SOLR-5017 SOLR-4222

          ASF subversion and git services added a comment -

          Commit 1508981 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1508981 ]

          SOLR-4221 SOLR-4808 SOLR-5006 SOLR-5017 SOLR-4222

          Noble Paul added a comment -

          This fixes the case of the 'implicit' router only. Will resolve after the same is done for the compositeId router too.

          Jack Krupansky added a comment -

          It seems like there was a lot of discussion that was never resolved, and now the issue is marked as "fixed", with no discussion or summary of how the discussion points were addressed or resolved (or ignored!).

          A short summary would be nice.

          Noble Paul added a comment -

          The issue was not completely resolved. The compositeId router still does not honor the 'routeField' attribute.

          Noble Paul added a comment -

          It is now possible to create a collection with an extra parameter 'routeField'. The 'implicit' router would look into that field for routing any document. The value of the field will be the name of the shard to which the document belongs.

          If the collection is created with 'routeField', other routing params are not honored.

          This deprecates the 'shard.keys' parameter for routing queries in favor of a parameter called '_route_'. 'shard.keys' will continue to work for another release, though.

          ASF subversion and git services added a comment -

          Commit 1510420 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1510420 ]

          updating CHANGES.txt regarding deprecation of 'shard.keys' param SOLR-5017

          ASF subversion and git services added a comment -

          Commit 1510421 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1510421 ]

          updating CHANGES.txt regarding deprecation of 'shard.keys' param SOLR-5017

          Noble Paul added a comment -

          Supports routeField in the compositeId router.

          ASF subversion and git services added a comment -

          Commit 1513356 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1513356 ]

          SOLR-5017 support for routeField in CompositeId router also

          ASF subversion and git services added a comment -

          Commit 1513357 from Noble Paul in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1513357 ]

          SOLR-5017 support for routeField in CompositeId router also

          Noble Paul added a comment -

          A parameter called 'routeField' is supported in both routers. If routeField is 'x', all documents inserted must have a value for the field 'x'.

          The semantics of querying remain the same: the _route_ param can be used to limit the search to a given shard (or shards).
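
As a rough illustration of limiting a query with `_route_`, the sketch below only builds a request URL. The helper `select_url`, the host, and the collection name are assumptions made up for the example; how the route value is interpreted (shard name or value to hash) depends on the collection's router.

```python
from urllib.parse import urlencode


def select_url(base_url, query, route=None):
    """Build a select URL, adding _route_ only when a route value is given."""
    params = {"q": query}
    if route is not None:
        params["_route_"] = route
    return base_url + "/select?" + urlencode(params)


# Hypothetical host and collection name.
base = "http://localhost:8983/solr/collection1"
routed = select_url(base, "title:solr", route="shard1")
unrouted = select_url(base, "title:solr")
assert "_route_=shard1" in routed
assert "_route_" not in unrouted
```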

          Jack Krupansky added a comment -

          Is this feature intended for both traditional Solr sharding as well as SolrCloud?

          If it is intended for SolrCloud as well, how does delete-by-id work, in the sense that the delete command does not include the field needed to determine routing?

          Noble Paul added a comment -

          This is only for SolrCloud.

          deleteById/getById expect the _route_ param (or the deprecated shard.keys); without it, the request has to be fanned out as a distributed request. It works without complaining, but will be inefficient.
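
The trade-off described above can be sketched as a shard-selection step. This is not Solr code: `target_shards` is a hypothetical helper, and `resolve` is an assumed callback that maps a route value to a shard name (identity for an implicit-style router, a hash lookup for a hash-based one).

```python
def target_shards(all_shards, resolve, route=None):
    """Return the shards a get-by-id or delete-by-id must be sent to."""
    if route is None:
        # No routing information: fan out to every shard (works, but costly).
        return list(all_shards)
    # Routing information present: exactly one shard needs to be contacted.
    return [resolve(route)]


shards = ["shard1", "shard2", "shard3"]
assert target_shards(shards, lambda r: r, route="shard2") == ["shard2"]
assert target_shards(shards, lambda r: r) == shards
```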

          Shalin Shekhar Mangar added a comment -

          Shard splitting doesn't support collections configured with a hash router and routeField. I'll put up a test and fix.

          Adrien Grand added a comment -

          4.5 release -> bulk close


            People

            • Assignee:
              Noble Paul
              Reporter:
              Noble Paul
            • Votes:
              1
              Watchers:
              11
