Solr
  1. Solr
  2. SOLR-3862

add "remove" as update option for atomically removing a value from a multivalued field

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-BETA
    • Fix Version/s: 4.9, Trunk
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      Currently you can atomically "add" a value to a multivalued field. It would be useful to be able to "remove" a value from a multivalued field.

      When you "set" a multivalued field to null, it destroys all values.

      1. SOLR-3862-4.patch
        9 kB
        Alaknantha
      2. SOLR-3862-3.patch
        7 kB
        Steven Bower
      3. SOLR-3862-2.patch
        7 kB
        Steven Bower
      4. SOLR-3862.patch
        2 kB
        Jim Musil
      5. SOLR-3862.patch
        7 kB
        Alaknantha
      6. SOLR-3862.patch
        12 kB
        Erick Erickson
      7. SOLR-3862.patch
        14 kB
        Erick Erickson
      8. SOLR-3862.patch
        13 kB
        Erick Erickson

        Issue Links

          Activity

          Hide
          Otis Gospodnetic added a comment -

          I think Michael Garski and somebody else have a separate, existing JIRA issue for this...

          Show
          Otis Gospodnetic added a comment - I think Michael Garski and somebody else have a separate, existing JIRA issue for this...
          Hide
          Michael Garski added a comment -

          I believe you may be thinking of someone else Otis - I have not done anything with updatable fields.

          Show
          Michael Garski added a comment - I believe you may be thinking of someone else Otis - I have not done anything with updatable fields.
          Hide
          Hoss Man added a comment -

          something to consider is the semantics: should it be "remove by position" in the list, or should it be "remove by value" and if it's "remove by value" then what happens if the value occurs multiple times in the list?

          Show
          Hoss Man added a comment - something to consider is the semantics: should it be "remove by position" in the list, or should it be "remove by value" and if it's "remove by value" then what happens if the value occurs multiple times in the list?
          Hide
          Jim Musil added a comment - - edited

          Personally, I'd like it to be remove by value. Even better, would be remove by regex.

          I think the most intuitive method would be to remove globally from the list.

          Show
          Jim Musil added a comment - - edited Personally, I'd like it to be remove by value. Even better, would be remove by regex. I think the most intuitive method would be to remove globally from the list.
          Hide
          Jim Musil added a comment -

          So, I needed to hack this together to suit my needs. I'm not sure if anyone else would find this useful, but I've added a "removeAll" option for atomic updates. It just uses String.removeAll() on each value, so supplying regex as the value will work. I've never submitted a patch before, so please forgive me if I've done this wrong.

          Show
          Jim Musil added a comment - So, I needed to hack this together to suit my needs. I'm not sure if anyone else would find this useful, but I've added a "removeAll" option for atomic updates. It just uses String.removeAll() on each value, so supplying regex as the value will work. I've never submitted a patch before, so please forgive me if I've done this wrong.
          Hide
          Jim Musil added a comment -

          first stab

          Show
          Jim Musil added a comment - first stab
          Hide
          Andreas Kleiber added a comment -

          I need this aswell. We have a message based system to generate / update our solr index. So an option to remove an entry of a multiValued field by value is very nice.
          I would appreciate if this will be merged.

          Show
          Andreas Kleiber added a comment - I need this aswell. We have a message based system to generate / update our solr index. So an option to remove an entry of a multiValued field by value is very nice. I would appreciate if this will be merged.
          Hide
          Steven Bower added a comment -

          This is great... I think we should change the operation to "remove" instead of "removeAll" .. while removeAll is very java-esq i think it would be easier for more people to understand to just have it be "remove"...

          Show
          Steven Bower added a comment - This is great... I think we should change the operation to "remove" instead of "removeAll" .. while removeAll is very java-esq i think it would be easier for more people to understand to just have it be "remove"...
          Hide
          Steven Bower added a comment -

          Actually maybe I'm understanding this feature slightly differently. Is the intent to be able to remove patterns from field values or is the goal to be able to remove entire values? This patch kind of does both, however I think that for the case where you might have many field values per field or a large list of values to remove it may be more efficient to forgo the regex stuff and do straight "remove". Additionally for the regex case maybe having a "replaceAll" type function that would let you replace a regex within a field value would be more generalized functionality (and then remove the value if the new string is empty).

          some thoughts.. I will post a more tightened up "remove" patch tomorrow..

          Show
          Steven Bower added a comment - Actually maybe I'm understanding this feature slightly differently. Is the intent to be able to remove patterns from field values or is the goal to be able to remove entire values? This patch kind of does both, however I think that for the case where you might have many field values per field or a large list of values to remove it may be more efficient to forgo the regex stuff and do straight "remove". Additionally for the regex case maybe having a "replaceAll" type function that would let you replace a regex within a field value would be more generalized functionality (and then remove the value if the new string is empty). some thoughts.. I will post a more tightened up "remove" patch tomorrow..
          Hide
          Steven Bower added a comment -

          This patch works in a slightly more restrictive way. It requires an entire field value to match the pattern specified for it to be removed. Also uses a precompiled pattern in order to prevent recompilation for each field value (esp bad with lots of field values). It also renames the option "remove" per my previous comment.

          I think it would be useful to add a "replace" option that allows for arbitrary replacement of text in a field (or removal if it results in an empty string, like the initial patch here). The only issue is how would one specify the "replacement" text as part of the update.

          Show
          Steven Bower added a comment - This patch works in a slightly more restrictive way. It requires an entire field value to match the pattern specified for it to be removed. Also uses a precompiled pattern in order to prevent recompilation for each field value (esp bad with lots of field values). It also renames the option "remove" per my previous comment. I think it would be useful to add a "replace" option that allows for arbitrary replacement of text in a field (or removal if it results in an empty string, like the initial patch here). The only issue is how would one specify the "replacement" text as part of the update.
          Hide
          Steven Bower added a comment -

          oh also the patch has a simple unit test now that exercises this new functionality plus the existing add/set functionality.

          Show
          Steven Bower added a comment - oh also the patch has a simple unit test now that exercises this new functionality plus the existing add/set functionality.
          Hide
          Steven Bower added a comment -

          Minor update removing a spurious println

          Show
          Steven Bower added a comment - Minor update removing a spurious println
          Hide
          Alaknantha added a comment -

          The original patches have a bug when invoked in the zoo keeper mode. The input field value comes in as a list where as bypassing ZK and directly hitting Solr, it comes in as a String to the patch code seqment to handle the atomic updates. Fixed the bug in the patch to use Collection.removeAll to remove the list A from list B instead of using regular expression matching and it works correctly.

          Show
          Alaknantha added a comment - The original patches have a bug when invoked in the zoo keeper mode. The input field value comes in as a list where as bypassing ZK and directly hitting Solr, it comes in as a String to the patch code seqment to handle the atomic updates. Fixed the bug in the patch to use Collection.removeAll to remove the list A from list B instead of using regular expression matching and it works correctly.
          Hide
          Alaknantha added a comment -

          SOLR-3862-4.patch is the patch with the fix.

          Show
          Alaknantha added a comment - SOLR-3862 -4.patch is the patch with the fix.
          Hide
          Erick Erickson added a comment -

          Does anyone have any objections for committing this? I'm running precommit and tests this evening, if there are no objections I'll commit this early next week.

          I took a pretty quick look at the code and it seems OK, but I'd love to have someone who knows the code better take a look.

          I'll also put up a patch without the commented-out code that looks like a leftover.

          One note. It's a bit easier if people always put up a patch with the same name, SOLR-3862.patch in this case. Only the most recent one will be blue, the rest will be gray. No big deal, just for future reference.

          Show
          Erick Erickson added a comment - Does anyone have any objections for committing this? I'm running precommit and tests this evening, if there are no objections I'll commit this early next week. I took a pretty quick look at the code and it seems OK, but I'd love to have someone who knows the code better take a look. I'll also put up a patch without the commented-out code that looks like a leftover. One note. It's a bit easier if people always put up a patch with the same name, SOLR-3862 .patch in this case. Only the most recent one will be blue, the rest will be gray. No big deal, just for future reference.
          Hide
          Yonik Seeley added a comment - - edited

          Does anyone have any objections for committing this?

          Is it finished? It doesn't look like there are tests for all the functionality. The debug logging statements should probably go as well.
          Also, it's nice to know what is being committed (as in, what is the API?) to enable feedback without having to parse the code to figure it out.

          Show
          Yonik Seeley added a comment - - edited Does anyone have any objections for committing this? Is it finished? It doesn't look like there are tests for all the functionality. The debug logging statements should probably go as well. Also, it's nice to know what is being committed (as in, what is the API?) to enable feedback without having to parse the code to figure it out.
          Hide
          Erick Erickson added a comment -

          Alaknantha Can you address Yonik's points please?

          Show
          Erick Erickson added a comment - Alaknantha Can you address Yonik's points please?
          Hide
          Alaknantha added a comment -

          Since we are only interested in the "remove" functionality, I removed "replace" code segment from this patch. The JUnit Test case "AtomicUpdatesTest.java" tests the newly added "remove" functionality and the existing "set" and "add" also.
          Could you please let me know where to update the API documentation and I will do that? I am attaching the updated patch SOLR-3862.patch

          Show
          Alaknantha added a comment - Since we are only interested in the "remove" functionality, I removed "replace" code segment from this patch. The JUnit Test case "AtomicUpdatesTest.java" tests the newly added "remove" functionality and the existing "set" and "add" also. Could you please let me know where to update the API documentation and I will do that? I am attaching the updated patch SOLR-3862 .patch
          Hide
          Yonik Seeley added a comment -

          Could you please let me know where to update the API documentation and I will do that?

          Updating actual documentation wouldn't be appropriate since nothing has been committed yet.

          My API comment was because Erick indicated he was going to commit a patch, but I couldn't tell what the intent of the patch was. It's hard to review a patch when you don't know what the patch is trying to do. I took a quick look at the code, and was confused by seeing regular expressions, and apparently two commands: "replace" and "remove".

          For something like this, implementation is easy to fix later, interface is much less so. So lets take a minute to talk about what interface we want, and give examples.

          Example of interface as I understand it (in JSON). It's just like the other "atomic update" field modifiers:

          A field like so will add a value (this is existing functionality):
           "cat"  : {"add":"Cyberpunk"}
          A field like so will remove all values that match the given pattern "Cyberpunk" (this patch implements):
           "cat"  : {"remove":"Cyberpunk"}
          Note that the specified value is a regular expression, so the following will remove all values starting with "Cyber"
           "cat"  : {"remove":"Cyber.*"}
          The regular expression must match the whole value for that value to be removed.
          
          Full command:
          curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
          [
           {"id"    : "book1",
            "cat"   : {"remove":"Cyberpunk"}
           }
          ]'
          

          Questions:

          • Do we want a way to specify the removal of multiple values?
            Perhaps "remove" : [ "A","B","C" ]
          • What are the downsides to using regex? Someone may not realize that the values being used are regular expressions until they are in production and values that accidentally have wildcards in them are used? Or they may simply forget to do wildcard escaping code since everything would "just work" until they did encounter them?
          • Perhaps we want a separate way to specify "value" vs "regex". I assume "value" will be a much more common usecase than regex (although I do like the power that regex brings).
          Show
          Yonik Seeley added a comment - Could you please let me know where to update the API documentation and I will do that? Updating actual documentation wouldn't be appropriate since nothing has been committed yet. My API comment was because Erick indicated he was going to commit a patch, but I couldn't tell what the intent of the patch was. It's hard to review a patch when you don't know what the patch is trying to do. I took a quick look at the code, and was confused by seeing regular expressions, and apparently two commands: "replace" and "remove". For something like this, implementation is easy to fix later, interface is much less so. So lets take a minute to talk about what interface we want, and give examples. Example of interface as I understand it (in JSON). It's just like the other "atomic update" field modifiers: A field like so will add a value ( this is existing functionality): "cat" : { "add" : "Cyberpunk" } A field like so will remove all values that match the given pattern "Cyberpunk" ( this patch implements ): "cat" : { "remove" : "Cyberpunk" } Note that the specified value is a regular expression, so the following will remove all values starting with "Cyber" "cat" : { "remove" : "Cyber.*" } The regular expression must match the whole value for that value to be removed. Full command: curl http: //localhost:8983/solr/update -H 'Content-type:application/json' -d ' [ { "id" : "book1" , "cat" : { "remove" : "Cyberpunk" } } ]' Questions: Do we want a way to specify the removal of multiple values? Perhaps "remove" : [ "A","B","C" ] What are the downsides to using regex? Someone may not realize that the values being used are regular expressions until they are in production and values that accidentally have wildcards in them are used? Or they may simply forget to do wildcard escaping code since everything would "just work" until they did encounter them? Perhaps we want a separate way to specify "value" vs "regex". I assume "value" will be a much more common usecase than regex (although I do like the power that regex brings).
          Hide
          Alaknantha added a comment - - edited

          Yonik, Are you looking at the older patch? https://issues.apache.org/jira/secure/attachment/12637687/SOLR-3862.patch is my latest patch where I got rid of the regular expression usage.

            • 1) Do we want a way to specify the removal of multiple values?
              Perhaps "remove" : [ "A","B","C" ]
              • Yes, this patch supports removal of multiple values.
            • 2) What are the downsides to using regex? Someone may not realize that the values being used are regular expressions until they are in production and values that accidentally have wildcards in them are used? Or they may simply forget to do wildcard escaping code since everything would "just work" until they did encounter them?

          When invoked in the zoo keeper mode, the input field value comes in as a list whereas bypassing ZK and directly hitting Solr, it comes in as a String to the patch code segment to handle the atomic updates using field modifier "remove".

          The original patch SOLR-3862-3.patch creates a regular expression pattern on the incoming field value to be removed. The pattern is used to create a matcher and iterate through the original list of values. If the incoming field value is a list, the matcher does not match correctly because of the additional parenthesis like below:

          In the below example, "CA" is sent in as a input field value to be removed. Since the patch code was using the toString(), the list values are encapsulated within the parenthesis like [CA]. So, this pattern can match only to "C" or "A" and not to "CA". So, I had to get into the Solr code to troubleshoot this issue

          Hitting Solr using Zoo Keeper:
          Pattern p = Pattern.compile("[CA]");
          Matcher m = p.matcher("CA");
          boolean b = m.matches(); returns false and so the remove does not work if the incoming field value comes in a list.

          Hitting Solr directly:
          Pattern p = Pattern.compile("CA");
          Matcher m = p.matcher("CA");
          boolean b = m.matches(); returns true and so the remove works if the incoming field value comes in as a String.

            • 3) Perhaps we want a separate way to specify "value" vs "regex". I assume "value" will be a much more common usecase than regex (although I do like the power that regex brings).
              • I agree with you that "value" is a most common use case and that's the reason, I got rid of the "regex".

          Please review this patch https://issues.apache.org/jira/secure/attachment/12637687/SOLR-3862.patch that has my fix.

          Show
          Alaknantha added a comment - - edited Yonik, Are you looking at the older patch? https://issues.apache.org/jira/secure/attachment/12637687/SOLR-3862.patch is my latest patch where I got rid of the regular expression usage. 1) Do we want a way to specify the removal of multiple values? Perhaps "remove" : [ "A","B","C" ] Yes, this patch supports removal of multiple values. 2) What are the downsides to using regex? Someone may not realize that the values being used are regular expressions until they are in production and values that accidentally have wildcards in them are used? Or they may simply forget to do wildcard escaping code since everything would "just work" until they did encounter them? Yes, I ran into this regular expression issue when I tried to use the field modifier "remove" provided by the older patch https://issues.apache.org/jira/secure/attachment/12589605/SOLR-3862-3.patch in my project. That's why I got rid of the usage of the regular expression and use the "value" comparisons. Here is the issue that I ran into: When invoked in the zoo keeper mode, the input field value comes in as a list whereas bypassing ZK and directly hitting Solr, it comes in as a String to the patch code segment to handle the atomic updates using field modifier "remove". The original patch SOLR-3862 -3.patch creates a regular expression pattern on the incoming field value to be removed. The pattern is used to create a matcher and iterate through the original list of values. If the incoming field value is a list, the matcher does not match correctly because of the additional parenthesis like below: In the below example, "CA" is sent in as a input field value to be removed. Since the patch code was using the toString(), the list values are encapsulated within the parenthesis like [CA] . So, this pattern can match only to "C" or "A" and not to "CA". So, I had to get into the Solr code to troubleshoot this issue Hitting Solr using Zoo Keeper: Pattern p = Pattern.compile(" [CA] "); Matcher m = p.matcher("CA"); boolean b = m.matches(); returns false and so the remove does not work if the incoming field value comes in a list. Hitting Solr directly: Pattern p = Pattern.compile("CA"); Matcher m = p.matcher("CA"); boolean b = m.matches(); returns true and so the remove works if the incoming field value comes in as a String. 3) Perhaps we want a separate way to specify "value" vs "regex". I assume "value" will be a much more common usecase than regex (although I do like the power that regex brings). I agree with you that "value" is a most common use case and that's the reason, I got rid of the "regex". Please review this patch https://issues.apache.org/jira/secure/attachment/12637687/SOLR-3862.patch that has my fix.
          Hide
          Erick Erickson added a comment -

          Just though I should add that I'm in no great hurry to commit this, but I thought was some nice functionality. I happen to have some time this weekend so was getting around to it (finally). I'm trying to carve out some time to look at people's patches on a more regular basis...

          It looks like sometime next week then to allow for some more refinements and I'll look more carefully.

          Alaknatha:

          If you want to put the regex stuff back in I can help beef up the tests. I think Yonik's comment about the power of regex is a good one. What do you think about putting that back and calling it something like "removeregex"? I really like the idea of requiring the user to specify the intent, regexes can be dangerous...

          Show
          Erick Erickson added a comment - Just though I should add that I'm in no great hurry to commit this, but I thought was some nice functionality. I happen to have some time this weekend so was getting around to it (finally). I'm trying to carve out some time to look at people's patches on a more regular basis... It looks like sometime next week then to allow for some more refinements and I'll look more carefully. Alaknatha: If you want to put the regex stuff back in I can help beef up the tests. I think Yonik's comment about the power of regex is a good one. What do you think about putting that back and calling it something like "removeregex"? I really like the idea of requiring the user to specify the intent, regexes can be dangerous...
          Hide
          Yonik Seeley added a comment -

          Yonik, Are you looking at the older patch? https://issues.apache.org/jira/secure/attachment/12637687/SOLR-3862.patch is my latest patch where I got rid of the regular expression usage.

          Heh. This is exactly what I'm talking about. It can be hard to reconstruct the current state of the API by going through all of the previous comments, or by reading a patch - both take unnecessary time/work and are error prone.

          This is the current API (in JSON) currently being proposed:

          A field like so will remove all values that match the given value "Cyberpunk":
           "cat"  : {"remove":"Cyberpunk"}
          Multiple values can be specified for removal:
           "numbers"  : {"remove":[3,7]}
          If the original field contained [8,3,3,5,7,1] then the remove specified above will result in [8,5,1]
          
          It's not an error if no values are actually removed or if the field does not exist.
          
          Full command example:
          curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
          [
           {"id"    : "book1",
            "cat"   : {"remove":"Cyberpunk"}
           }
          ]'
          

          FWIW, I'm fine with this latest API.

          Show
          Yonik Seeley added a comment - Yonik, Are you looking at the older patch? https://issues.apache.org/jira/secure/attachment/12637687/SOLR-3862.patch is my latest patch where I got rid of the regular expression usage. Heh. This is exactly what I'm talking about. It can be hard to reconstruct the current state of the API by going through all of the previous comments, or by reading a patch - both take unnecessary time/work and are error prone. This is the current API (in JSON) currently being proposed: A field like so will remove all values that match the given value "Cyberpunk" : "cat" : { "remove" : "Cyberpunk" } Multiple values can be specified for removal: "numbers" : { "remove" :[3,7]} If the original field contained [8,3,3,5,7,1] then the remove specified above will result in [8,5,1] It's not an error if no values are actually removed or if the field does not exist. Full command example: curl http: //localhost:8983/solr/update -H 'Content-type:application/json' -d ' [ { "id" : "book1" , "cat" : { "remove" : "Cyberpunk" } } ]' FWIW, I'm fine with this latest API.
          Hide
          Alaknantha added a comment -

          Erick:

          Sure, I will add back the original code to handle remove by regex and will call it "removeregex". Do you also want to add back "replace" functionality that was in the original patch?

          Show
          Alaknantha added a comment - Erick: Sure, I will add back the original code to handle remove by regex and will call it "removeregex". Do you also want to add back "replace" functionality that was in the original patch?
          Hide
          Erick Erickson added a comment - - edited

          Hmmm, I don't have strong preferences. I'm always torn between "adding this in while we're about it" and waiting for someone to chime in that they have a use-case for this functionality.

          I started out thinking that remove/add would satisfy replace, but that pair doesn't handle two use-cases that come to mind:
          1> you have multiple values that match the regex in remove that you want to update-but-preseve. Imagine your field has
          run
          running
          runners
          and you wanted the result to be
          run
          run$
          run$
          I don't see how remove/add would accomplish this in the regex case (say run.*). The remove would delete "running" and "runners" and you wouldn't know how many values to re-add.

          2> Similarly, if you have two multiValued fields that are expected to be kept in parallel and wanted to update only one field. I'm assuming that the remove from one field and subsequent add would put the new value at the end rather then the original position.

          So let's put it back and beef up the tests. Yonik Seeley any thoughts?

          Show
          Erick Erickson added a comment - - edited Hmmm, I don't have strong preferences. I'm always torn between "adding this in while we're about it" and waiting for someone to chime in that they have a use-case for this functionality. I started out thinking that remove/add would satisfy replace, but that pair doesn't handle two use-cases that come to mind: 1> you have multiple values that match the regex in remove that you want to update-but-preseve. Imagine your field has run running runners and you wanted the result to be run run$ run$ I don't see how remove/add would accomplish this in the regex case (say run.*). The remove would delete "running" and "runners" and you wouldn't know how many values to re-add. 2> Similarly, if you have two multiValued fields that are expected to be kept in parallel and wanted to update only one field. I'm assuming that the remove from one field and subsequent add would put the new value at the end rather then the original position. So let's put it back and beef up the tests. Yonik Seeley any thoughts?
          Hide
          Yonik Seeley added a comment -

          So let's put it back and beef up the tests. Yonik Seeley any thoughts?

          Perhaps I'm starting to sound monotonous, but I want to see a proposal, not a pointer to a patch

          Show
          Yonik Seeley added a comment - So let's put it back and beef up the tests. Yonik Seeley any thoughts? Perhaps I'm starting to sound monotonous, but I want to see a proposal, not a pointer to a patch
          Hide
          Erick Erickson added a comment - - edited

          Right, it would look something like this:

          
          Regular expressions can also be used to remove values by specifying removeregex, as:
          
          "activity" : {"removeregex" : "run.+"}
          
          Given a document with a field having values
          
          [run, running, runs runner], the above would leave only [run].
          
          Multiple regexes can be specified, as
          "activity" : {"removeregex" : ["run.+", "ran.*"] }
          
          So a document with field values [run runner ran ransack] would be reduced to
          [run]
          
          There are two new commands, "replace" and "replaceregex"
          
          "activity" : {"replace" : {"pat1" : "replacement1"}  }
          
          A document with "activity" of ["pat", "pat1", "pat1more"] would result in
          
          ["pat", "replacement1", "pat1more"]
          
          Note that order is preserved. That is, if the pattern being replaced is the third value in a multivalued field with 6 values, the replacement will also be in the third position.
          
          Multiple patterns and replacements can be specified, as
          "activity" : {"replace" :  [ {"pat1" : "replacement1"}, {"pat2" : "replacement2"} ]  }
          
          replaceregex is similar, but all the "pat*" examples above may be regular expressions, as
          
          "activity" : {"replaceregex" : {"pat.+" : "replacement1"}  }
          and
          "activity" : {"replaceregex" :   [ {"pat.+" : "replacement1"}, {"part.*" : "replacement2"} ]  }
          
          Currently, back references are _not_ supported.
          
          Patterns and replacements are _not_ chained. That is, the following
          
          "activity" : {"replaceregex" :   [ {"pat.+" : "replacement1"}, {"repl.+" : "replacement2"} ]  }
          applied to a field containing
          ["patterns"]
          would yield ["replacement1"] rather than ["replacement2"]
          
          

          Hmmmm, altogether I'm not sure how I feel about allowing multiple replaceregex's to be specified, the syntax is kinda ugly.

          I'm not particularly wedded to the regex stuff but I can imagine it to be useful. I'd also be fine if the replace were left out.

          Show
          Erick Erickson added a comment - - edited Right, it would look something like this: Regular expressions can also be used to remove values by specifying removeregex, as: "activity" : { "removeregex" : "run.+" } Given a document with a field having values [run, running, runs runner], the above would leave only [run]. Multiple regexes can be specified, as "activity" : { "removeregex" : [ "run.+" , "ran.*" ] } So a document with field values [run runner ran ransack] would be reduced to [run] There are two new commands, "replace" and "replaceregex" "activity" : { "replace" : { "pat1" : "replacement1" } } A document with "activity" of [ "pat" , "pat1" , "pat1more" ] would result in [ "pat" , "replacement1" , "pat1more" ] Note that order is preserved. That is, if the pattern being replaced is the third value in a multivalued field with 6 values, the replacement will also be in the third position. Multiple patterns and replacements can be specified, as "activity" : { "replace" : [ { "pat1" : "replacement1" }, { "pat2" : "replacement2" } ] } replaceregex is similar, but all the "pat*" examples above may be regular expressions, as "activity" : { "replaceregex" : { "pat.+" : "replacement1" } } and "activity" : { "replaceregex" : [ { "pat.+" : "replacement1" }, { "part.*" : "replacement2" } ] } Currently, back references are _not_ supported. Patterns and replacements are _not_ chained. That is, the following "activity" : { "replaceregex" : [ { "pat.+" : "replacement1" }, { "repl.+" : "replacement2" } ] } applied to a field containing [ "patterns" ] would yield [ "replacement1" ] rather than [ "replacement2" ] Hmmmm, altogether I'm not sure how I feel about allowing multiple replaceregex's to be specified, the syntax is kinda ugly. I'm not particularly wedded to the regex stuff but I can imagine it to be useful. I'd also be fine if the replace were left out.
          Hide
          Alaknantha added a comment - - edited

          Erick: None of the existing patches support multiple regex's to be specified for "remove" and "replace". Would you like me to code that along with Junits and provide a patch?

          Show
          Alaknantha added a comment - - edited Erick: None of the existing patches support multiple regex's to be specified for "remove" and "replace". Would you like me to code that along with Junits and provide a patch?
          Hide
          Erick Erickson added a comment -

          Don't quite know. In case you're wondering, I've been a bit under the weather so haven't been looking for a couple of days.

          Even though I mentioned that back references are not supported, I'd rank supporting that over multiple regexes.

          Its not clear to me how useful multiple regexes really would be though. I suppose if we're trying to get the interface in place it would be best to clarify that up front though. Straw-man. Let's put multiple regexes in as in my example.

          Yonik Seeley got an opinion here?

          Show
          Erick Erickson added a comment - Don't quite know. In case you're wondering, I've been a bit under the weather so haven't been looking for a couple of days. Even though I mentioned that back references are not supported, I'd rank supporting that over multiple regexes. Its not clear to me how useful multiple regexes really would be though. I suppose if we're trying to get the interface in place it would be best to clarify that up front though. Straw-man. Let's put multiple regexes in as in my example. Yonik Seeley got an opinion here?
          Hide
          Erick Erickson added a comment -

          New version of the patch, some minor refactoring moved the 'remove' and 'inc' operations to their own methods. Hardened the tests, they were failing randomly since the same document ID was being re-used without deleting docs between tests.

          Unless there are objections, I'll commit this sometime this coming week. All tests pass.

          Show
          Erick Erickson added a comment - New version of the patch, some minor refactoring moved the 'remove' and 'inc' operations to their own methods. Hardened the tests, they were failing randomly since the same document ID was being re-used without deleting docs between tests. Unless there are objections, I'll commit this sometime this coming week. All tests pass.
          Hide
          Erick Erickson added a comment -

          I'll add a slightly updated patch in a few. I added a bit more testing, and included CHANGES.txt.

          I think the regex question can be moved to another JIRA and I also think a new removeregex is the cleanest way to distinguish between the two, we can debate in a new JIRA...

          Erick

          Show
          Erick Erickson added a comment - I'll add a slightly updated patch in a few. I added a bit more testing, and included CHANGES.txt. I think the regex question can be moved to another JIRA and I also think a new removeregex is the cleanest way to distinguish between the two, we can debate in a new JIRA... Erick
          Hide
          Erick Erickson added a comment -

          Promised patch, I'll commit after another set of test runs.

          Show
          Erick Erickson added a comment - Promised patch, I'll commit after another set of test runs.
          Hide
          Erick Erickson added a comment -

          Same patch without a tab (failed precommit)

          Show
          Erick Erickson added a comment - Same patch without a tab (failed precommit)
          Hide
          Erick Erickson added a comment -

          Rats! Failed to mention the JIRA in the checkin
          trunk revision - 1588385

          Show
          Erick Erickson added a comment - Rats! Failed to mention the JIRA in the checkin trunk revision - 1588385
          Hide
          ASF subversion and git services added a comment -

          Commit 1588391 from erick@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1588391 ]

          SOLR-3862 add 'remove' as update option for atomically removing a value from a multivalued field. Note that revision 1588385 is for this patch on trunk

          Show
          ASF subversion and git services added a comment - Commit 1588391 from erick@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1588391 ] SOLR-3862 add 'remove' as update option for atomically removing a value from a multivalued field. Note that revision 1588385 is for this patch on trunk
          Hide
          Steven Bower added a comment -

          Added ticket for regex based removal SOLR-6772

          Show
          Steven Bower added a comment - Added ticket for regex based removal SOLR-6772

            People

            • Assignee:
              Erick Erickson
              Reporter:
              Jim Musil
            • Votes:
              18 Vote for this issue
              Watchers:
              19 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development