SOLR-3141

Deprecate OPTIMIZE command in Solr

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5
    • Fix Version/s: 4.8
    • Component/s: update
    • Labels:

      Description

      Background: LUCENE-3454 renames optimize() as forceMerge(). Please read that issue first.

      Now that optimize() is rarely necessary anymore, and renamed in Lucene APIs, what should be done with Solr's ancient optimize command?

      1. SOLR-3141.patch
        0.6 kB
        Yonik Seeley
      2. SOLR-3141.patch
        1.0 kB
        Yonik Seeley

        Issue Links

          Activity

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Shawn Heisey added a comment -

          Before I read HossMan's proposals thoroughly, I had these thoughts:


          I would support removing the optimize button from the GUI, or at least removing it from the Overview page. Keeping it on the CoreAdmin page would not be a bad thing, optionally with at least one confirmation dialog that reminds the user that optimization is not normally required.

          Deprecating "optimize" from the GUI and the API in favor of forceMerge would not make me upset either, as long as it continued to work through all 4.x versions. Based on what happened with waitFlush and the PHP Solr API packages after the 4.0 release, this is a dangerous path, but if we kept optimize around until 6.0, perhaps it might be OK.

          After reading the proposals, I think there might be a small amount of merit in my ideas, but his ideas are safer.

          Jan Høydahl added a comment -

          Some time has passed, what to do with this? Was the log warning patch committed?

          Jan Høydahl added a comment -

          So, any thoughts on Hoss Man's proposals above?

          Mark Miller added a comment -

          No assignee or action in some time - pushing to 4.2 - if it's a mistake, please bring it back into 4.1.

          Dotan Cohen added a comment -

          The problem with optimize is not the name. The problem is that the Solr admin panel suggests that we optimize often. In a Solr admin panel click the name of your index ("collection1" for instance) and what do you see? A big "Optimize Now" button alongside a graphical indicator that the index is not optimized.

          Erick Erickson added a comment -

          Does SolrJ then need to have a forceMerge rather than optimize? (Which isn't deprecated in 3.x, BTW.)

          Hoss Man added a comment -

          Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

          email notification suppressed to prevent mass-spam
          pseudo-unique token identifying these issues: hoss20120321nofix36

          Hoss Man added a comment -

          I don't have the energy to really get in depth with all of the discussion that's taken place so far, so I'll try to keep my comments brief:

          0) I'm a fan of the patch currently attached.

          1) I largely agree with most of Yonik's points – this is a documentation problem first and foremost. Saying that all people who optimize are wrong is ridiculous, and breaking something that has use and value for a set of people just because some other set of people are using it foolishly seems really absurd.

          2) Changing the "optimize" command to be a no-op with a warning logged, or a failure, where the documented "fix" to regain old behavior for people who genuinely need it is to search & replace the string "optimize" with some new string "forceMerge", seems utterly absurd to me. This is not the first time we've had a param name that people later regretted giving the name that we did – are we going to change all of them for 4.0? Unlike a method renamed in Java code, where it's easy to see how the change affects you because of compilation failures, this kind of HTTP param change is a serious pain in the ass for people with client apps written using multiple languages/libraries ... naming consistency for existing users seems far more important than having perfect names.

          3) Even if the goal is to force people to evaluate whether they really want to merge down to one segment, we have to consider how hard we make things for people when the answer is "yes". If someone is using a client library/app to talk to Solr it may not be easy/simple/possible for them to replace "optimize" with "forceMerge" or something like it w/o mucking in the internals of that library – there's no reason to piss off users like that.

          4) any discussion about renaming/removing "optimize" in the Solr HTTP APIs should really consider how that will impact a few other user visible things...

          • <listener event="postOptimize" /> hooks in solrconfig and the corresponding SolrEventListener.postOptimize method
          • SolrDeletionPolicy has options related to how many optimized indexes to keep
          • spellchecker has options relating to building on optimize (although if I remember correctly there is a bug about this being broken, so it can probably die no problem)

          5) Assuming that too many people optimize when they shouldn't, either out of ignorance or because their tools do it out of ignorance, and we want to help minimize that moving forward; and given my opinion that renaming "optimize" will only hurt people w/o actually helping the root problem – here's my straw man proposal to try and improve the situation (similar to what Jan suggested, but taking into account that we already support a "maxSegments" option when doing optimize commands) ...

          • commit the attached patch as is (it's just plain a good idea, regardless of anything else we might do)
          • change CommitUpdateCommand.maxOptimizeSegments so it defaults to "-1" and document that when the value is less than 0 it means the UpdateHandler configuration determines the value.
          • add a new <defaultOptimizeSegments/> config option to <updateHandler/> – make the UpdateHandler use that value anytime CommitUpdateCommand.maxOptimizeSegments is less than 0, and for backcompat have it default to "1" if not specified.
          • update the example configs to include <defaultOptimizeSegments>9999999</defaultOptimizeSegments> with a comment warning against the evils of over-optimization
          • change the code in Solr which deals with <optimize ... /> formatted instructions so that any SolrParams in the request with names the same as xml attributes override the attributes – ie: POST /update?maxSegments=4 with data: <optimize maxSegments="9" /> should result in a CommitUpdateCommand with maxOptimizeSegments=4

          The end result being:

          • new users who start with new configs have an UpdateHandler that is going to effectively ignore "optimize" commands that don't specify a "maxSegments"
          • nothing breaks for existing users
          • existing users who only want to allow optimize commands when "maxSegments" is specified can cut/paste that one-line <defaultOptimizeSegments/> config
          • new and existing users who want Solr to ignore all optimize commands, even when they do have a "maxSegments", can configure an invariant maxSegments=9999999 param on the affected request handlers
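          The override and fallback rules in this straw man could be sketched as a tiny pure function. This is an editor's illustration only: the class and parameter names are invented, not the actual Solr code.

```java
// Sketch of the maxSegments resolution order in the straw-man proposal above.
// Names here are illustrative; only CommitUpdateCommand.maxOptimizeSegments
// and <defaultOptimizeSegments/> are taken from the proposal itself.
public class OptimizeDefaults {

    /**
     * Resolve the effective maxSegments for an optimize command:
     * a request param wins over the XML attribute, and any value below 0
     * falls back to the <defaultOptimizeSegments/> configured on the
     * update handler (which the proposal defaults to 1 for back-compat).
     */
    public static int effectiveMaxSegments(Integer requestParam,
                                           int xmlAttribute,
                                           int configuredDefault) {
        int fromCommand = (requestParam != null) ? requestParam : xmlAttribute;
        return (fromCommand < 0) ? configuredDefault : fromCommand;
    }

    public static void main(String[] args) {
        // POST /update?maxSegments=4 with <optimize maxSegments="9"/>: the param wins.
        System.out.println(effectiveMaxSegments(4, 9, 1));           // 4
        // No param, no attribute (-1 sentinel): the configured default applies.
        System.out.println(effectiveMaxSegments(null, -1, 9999999)); // 9999999
    }
}
```

          With the example config's 9999999 default, a bare optimize becomes an effective no-op merge while an explicit maxSegments still works, which is the behavior the bullet list above describes.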
          Yonik Seeley added a comment -

          It's a terrible name at this point. Why are we stuck with terrible?

          I don't agree it's a terrible name I guess.

          Mark Miller added a comment -

          I think when it comes to API breaks, trying to say we can't fix this one because we can't fix every old little thing doesn't jibe. The name is clearly not a good one, and the call will not be the right move for most people that upgrade to 4. Having to rethink that will be doing 99% of users a favor. Changing the name will be doing all future users a favor.

          I think 4 should be about getting things right without clinging to old baggage. We are not talking about the update or request APIs here. We are talking about a very expensive, very poorly named, very-little-returns API call that is certainly overused (and much of the overuse is not going to end up on Google).

          Making those that upgrade rethink optimize seems like just what the doctor ordered - we can add it to the release announcement, the release notes, etc. Even though I know exactly what this does, even though I know the price/benefits - I still want to call this thing at least once a week. It's a terrible name at this point. Why are we stuck with terrible?

          Yonik Seeley added a comment -

          Add the new forceMerge feature, but instead of true/false, it takes N as number of segments, i.e. &forceMerge=2. This adds value to Solr's API

          But we already have this functionality as a maxSegments parameter to optimize.
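          For reference, the existing functionality being referred to is the maxSegments attribute on the optimize command itself; a minimal XML update message posted to the /update handler looks something like:

```xml
<!-- Merge down to at most 2 segments instead of the default of 1 -->
<optimize maxSegments="2"/>
```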

          Jan Høydahl added a comment -

          This is a much bigger real problem (because people had no soft commit and hence hard commit was the only option). We should probably open up a new issue for this one.

          SOLR-3146

          Jan Høydahl added a comment -

          @Yonik, How would you feel about this approach instead:

          • Add the new forceMerge feature, but instead of true/false, it takes N as number of segments, i.e. &forceMerge=2. This adds value to Solr's API
          • Keep the old &optimize=true API (equivalent to forceMerge=1), but let users control in solrconfig.xml how an old optimize is interpreted. The option could look like (don't mind the naming for now):
            <mainIndex>
             <oldOptimizeIsInterpretedAs>noop|noopWithLogWarning|commit|softCommit|forceMerge=N</oldOptimizeIsInterpretedAs>
            </mainIndex>
            

          Default could be "noopWithLogWarning", and nothing would happen on an attempted optimize, except logging a warning pointing people to some documentation. This will give people three choices: A) Stop using optimize if they don't need it. Problem solved. B) If they wind up really needing it, start using forceMerge=N instead. Problem solved. Or C) Change the config param to whatever suits their situation best, e.g. "forceMerge=1" would mimic old behaviour, "commit" would cause a commit to happen on optimize, and "noop" would do nothing but get rid of the log warnings, etc. This would be for people who cannot or won't change their own code.
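          As a rough illustration only (the option and its values exist solely in this proposal, not in Solr), parsing such a setting might look like:

```java
// Illustrative parser for the proposed <oldOptimizeIsInterpretedAs> values.
// The option names come from the proposal above and are NOT real Solr config.
public class OptimizeInterpretation {

    enum Mode { NOOP, NOOP_WITH_LOG_WARNING, COMMIT, SOFT_COMMIT, FORCE_MERGE }

    static final class Interpretation {
        final Mode mode;
        final int segments; // only meaningful when mode == FORCE_MERGE
        Interpretation(Mode mode, int segments) { this.mode = mode; this.segments = segments; }
    }

    static Interpretation parse(String value) {
        if (value.startsWith("forceMerge=")) {
            int n = Integer.parseInt(value.substring("forceMerge=".length()));
            return new Interpretation(Mode.FORCE_MERGE, n);
        }
        switch (value) {
            case "noop":               return new Interpretation(Mode.NOOP, -1);
            case "noopWithLogWarning": return new Interpretation(Mode.NOOP_WITH_LOG_WARNING, -1);
            case "commit":             return new Interpretation(Mode.COMMIT, -1);
            case "softCommit":         return new Interpretation(Mode.SOFT_COMMIT, -1);
            default: throw new IllegalArgumentException("Unknown optimize interpretation: " + value);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("noopWithLogWarning").mode); // the proposed default
        System.out.println(parse("forceMerge=1").segments);   // mimics the old optimize
    }
}
```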

          Yonik Seeley added a comment -

          and another paragraph about softCommit/commitWithin as preferred to explicit commit, which is also a huge mistake many people make: they over-commit!

          This is a much bigger real problem (because people had no soft commit and hence hard commit was the only option). We should probably open up a new issue for this one.

          Yonik Seeley added a comment -

          The compensation is that they are forced to again look at that code and then think about removing the call altogether.

          The proposal simply breaks existing systems (on purpose) on upgrade with no offsetting gain in functionality, just because we believe some people have made the wrong tradeoff in their app. This is not the right solution.

          We see people making what we believe to be the wrong tradeoffs all the time in Solr. One example is optimizing for query performance by pumping up cache sizes to insane levels, pumping up the heap to compensate, and then being plagued with long GC times. The answer is not to second guess everyone and break existing configurations. People will continue to make mistakes like this, and even if optimize was changed to forceMerge, you can be assured that some people will still make the wrong trade-off in the future using the new name.

          I've thought about this for a while now... please consider this my formal veto to this change.

          Jan Høydahl added a comment -

          I think 4.0 is a good train in which to do this rename, when people will anyway take a thorough new look at all the changes, and most will hopefully discover that they do not need forceMerge even if they used optimize before. And I agree, in 4.x, "optimize" should not be a silent NOOP, but instead yell loudly in the logs.

          Perhaps an official migration guide on the CMS would be helpful too when 4.0 hits the road. Such a guide would be more in-depth than the upgrading notes in CHANGES. We could have a paragraph about optimize/forceMerge, and another paragraph about softCommit/commitWithin as preferred to explicit commit, which is also a huge mistake many people make: they over-commit!

          Robert Muir added a comment -

          A slight improvement in name does not come anywhere near compensating for the pain of having countless external systems and users having to change their code for no gain in functionality.

          But I don't think things are fixed in stone: this is an open source project and it would be bad if things
          never changed. We aren't putting a gun to their head forcing them to upgrade either, so I don't understand
          the pain compensation... but it won't hold a candle to the pain all these unnecessary optimizes must be
          causing users hard disk drives.

          Uwe Schindler added a comment -

          A slight improvement in name does not come anywhere near compensating for the pain of having countless external systems and users having to change their code for no gain in functionality.

          The compensation is that they are forced to again look at that code and then think about removing the call altogether.

          Yonik Seeley added a comment -

          A slight improvement in name does not come anywhere near compensating for the pain of having countless external systems and users having to change their code for no gain in functionality.

          Uwe Schindler added a comment -

          I doubt it. And how did they find the command in the first place?

          By copying one of those "shiny" code examples I posted!

          Just to come back to programmers that should have read the documentation, but in fact did not. The best example from the above list is http://drupal.org/node/292662. Drupal is one of the most often used CMS systems (I just mention that your company also uses it for their home page) and it's installed on thousands of servers. And this tool also contains a full-text search engine (maybe your company is not using that one), but this one called commit and optimize after every update (until they fixed it). Isn't that funny? In fact Drupal users are a huge majority that don't know what their system is doing under the hood and largely depend on the fact that PHP programmers like the Drupal ones don't call optimize just because it's called optimize.

          Robert Muir added a comment -

          twice that many visit the download page... but the actual link is hard to see

          We need a huge download button.

          Let's not be a nanny state.

          I don't think of it as a nanny state; it's us fixing a mistake.
          The mistake was that this method has a poor name.

          Yonik Seeley added a comment -

          think the majority of users don't know what this command really does... we should rename it.

          I doubt it. And how did they find the command in the first place?
          The answer is documentation - wherever they learn about the command, let them know what it does.
          Let's not be a nanny state.

          Yonik Seeley added a comment -

          Creative use of Google, but it doesn't always add up. Just looking randomly at a couple:

          The vufind reference oddly states that you should optimize after updating, but it also states:

          Note: Optimizing the index can take a lot of server resources, so you should schedule your index updates and optimizations for non-peak times when possible.

          So you can see they have that very infrequent update model in mind, and they seem well aware of the cost of an optimize.

          The Stack Overflow link is someone asking how to automate commit and optimize, and how often he should optimize.

          And the archiveorange link mentions a guy optimizing, but it's certainly not clear at all that he shouldn't be... we don't know his requirements.

          Solr is at 400 downloads a day via the website (twice that many visit the download page... but the actual link is hard to see!). Yes, I'll stand by "minority".

          Robert Muir added a comment -

          I think the majority of users don't know what this command really does... we should rename it.

          optimize just begs for people to use it. If this is really controversial, lets call
          a committer vote on dev@ and see what everyone thinks.

          Hide
          Uwe Schindler added a comment - - edited

          We shouldn't penalize the majority of users who use APIs correctly due to some minority calling it when they have no idea what it does

          Minority?:

          https://github.com/mbaechler/OBM/blob/9e1c79e01fde7f78e87b125563c7e6730068e24d/ui/obminclude/of/of_indexingService.inc
          http://grokbase.com/t/lucene.apache.org/solr-user/2011/12/how-to-disable-auto-commit-and-auto-optimize-operation-after-addition-of-few-documents-through-dataimport-handler/16q7rwo6crvlzr5aoo3ic2bgd2ni
          http://support.sms-fed.com/tracker/browse/TDI-134
          http://web.archiveorange.com/archive/v/AAfXf4khqdVNtnjqzodS
          http://vufind.org/wiki/performance#index_optimization
          http://netbeans.org/bugzilla/show_bug.cgi?id=205899
          https://github.com/tonytw1/wellynews/blob/759960b7e7df6b77c9fa3791efb7da67dd27783e/src/java/nz/co/searchwellington/repositories/solr/SolrQueryService.java
          http://stackoverflow.com/questions/2787591/solr-autocommit-and-autooptimize
          http://opensource.timetric.com/sunburnt/indexmanagement.html
          http://drupal.org/node/292662
          http://blog.aisleten.com/2008/01/26/optimizing-solr-and-rails-index-in-the-background/
          http://www.searchworkings.org/forum/-/message_boards/view_message/412894#_19_message_412894
          http://code.google.com/p/kiwi/source/browse/lmf-search/src/main/java/at/newmedialab/lmf/search/services/indexing/SolrIndexingServiceImpl.java?r=fbbeec96b5ad3d31364755a88218860405393cac
          Hide
          Yonik Seeley added a comment -

          I'm against deprecating optimize. We can't change the name of every operation that people might use incorrectly (and this is one of the easiest to understand), and we shouldn't here. We shouldn't penalize the majority of users who use APIs correctly due to some minority calling it when they have no idea what it does. Being a server with a whole ecosystem of other systems that talk to us (think like a database), we have a much higher bar for back compat changes in our interfaces.

          Hide
          Uwe Schindler added a comment - - edited

          I am fine with the log messages, I just would also like to deprecate the term "optimize" and change to "forceMerge". That's all this issue is about. The above log messages would then apply to forceMerge. Of course old-style optimize would get a different message that it is deprecated and the user most likely does not want to call it.

          Hide
          Yonik Seeley added a comment -

          I just checked the Solr tutorial and saw this:
          "There is also an optimize command that does the same thing as commit, in addition to merging all index segments into a single segment, making it faster to search and causing any deleted documents to be removed."

          It would be no great loss to just remove that sentence since it's just an introduction and not a reference.
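          For context, the update messages the tutorial refers to are Solr's XML update commands. A rough sketch of the two (attribute names per the Solr wiki's UpdateXmlMessages page; defaults may vary by version):

          ```xml
          <!-- hard commit: flush pending documents and open a new searcher -->
          <commit waitSearcher="true"/>

          <!-- optimize: like a commit, but first force-merges the entire index
               down to maxSegments segments; reads and rewrites all index data -->
          <optimize waitSearcher="true" maxSegments="1"/>
          ```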

          Hide
          Yonik Seeley added a comment -

          New version that also logs even when a number of segments is specified, and for expunge deletes also.

          Hide
          Yonik Seeley added a comment -

          Here's a warn patch.

          The text I used is this:

          log.warn("Starting optimize... reading and rewriting entire index.");

          It tries to just state what is going on, and tries not to indicate it's an error or that the user should not be doing it.
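          A stdlib-only sketch of the message selection this patch (and the follow-up that also covers a segment count and expungeDeletes) describes — hypothetical helper, not the actual patch; in Solr the real change would live in the update handler and feed `log.warn(...)`:

          ```java
          // Hypothetical: pick a warn message for the commit/optimize path.
          // maxSegments < 0 here means "no segment count was specified".
          public class OptimizeWarn {
              public static String message(boolean optimize, int maxSegments, boolean expungeDeletes) {
                  if (optimize && maxSegments < 0) {
                      return "Starting optimize... reading and rewriting entire index.";
                  } else if (optimize) {
                      return "Starting optimize... merging down to " + maxSegments + " segments.";
                  } else if (expungeDeletes) {
                      return "Starting expungeDeletes... rewriting segments containing deletes.";
                  }
                  return null; // plain commit: nothing noteworthy to log
              }
          }
          ```

          The point of the strings, as above, is to state what is going on without implying the user made an error.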

          Hide
          Robert Muir added a comment -

          A warning message seems over the top.

          I don't think a warning message for a deprecated command is over the top,
          how else will people know to switch to 'forceMerge' (in the case they really need it).

          We already log warning messages if people use e.g. deprecated analyzers or other things,
          I'm just suggesting we deprecate the trappy name like anything else would be deprecated.
          It seems worse to me to silently deprecate something.

          By the way: I think it would also be nice if the forceMerge required n as a parameter,
          rather than defaulting to 1.

          Here's the current wiki text (I just modified it to suggest what "infrequently" might mean... i.e. nightly, not on the minute or something), added the term "very expensive" and bolded the "entire" to draw attention to it.

          +1, I think those are good improvements.

          Hide
          Walter Underwood added a comment -

          A warning message seems over the top. There are perfectly valid reasons to do a full merge. It is just fine as the last step if you rebuild a medium to small index every day, like we did at Netflix.

          I've worked on two other engines with automatic index merging, Ultraseek and MarkLogic. One called it "full merge", the other "force merges" (I think). Neither one logged a warning.

          Hide
          Yonik Seeley added a comment - - edited

          Personally I feel the wiki text Yonik linked to is way too nice about this.

          Here's the current wiki text (I just modified it to suggest what "infrequently" might mean... i.e. nightly, not on the minute or something), added the term "very expensive" and bolded the "entire" to draw attention to it.

          An optimize is like a hard commit except that it forces all of the index segments to be merged into a single segment first. Depending on the use cases, this operation should be performed infrequently (like nightly), if at all, since it is very expensive and involves reading and re-writing the entire index. Segments are normally merged over time anyway (as determined by the merge policy), and optimize just forces these merges to occur immediately.

          I would agree to make a serious log.warn()

          I'd be fine with that part. I'll give it a shot.

          Hide
          Uwe Schindler added a comment -

          Robert: I would also agree with this. If others don't want to make optimize() a noop, I would agree to make a serious log.warn() or even better log.fatal() out of it saying that it's a bad idea in most cases. And that it's deprecated (deprecation by log printing, funny). People who call optimize or forceMerge after each single document will have a log filled with warning messages; this should make them look into it.
          In my opinion expungeDeletes and forceMerge should always print a warning-like message to the log, saying that it's doing something heavy and resource-wasteful. Optimize should additionally say that it's deprecated.

          Hide
          Robert Muir added a comment -

          I'm coming around on this issue myself though. For the benefits, optimize is not a good name. It calls out to be called. The abuse is clearly there, and we should probably try more to address it than just doc.

          My opinion is coming around to leave it for 3.x, change it to an expert option for 4 that works the same, is understated, and is called forceMerge or whatever.

          I think we can probably make improvements here, here are my ideas:

          1. any 'auto-optimization' in our own code is really bad. We should fix any auto/default
            optimizes so that if users want to optimize, they must specify it.
          2. any 'auto-optimization' in third-party integrations is equally bad, but we can fix this
            in a number of ways. Sure, making optimize a no-op is one solution, another is to
            actually fix the docs, ping those projects with an email or offer patches, etc.
          3. we can improve the docs to really emphasize to users how expensive manual
            optimize and expungeDeletes calls are. Personally I feel the wiki text Yonik linked to
            is way too nice about this.
          4. the name 'optimize' will always be a trap I think. Can't we start by adding 'forceMerge'
            and issue a deprecation warning if someone uses optimize (but still doing it anyway). Then
            the next step would be (in some future release), to return a hard error if someone uses
            'optimize', since eventually it gets removed.
          Hide
          Uwe Schindler added a comment -

          Serious over/under-engineering

          ??

          Hide
          Jason Rutherglen added a comment -

          -1 Serious over/under-engineering.

          Hide
          Yonik Seeley added a comment -

          And if we did change, naive users would be:
          "oh, optimize doesn't work any more..." (looks up what it's been changed to) "ok, changed to forceMerge."

          After forceMerge is out there for a while, it would have the same problem as optimize. Someone tries it, their queries run faster, and it gets passed along as something to try to speed things up (and it is in the right scenario). The correct path here is to document it correctly, and get rid of any bad examples in our documentation.

          Someone can add a big fat message at the top of CHANGES explaining the cost of optimize and the fact that it's often less necessary than it was in the past if they want.

          Hide
          Mark Miller added a comment -

          Lots of use of string fields that are not numerics though - the product I worked on in the past only sorted by non numeric string fields, many times lots of them at once.

          I'm coming around on this issue myself though. For the benefits, optimize is not a good name. It calls out to be called. The abuse is clearly there, and we should probably try more to address it than just doc.

          My opinion is coming around to leave it for 3.x, change it to an expert option for 4 that works the same, is understated, and is called forceMerge or whatever.

          Big -1 to making it a no op.

          Hide
          Uwe Schindler added a comment -

          Another issue that I've seen a couple of customers hit: big memory increases in the field cache as the number of segments grows. The string index values are not shared per-segment, and hence in the worst case, 2 times the number of segments equals almost 2 times the memory for the per-segment FieldCache entries.

          This goes in the same direction as my answer to Mark: With sortMissingLast support on numerics, numerics as Strings are no longer needed. So the solution here is to use real numerics.

          Hide
          Yonik Seeley added a comment -

          I am always talking about relevance-ranked results and numerics.

          And those are often not the bottleneck for Solr users.

          There are a few issues here:

          • the queries we often see in the field can be vastly more complex than the standard ones that lucene tests with
          • people are often most concerned with their slowest queries, not their average query speed (as long as they can meet throughput needs)
          • full-text search is often not the bottleneck at all

          Another issue that I've seen a couple of customers hit: big memory increases in the field cache as the number of segments grows. The string index values are not shared per-segment, and hence in the worst case, 2 times the number of segments equals almost 2 times the memory for the per-segment FieldCache entries.

          There are tradeoffs to a lot of these things, and we should be careful to not fall into a "one size fits all" mentality.

          Hide
          Mark Miller added a comment -

          With StringIndex sorting there is certainly an overhead, but as we support sortMissingLast now also for numerics, almost nobody has to use it.

          Ah, okay - that makes sense.

          Hide
          Jan Høydahl added a comment -

          The Python Django-solr search library ALWAYS calls optimize after adding documents, see indexing.py:
          http://code.google.com/p/django-solr-search/source/search?q=optimize+commit&origq=optimize+commit&btnG=Search+Trunk
          I had a customer using this library to batch-load a bunch of documents, and it took AGES and almost killed the JVM.

          Hide
          Uwe Schindler added a comment -

          100 segments?

          By comparison, the numbers for Lucene 2.9 improved dramatically; pre-2.9, optimizing was often a must, I agree! The problem was Multi* itself having priority-queue-like structures slowing down term enumeration and postings retrieval. With Lucene 3.x the difference between an optimized and a "standard 8 segment index" was always below measurement uncertainty (see lots of benchmarks from Mike on Lucene 4). For standard relevance-ranked or numerics sorting there was never a real slowdown.

          I am always talking about relevance-ranked results and numerics. With StringIndex sorting there is certainly an overhead, but as we support sortMissingLast now also for numerics, almost nobody has to use it.

          Hide
          Mark Miller added a comment -

          With Lucene 3.x there is really no slowdown at all caused by multiple segments, as each segment is searched on its own with no interaction and just the results added to same priority queue.

          Do we have benchmarks for this in some issue - would love to see some numbers.

          So, in the past, sorting certainly added a cost to multiple segments as you moved from segment to segment - did that go away in some issue? That must be completely different code these days if 100 segments or more performs like one.

          Hide
          Uwe Schindler added a comment -

          We can even handle that:

          If somebody passes optimize=true to the update request handler, we don't do anything (no optimize) and instead print a warning message to the log saying that optimize was disabled in Lucene because it has no positive effect on most installations. It should also mention that there is a new forceMerge, but people should not call it unless they know exactly what they are doing.

          The above examples and a lot more "howtos" on the web make users think they have to optimize (after every single add). After that they complain how slow Solr is. Is this really what you want?

          The FIZ Karlsruhe eSciDoc project develops the so-called Europeana project, which is supposed to index all cultural content from Europe. They are using Fedora as repository, so the above issue was like a no-go for them to use GSearch (based on Solr). If there is so much misinformation about optimize on the net, the most reasonable approach is to simply disable the feature in question to prevent further harm.

          People that rely on optimize (because they want their statistics 100% correct) will get informed by the warning messages in the logs. For them it's almost a one-line code change in their Solr client. If they don't do it, they will also not be disappointed, because:

          There is less of a slowdown - but it's certainly still there

          So in most cases they would not even notice, because new versions of Solr will bring other improvements.
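          A stdlib-only sketch of the proposed dispatch (hypothetical names; a real change would live in Solr's update request handler and use the logging framework rather than stderr):

          ```java
          // Hypothetical: deprecated optimize=true becomes a warned no-op,
          // while the expert forceMerge still runs (with its own warning).
          // Returns the action actually taken so callers can see the no-op.
          public class OptimizeShim {
              public static String handle(boolean optimize, boolean forceMerge) {
                  if (optimize) {
                      System.err.println("WARN: 'optimize' is deprecated and was NOT performed; "
                          + "use forceMerge only if you know exactly what you are doing.");
                      return "noop";
                  }
                  if (forceMerge) {
                      System.err.println("WARN: forceMerge reads and rewrites the entire index.");
                      return "forceMerge";
                  }
                  return "commit";
              }
          }
          ```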

          Hide
          Yonik Seeley added a comment -

          This is really a documentation issue. I took a shot at improving it here:
          http://wiki.apache.org/solr/UpdateXmlMessages#A.22commit.22_and_.22optimize.22

          Are there other places in the docs we need to improve (by either adding details, or removing the example altogether)?

          Hide
          Yonik Seeley added a comment -

          With Lucene 3.x there is really no slowdown at all caused by multiple segments

          There is less of a slowdown - but it's certainly still there. Whether it matters or not will depend on the exact use-cases.

          Solr has some problems with faceting, but people should use per-segment faceting and not optimize

          No... people should do whatever suits their usecase best.

          Some very well informed users of Solr still optimize. They change their index infrequently (like once a day), and have determined that the performance increases they see by optimizing make it worth it for them.

          Hide
          Uwe Schindler added a comment -

          I just repeat here, what Mike already posted on the Lucene issue:

          Some quick googling uncovers depressing examples of over-optimizing:

          https://jira.duraspace.org/browse/FCREPO-155
          http://stackoverflow.com/questions/3912253/is-it-mandatory-to-optimize-the-lucene-index-after-write
          http://issues.liferay.com/browse/LPS-2944
          http://download.oracle.com/docs/cd/E19316-01/820-7054/girqf/index.html
          https://issues.sonatype.org/browse/MNGECLIPSE-2359
          http://blog.inflinx.com/tag/lucene

          That last one has this fun comment:

          // Lucene recommends calling optimize upon completion of indexing
          writer.optimize();

          Most of the above items also affect Solr. E.g. the first one (I know people from FIZ Karlsruhe and Fedora) is really funny. Fedora GSearch calls optimize=true on every add of a single document to Solr. I even know people using Solr who complained about GSearch because of this.

          We can fix those horrible user-code bugs very fast by making optimize a no-op in Solr; they will all appreciate that. I just repeat: Nobody's installation would break, it would just get faster.

          Some funny detail: With Lucene 3.x, search actually gets faster with multiple segments if you do parallel ExecutorService-based search (I still don't really recommend using ExecutorService on IndexSearcher...). On the other hand, executing the search on a non-optimized pre-2.9 index with no per-segment search was really slower, as MultiTermsEnum and MultiDocsEnum were used.

          With Lucene 3.x there is really no slowdown at all caused by multiple segments, as each segment is searched on its own with no interaction and just the results added to the same priority queue. I agree, Solr has some problems with faceting, but people should use per-segment faceting and not optimize; this would improve their installations immensely (although the actual faceting might get slower, on the other hand FieldCaches can be reused, so it actually gets faster). The current default is global faceting and (for most installations) "optimize on every single item added" (see above links).

          Uwe Schindler added a comment -

          To come back to the original issue:
          I am very glad that Jan opened this issue. I would suggest (as mentioned in other issues, too) making optimize a no-op in Solr and adding a new forceMerge=segments parameter with loud warnings.
          This way no existing code breaks (it just no longer optimizes).

          Is this a good idea, Yonik?
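On the wire, the proposal might look like the following sketch. The forceMerge parameter name is Uwe's suggestion from this comment, not a shipped API; host and core are placeholders:

```
# today: triggers a full merge down to one segment
http://localhost:8983/solr/update?optimize=true

# under the proposal: optimize=true becomes a no-op, and the explicit,
# loudly-documented replacement takes a target segment count
http://localhost:8983/solr/update?forceMerge=1
```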

          Uwe Schindler added a comment -

          I would supply a patch, but I have no idea what config files are affected by this.

          Robert Muir added a comment -

          I will open a separate issue to remove this auto-optimize in DIH.

          This seems less controversial than the whole issue.

          If someone wants to optimize, they can still pass &optimize=true. Removing the default will
          only speed up most people's applications, especially if they often
          do incremental updates from their database.
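For reference, a sketch of the DIH behavior being discussed: optimize defaults to true on import, so today a caller has to opt out explicitly (URL and core name are placeholders):

```
# current default: this full-import ends with an implicit optimize
http://localhost:8983/solr/dataimport?command=full-import

# opting out today; Robert's proposal would make this the default
http://localhost:8983/solr/dataimport?command=full-import&optimize=false
```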

          Yonik Seeley added a comment -

          "The biggest mess is DIH - it optimizes by default which is the stupidest thing it could do"

          Are you saying that committers don't know the cost of optimize?
          If all the renaming craziness in lucene-land is going to creep into Solr, I should start vetoing those!

          Uwe Schindler added a comment - edited

          Yonik: I disagree here:
          One problem is, e.g., DIH: it optimizes by default on every incremental update, which is the stupidest thing it could do (see SOLR-3142)

          If you disagree, I would simply (as I suggested before) make optimize a no-op in Solr. Very easy and it hurts nobody, but it prevents people from doing the wrong thing.

          Yonik Seeley added a comment -

          -1

          Long-term API stability is very important, and this simply boils down to a documentation issue.

          If we changed the external API every time we thought of a slightly better name, things would be quite a mess. What might make sense for a Java library doesn't necessarily make sense for a server, and we have different back-compatibility goals. Lucene renaming something should not be a reason for Solr to do so.

          Uwe Schindler added a comment -

          The DIH default behavior is to optimize!

          Jan Høydahl added a comment -

          I propose OPTIMIZE should still work in 3.x but be deprecated and yell about it in the logs. The most straightforward approach is perhaps to add a new forceMerge command to replace the old one. Then from 4.0 the old optimize command would be a no-op.

          The reasoning behind this is that <optimize/> causes a lot of people trouble in Solr today because it's over-used due to its alluring name. I don't think anyone will miss it once it's gone, and those who really need it can start using <forceMerge/>, which is a better name anyhow.
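In XML update-message terms, the transition described above might look like this sketch (the <forceMerge/> element and its maxSegments attribute are proposed names from this issue, not a shipped API):

```xml
<!-- 3.x: still works, but logs a deprecation warning -->
<optimize waitSearcher="true"/>

<!-- 4.0: <optimize/> becomes a no-op; explicit replacement for those who need it -->
<forceMerge maxSegments="1"/>
```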


            People

            • Assignee:
              Unassigned
            • Reporter:
              Jan Høydahl
            • Votes:
              4
            • Watchers:
              8