SOLR-127

Make Solr more friendly to external HTTP caches

    Details

    • Type: Wish
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: None
    • Labels:
      None

      Description

      An offhand comment I saw recently reminded me of something that really bugged me about the search solution I used before Solr – it didn't play nicely with HTTP caches that might be sitting in front of it.

      At the moment, Solr doesn't put particularly useful info in the HTTP response headers to aid in caching (e.g. Last-Modified), responds to all HEAD requests with a 400, and doesn't do anything special with If-Modified-Since.

      At the very least, we can set a Last-Modified based on when the current IndexReader was opened (if not the Date on the IndexReader) and use the same info to determine how to respond to If-Modified-Since requests.
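
      A minimal sketch of that idea, for illustration only. The class and method names (LastModifiedSketch, lastModifiedHeader, notModified) are hypothetical and are not taken from the attached patches; the only input assumed is the time, in epoch milliseconds, at which the current reader was opened.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

/** Hypothetical helper, not part of Solr: Last-Modified / If-Modified-Since handling. */
final class LastModifiedSketch {

    // HTTP-date in its fixed-length form, always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    /** Value for the Last-Modified header, derived from the reader's open time. */
    static String lastModifiedHeader(long openTimeMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(openTimeMillis));
    }

    /** True if the client's If-Modified-Since value allows a 304 Not Modified. */
    static boolean notModified(String ifModifiedSince, long openTimeMillis) {
        if (ifModifiedSince == null) {
            return false; // no validator sent: answer with the full response
        }
        try {
            long clientSeconds = HTTP_DATE.parse(ifModifiedSince, Instant::from).getEpochSecond();
            // HTTP dates have one-second resolution, so compare whole seconds.
            long openSeconds = Instant.ofEpochMilli(openTimeMillis).getEpochSecond();
            return openSeconds <= clientSeconds;
        } catch (DateTimeParseException e) {
            return false; // unparsable validator: answer with the full response
        }
    }
}
```

      A servlet filter could call notModified(...) before running the query and, when it returns true, send a 304 with no body instead of executing the search.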

      (For the record, I think the reason this hasn't occurred to me in the 2+ years I've been using Solr is that, with the internal caching, I've yet to need to put a proxy cache in front of Solr.)

      1. HTTPCaching.patch
        56 kB
        Hoss Man
      2. HTTPCaching.patch
        53 kB
        Hoss Man
      3. CacheUnitTest.patch
        40 kB
        Thomas Peuss
      4. HTTPCaching.patch
        53 kB
        Hoss Man
      5. CacheUnitTest.patch
        32 kB
        Thomas Peuss
      6. HTTPCaching.patch
        30 kB
        Hoss Man
      7. HTTPCaching.patch
        29 kB
        Hoss Man
      8. HTTPCaching.patch
        34 kB
        Hoss Man
      9. HTTPCaching.patch
        33 kB
        Thomas Peuss
      10. HTTPCaching.patch
        31 kB
        Thomas Peuss
      11. HTTPCaching.patch
        31 kB
        Thomas Peuss
      12. HTTPCaching.patch
        30 kB
        Thomas Peuss
      13. HTTPCaching.patch
        38 kB
        Thomas Peuss
      14. HTTPCaching.patch
        38 kB
        Thomas Peuss
      15. HTTPCaching.patch
        38 kB
        Thomas Peuss
      16. HTTPCaching.patch
        38 kB
        Thomas Peuss
      17. HTTPCaching.patch
        22 kB
        Thomas Peuss
      18. HTTPCaching.patch
        19 kB
        Thomas Peuss
      19. HTTPCaching.patch
        9 kB
        Thomas Peuss
      20. HTTPCaching.patch
        7 kB
        Thomas Peuss
      21. HTTPCaching.patch
        7 kB
        Thomas Peuss
      22. HTTPCaching.patch
        7 kB
        Thomas Peuss
      23. HTTPCaching.patch
        6 kB
        Thomas Peuss
      24. HTTPCaching.patch
        7 kB
        Thomas Peuss
      25. HTTPCaching.patch
        5 kB
        Thomas Peuss
      26. HTTPCaching.patch
        4 kB
        Thomas Peuss
      27. HTTPCaching.patch
        3 kB
        Thomas Peuss

          Activity

          Hoss Man added a comment -

          Just noticed a small (and functionally irrelevant) typo in solrconfig.xml of the example dir:

          that was intentional actually ... if you uncomment that line, you have to comment out the line below it, which is an open <httpCaching> tag ... the closing tag is much farther down, after the comments and the commented-out nested <cacheControl> block. i figured it would be more obvious for people to deal with just those two lines than to have that never304="true" example be a self-closing tag and make people scroll down to find the other close tag to get rid of it.

          Walter Ferrara added a comment -

          Just noticed a small (and functionally irrelevant) typo in solrconfig.xml of the example dir:

              <!-- Set HTTP caching related parameters (for proxy caches and clients).
                    
                   To get the behaviour of Solr 1.2 (ie: no caching related headers)
                   use the never304="true" option and do not specify a value for
                   <cacheControl>
              -->
              <!-- <httpCaching never304="true"> -->
          

          look at the last line, it should be

              <!-- <httpCaching never304="true"/> -->
          

          otherwise anyone who uncomments it will get an exception

          Hoss Man added a comment -

          For the record: most of this discussion should have happened on the solr-dev list, not in the issue comments ... but i would like to address some points, so I'll do it here since this is where the discussion is.

          1) It's true, there is no way to configure caching on a per request handler basis – if you look at the history of the issue we looked into that but because of the necessary API changes we scaled back the scope of the patch – it can be done, it just needs more thought into how to do it and people interested in working on it.

          2) there is no doubt in my mind that having the cache awareness code on by default is the right approach moving forward. These options don't cause Solr to do any caching, or force any external caches to cache the pages – they only result in Solr behaving correctly according to the HTTP spec sections relating to cache headers:

          • if a request is made to Solr via an HTTP cache that cache will receive headers it can use to decide if/how-long to cache the response
          • if Solr receives a request with cache validation information then it responds with a 304

          If you don't want that behavior then either don't access Solr via a cache, or explicitly set the <httpCaching never304="true"> option; but the default behavior for people who are upgrading from 1.2 should be for Solr to emit correct headers and to respect validation requests. Requiring Solr users to explicitly turn on an option to get Solr to emit correct caching headers would be like requiring them to explicitly set an option to get well-formed XML instead of invalid XML – the default should be the one that behaves the most correctly.
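
          To make the validation half of that concrete, here is a rough sketch of how an If-None-Match check could choose between 200 and 304. ValidationSketch and statusFor are hypothetical names, the never304 parameter only mirrors the config attribute discussed above, and the comparison is a simplified literal match (it assumes entity tags contain no commas and ignores weak validators).

```java
/** Hypothetical helper, not part of Solr: picks a status code for a validation request. */
final class ValidationSketch {

    /**
     * @param ifNoneMatch raw If-None-Match header value, or null if the client sent none
     * @param currentEtag entity tag of the current response, including its quotes
     * @param never304    mirrors never304="true": never answer with Not Modified
     */
    static int statusFor(String ifNoneMatch, String currentEtag, boolean never304) {
        if (never304 || ifNoneMatch == null) {
            return 200; // validation disabled or not requested: send the full response
        }
        // The header may carry a comma-separated list of tags, or "*" for "any".
        for (String candidate : ifNoneMatch.split(",")) {
            String tag = candidate.trim();
            if (tag.equals("*") || tag.equals(currentEtag)) {
                return 304; // the client's copy is still current
            }
        }
        return 200;
    }
}
```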

          I admit however: this is a notable enough change that it should be mentioned in the "Upgrading from 1.2" section of CHANGES.txt – I will add that.

          3) if other pending patches attached to other issues have poor behavior as a result of the caching code, the appropriate place to discuss that is in those issues – the solution may be to mark those issues dependent on a new issue to add the API hooks for request handlers to suppress caching (that's a good idea in general) but it's also possible that there are better/safer/more-logical solutions specific to those patches ... if the DataImportHandler is having problems because of the caching code, i'm guessing it's because people use it to trigger updates using an HTTP GET – that violates the semantics of GET, and adding workarounds to the HttpCaching code to allow for that is a bad idea.

          4) saying only the "/select" handler should get its responses cached is misleading – under Solr 1.3 there won't be anything special about /select ... any handler name can be used for queries, and any handler name can be used for updates ... if you are issuing a request that modifies the index, you should be sending a POST, and no caching headers (or validation) will be done by Solr regardless of configuration.

          As I said, discussion about the general topic of HTTP Caching, Solr, and what the defaults should be should really happen on the solr-dev list ... if there are any further comments let's please conduct them there and then open/update whatever issues we need to once a consensus has been reached.

          Shalin Shekhar Mangar added a comment -

          I've opened SOLR-506 to have this feature configurable on a per-handler basis.

          Thanks Thomas for starting SOLR-505, together these two issues should lead to an 'ideal' solution

          Thomas Peuss added a comment - - edited

          This is indeed a useful feature for those who use a caching proxy in front. But those users are educated enough to configure it in solrconfig.xml if they need it. (BTW, we use Solr extensively and we have no caching in front of Solr.)

          True. We should disable the cache header stuff by default. Please open a new JIRA issue for that.

          In an ideal situation the 'select' handler must have it enabled by default. For all other handlers keep it off by default and provide an option to enable it (if needed)

          Exactly. We need to get a bit more specific here. I have opened SOLR-505 for that.

          Noble Paul added a comment -

          If we look at the problem that this feature is trying to solve, only the 'select' handler should need this. So making it 'enabled' by default for all handlers does not serve any purpose.

          This is indeed a useful feature for those who use a caching proxy in front. But those users are educated enough to configure it in solrconfig.xml if they need it. (BTW, we use Solr extensively and we have no caching in front of Solr.)

          In an ideal situation the 'select' handler must have it enabled by default.
          For all other handlers keep it off by default and provide an option to enable it (if needed)

          Thomas Peuss added a comment -

          It seems there is no way to disable caching on a per-handler basis.

          True. And we should work to a point where we can configure this per handler.

          I've read through the comments on this issue but I'm still not convinced as to why we need to enable HTTP Caching by default. The way I see it is that using an HTTP Caching Proxy in front of SOLR is a very rare use case and people using it in their deployments can always go and enable caching in solrconfig. The downside of enabling this by default is that there is no way right now to disable it on a per-handler basis and even if there was a way, everyone would have to explicitly do it in their configuration, which is something that we would have to educate users about unnecessarily.

          I have no problem with disabling caching headers by default. We might need functionality where some back-end module can veto emitting cache headers, or can tell the cache header code to emit headers that prevent caching of the response. This is not too hard to implement. I will look into this tonight. We can simply add two methods to the SolrQueryResponse class (like void setAvoidHTTPCaching(boolean) and boolean isAvoidHTTPCaching(); the default for the value would be false). The update request handlers should set this to true all the time. The partial response stuff can set this to true as well.
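
          A rough sketch of how such a veto flag might look. ResponseSketch is a hypothetical stand-in for SolrQueryResponse; only the two method names come from the proposal above. CacheHeaderSketch, cacheControlFor and the "no-cache, no-store" value are assumptions made for illustration.

```java
/** Hypothetical stand-in for SolrQueryResponse carrying the proposed veto flag. */
final class ResponseSketch {
    // Default is false: handlers that say nothing get the configured caching behavior.
    private boolean avoidHttpCaching = false;

    void setAvoidHTTPCaching(boolean avoid) {
        this.avoidHttpCaching = avoid;
    }

    boolean isAvoidHTTPCaching() {
        return avoidHttpCaching;
    }
}

/** Hypothetical header-writing code that honors the flag. */
final class CacheHeaderSketch {

    /**
     * Returns the Cache-Control value to emit, or null to emit no header.
     * The configured value is whatever the configuration supplies and may be null.
     */
    static String cacheControlFor(ResponseSketch rsp, String configured) {
        if (rsp.isAvoidHTTPCaching()) {
            return "no-cache, no-store"; // a handler vetoed caching for this response
        }
        return configured;
    }
}
```

          An update handler would call rsp.setAvoidHTTPCaching(true) before returning, and the header code would then ignore the configured value for that response.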

          Another way of avoiding cache headers on a per-request basis is to use POST requests. For POST requests we do not emit cache-related headers or Not Modified responses, completely following the W3C specs here.

          And while thinking about that I realize that we need to extend the tests as well, to make sure that we never emit cache-related headers in case of errors.
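
          The two rules just mentioned (nothing for POST, nothing for errors) could be captured in one small predicate. This is only a sketch: EmitDecisionSketch and mayEmitCacheHeaders are hypothetical names, and treating exactly GET and HEAD with status 200 as eligible is an assumption, not a statement about what the patch does.

```java
/** Hypothetical helper, not part of Solr: decides whether cache headers may be emitted. */
final class EmitDecisionSketch {

    static boolean mayEmitCacheHeaders(String method, int status) {
        // Assumption: only safe retrieval methods are candidates for caching.
        boolean retrieval = "GET".equals(method) || "HEAD".equals(method);
        // Assumption: error responses never carry cache-related headers.
        return retrieval && status == 200;
    }
}
```

          A unit test for the error case would then assert that a 400 response to a malformed query carries no Last-Modified, ETag or Cache-Control header.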

          And still you can already disable caching header related functionality by adding

             <httpCaching never304="true">
          

          to your solrconfig.xml.

          I appreciate the work you all have put into this issue and all I'm trying to say is that a feature used very rarely should not be enabled by default. I'd like to vote to go back to Solr 1.2 compatibility by default.

          In my world caching proxies and load balancers are the default. This might influence my view on that stuff.

          Shalin Shekhar Mangar added a comment -

          It seems there is no way to disable caching on a per-handler basis. I've read through the comments on this issue but I'm still not convinced as to why we need to enable HTTP Caching by default. The way I see it is that using an HTTP Caching Proxy in front of SOLR is a very rare use case and people using it in their deployments can always go and enable caching in solrconfig. The downside of enabling this by default is that there is no way right now to disable it on a per-handler basis and even if there was a way, everyone would have to explicitly do it in their configuration, which is something that we would have to educate users about unnecessarily.

          Our use case is the SOLR-469 DataImportHandler, which should not have responses cached at any time. But there is no way for me to do it currently. I'm sure there will be other use cases too e.g. SOLR-502 for which partial results are also cached right now.

          I appreciate the work you all have put into this issue and all I'm trying to say is that a feature used very rarely should not be enabled by default. I'd like to vote to go back to Solr 1.2 compatibility by default.

          Hoss Man added a comment -

          Committed revision 630037.

          And I updated the SolrConfigXml wiki page to mention the new config options.

          Thank you very much for all your hard work on this Thomas!

          Hoss Man added a comment -

          Changes made in this version...

          1) refactored etag cache to be core specific.

          2) changed etag calculation so that (common case) minor increments in openTime/lastModTime affect the earlier chars of the etag for faster equals comparisons (using Long.reverse)

          3) refactor config reading into SolrConfig so they don't happen on every request (the max-age regex was my main concern)

          4) refactored a bit more common code into the abstract test base
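
          To illustrate point 2: reversing the bits of the timestamp moves the fast-changing low-order bits to the front of the string, so two tags built from nearby timestamps usually differ in their first few characters and a string comparison can stop early. The sketch below is not the patch's actual calculation; EtagSketch, etag and the fixed-width hex encoding are assumptions chosen to keep the example short.

```java
/** Hypothetical helper, not part of Solr: ETag whose leading chars change on small increments. */
final class EtagSketch {

    /** Builds a quoted entity tag from a timestamp such as openTime or lastModTime. */
    static String etag(long lastModMillis) {
        // Long.reverse flips the bit order: bit 0 becomes bit 63, and so on.
        long reversed = Long.reverse(lastModMillis);
        // Fixed-width hex keeps the (formerly low-order) bits at the start of the string.
        return "\"" + String.format("%016x", reversed) + "\"";
    }
}
```

          For example, etag(1000) and etag(1001) differ in the first hex digit, whereas a plain hex rendering of the two values would differ only in the last digit.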

          Comments welcome (particularly since the multicore weakref stuff isn't something I've given a huge amount of thought to before).

          I haven't done enough manual testing to be satisfied that it's working 100%, but i think everything works as desired. (I would still like to see more unit tests of the different config variations, but it's not a huge problem or anything ... we've got the 80/20 rule going for us, there's probably other areas of the code that are more deserving of additional tests)

          Hoss Man added a comment -

          checkpoint: unification of the most recent HTTPCaching.patch and Thomas's last CacheUnitTest.patch

          (note: Thomas, if we have any more iterations of changes to the patches related to testing, it would probably be better to just keep generating a single unified patch containing everything ... having multiple patches attached to an issue is fine as long as they don't overlap, but it gets really difficult to apply multiple patches when they both add (or modify) the same files)

          next step is some MultiCore aware stuff i mentioned before .. working on that now.

          Fuad Efendi added a comment -

          Fortunately, we do not return 404 when a request tries to retrieve a removed document... In the initial design (I believe) the SOLR developers simply wrapped all exceptions into 400, and an "empty result set" is not an exception.

          Fuad Efendi added a comment -

          Thomas, Walter,

          Finally I agree, thanks!

          Middleware should not send/reroute "If-Modified-Since", and should not implement an internal cache (as in the "contr"-sample I provided): with caching enabled, it will simply retrieve cached content.

          I do not agree with 400; it leaves room for DoS attacks. A "Query parsing error" should be a 200 with caching response codes. Of course, I know RFC 2616.

          Fuad Efendi added a comment -

          Regarding HTTP-Caching-Load-Balancer between SOLR and Middleware:
          You need to deal with additional internal http-cache at middleware. In most cases Middleware generates content from different sources and can't reroute "If-Modified-Since" request to SOLR without internal caching. For instance, if you are using SOLRJ, you have to implement additional cache for SolrDocument...

          Walter Underwood added a comment -

          Two reasons to do HTTP caching for Solr: First, Solr is HTTP and needs to implement that correctly. Second, caches are much harder to implement and test than the cache information in HTTP. HTTP caches already exist and are well tested, so the implementation cost is zero and deployment is very easy.

          The HTTP spec already covers which responses should be cached. A 400 response may only be cached if it includes explicit cache control headers which allow that. See RFC 2616.

          We are using a caching load balancer and caching in Apache front ends to Tomcat. We see an increase of more than 2X in the capacity of our search farm.

          I would recommend against Solr-specific cache information in the XML part of the responses. Distributed caching is extremely difficult to get right. Around 25% of the HTTP 1.1 spec is devoted to caching and there are still grey areas.

          Fuad Efendi added a comment -

          In my configuration I do not need SOLR caching at all; but I use HTTP caching more effectively.

          An HTTPD memory and disk cache is used between Client and Middleware. There is no caching between Middleware and SOLR. Middleware responds to HTTPD with "304" if necessary, with correct Last-Modified etc., and the request does not reach SOLR. This caching configuration works fine with AJAX too, without SOLR's caching headers.

          I've seen unnecessary extra work with this implementation... taking a long time... and tried to point out the meanings of some response codes (for the Web).

          Fuad Efendi added a comment -

          I agree.
          A Caching Load Balancer between SOLR and APP Servers is an excellent idea, and it can be a "black box" without any knowledge of the SOLR API.
          AJAX can use the internal cache of the web browser; FLEX probably too...
          Question: do we need caching of static (unchanged) content from SOLR such as "400: Query parsing error"?

          Thomas Peuss added a comment -

          Think of two scenarios:

          • An AJAXified browser client sending requests to Solr. Caching unchanged data in the client and in corporate caching proxies speeds things up.
          • A cluster of Solr servers behind a load balancer with caching functionality. Middleware sends requests to Solr through the load balancer. Repeated requests for unchanged data are answered directly from the LB cache without putting load on the Solr servers. This is, for example, our scenario.

          Our code works fine with BlueCoat Webcache, Apache HTTPD proxy cache, Squid proxy cache and many other solutions because we are following standards here. So I don't really get the point of your comment.

          Besides that you can completely disable this HTTP header stuff in solrconfig.xml if you don't want it.

          Fuad Efendi added a comment -

          Of course ETag etc. will synchronize caches; but why do we need these features of the HTTP specs anyway?

          HTTP caching is widely used to cache responses from HTTP servers: content (HTML, PDF, JPG, EXE) can be cached at a corporate proxy and locally in Internet Explorer's internal cache. That is the main idea.

          Are SOLR XML responses roving the world and reaching the internal cache of Mozilla Firefox, or corporate caching proxies?

          No.

          The clients of SOLR are middleware. Do they need to act as a "caching proxy"? Maybe... Here is another use case: middleware publishes "current time" and "weather" together with the response from SOLR. The middleware wants to cache responses from SOLR and not rely on requests coming from end users, because the weather changes frequently. How it does that depends on the middleware's implementation, but it will surely try to cache SolrDocument objects instead of raw XML, and that kind of caching is not HTTP-related.

          Fuad Efendi added a comment -

          This is an alternative to the initially proposed HTTP caching, and it is extremely easy to implement:

          Simply add a request parameter http.header="If-Modified-Since: Tue, 05 Feb 2008 03:50:00 GMT" (it would be better to use a name other than http.header; see below...)
          Let SOLR respond with a standard XML message "Not Modified", and avoid using the 304 response code

          What do you think? We can even encapsulate MAX-AGE, EXPIRES, and other useful stuff (such as an additional UPDATE-FREQUENCY: 30 days) into the XML, and all of that can depend on internal Lucene statistics (and not on hard-coded values in SOLR-CONFIG).

          We should not use HTTP protocol response codes such as 304/400/500 to describe SOLR's external API.

          Sample: Apache HTTPD front-end, Tomcat (Struts-based middleware), and SOLR (backend). With your initial proposal, different users will get different data. Why? Multithreading in Apache HTTPD. At the very least there are possible fluctuations: the cache is not shared in some configurations, etc. Each thread may get its own copy of "last-modified", and different users will see different data. It won't work for most business cases.

          Without HTTP:
          "is modified?"
          "when is next update of BOOKS category?"

          • all caches around the world have the same timestamp for BOOKS category
            ... ... ...
          Thomas Peuss added a comment -

          The unit tests work now as expected. The problem described earlier occurred because of different behavior of the normal unit tests and the ones run with Jetty.

          Please be aware of the changes in

          • SolrDispatchFilter.java: the init method has changed
          • JettySolrRunner.java: additional constructor

          So we can now go ahead and get this into the codebase...

          Thomas Peuss added a comment -

          Thomas: each core has its own classloader for plugins defined in the lib directory of the solr home - but the "main" Solr code (in the solr.war) is loaded by the webapp context classloader - so static variables in "core" solr code really are singletons.

          OK. Then we need a "per-core" cache. A weak-hashmap would be sufficient to achieve this. You can use the core-name as key for example.
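          A minimal sketch of such a per-core cache, under stated assumptions: the class name PerCoreEtagCache and its methods are hypothetical and not taken from the patch. It keys the WeakHashMap on the core object itself rather than on the core name, because a String key built from a literal is interned and would never be collected, so its entry would never expire.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

/** Hypothetical per-core ETag cache; C stands in for whatever object represents a Solr core. */
final class PerCoreEtagCache<C> {
    // Weak keys: an entry becomes collectable once its core object is no longer referenced elsewhere.
    private final Map<C, String> etags = Collections.synchronizedMap(new WeakHashMap<>());

    /** Returns the cached ETag for this core, computing and storing it on first use. */
    String etagFor(C core, Function<C, String> compute) {
        return etags.computeIfAbsent(core, compute);
    }

    /** Drops the cached value, for example after the core opens a new index version. */
    void invalidate(C core) {
        etags.remove(core);
    }
}
```

          The cached values are tiny, so the main design question is only when entries disappear; weak keys tie that to the lifetime of the core object.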

          Would that explain the problems you are seeing in the test? does it relate to the etagCache?

          I am pretty sure that it does not relate to the etagCache. I think it is some static variable stuff in the SolrConfig parts. I will try to track that down tonight once I have put my daughter to bed.

          I thought the problem was that even in the "NoCache" test it was expecting to see a Cache-Control header even though solrconfig-nocache.xml doesn't have one configured?

          These tests are wrong. You are completely right. The current code should fail in the "nocache" scenario. Currently it does not because of the problem I have described.

          (We have several tests that load cores with different configs that currently work, and we've never really noticed any problems like this before ... so i'm hesitant to assume it's unrelated to the patch)

          But only one of them (the SolrJ tests) loads the Solr code through Jetty (so it might be a Jetty related problem as well).... All other tests use the Solr code directly.

          Hoss Man added a comment -

          Thomas: each core has its own classloader for plugins defined in the lib directory of the solr home – but the "main" Solr code (in the solr.war) is loaded by the webapp context classloader – so static variables in "core" solr code really are singletons.

          Would that explain the problems you are seeing in the test? does it relate to the etagCache? I thought the problem was that even in the "NoCache" test it was expecting to see a Cache-Control header even though solrconfig-nocache.xml doesn't have one configured?

          (We have several tests that load cores with different configs that currently work, and we've never really noticed any problems like this before ... so i'm hesitant to assume it's unrelated to the patch)

          Thomas Peuss added a comment -

          Fixing the test cases is not that easy. There is some caching going on somewhere inside Solr that prevents the second config (solrconfig-nocache.xml) from taking effect. Well - it is loaded according to the logfile, but Solr still uses the configured parameters from solrconfig.xml.

          So your worries about the caching are sound. The problems just appear in a different part of Solr than expected...

          I played around with some ClassLoader tricks but that has not helped so far. One solution for the problem would be to run these tests in separate processes.

          Thomas Peuss added a comment - - edited
          • the test classes still need some work, both in terms of the current failure mentioned above, and to cover more permutations of options. When we're all said and done, we'll probably want at least 3 separate sets of test/configs:
            1. default, no <httpCaching> section in config at all ... should generate Last-Mod and Etag headers and do validation, stopping/starting port should make Last-Mod change but not ETag.
            2. never304="false", lastModFrom="dirLastMod" ... should generate Last-Mod and Etag headers and do validation, no headers should change if we stop/start the port.
            3. never304="true" ... no Last-Mod or ETag headers, no 304 even if we send crazy old If-Modified-Since
          • there's also probably some refactoring that can still be done in the tests (i noticed some duplicate code that can be moved up into the Base class)

          I will take care of the tests.

          • it occurred to me while adding the etagSeed that right now the etag caching is a singleton, we'll need to make this core-specific (using a WeakHashMap i guess? i'm not fond of that approach, but these are really tiny pieces of info we are caching)
          • calcLastModified and calcEtag currently assume they can get requestDispatcher/httpCaching config options from SolrConfig ... but this needs to be reconciled with SOLR-350 where there is a plan to move all requestDispatcher configs to multicore.xml (but i've pointed out in that issue i'm not sure if that is necessary or makes sense.)

          If I remember right, every core has its own classloader, so every core has its own set of static fields. This is why real singletons are not that easy to do in Java.

          Hoss Man added a comment -

          (NOTE: in my last update where I listed the new options, I forgot about the "never304" option .. obviously that's still important).

          Before I made any changes, I attempted to merge the previous HTTPCaching.patch with CacheUnitTest.patch, and ran into a test failure in NoCacheHeaderTest ... looking at it, i'm not sure what its expectation was/is (seemed to expect a Cache-Control header even when no caching options were specified in the config) so i just left it alone for now.

          I have a new unified patch (code+tests) that does everything we talked about, but there are still some things that need to be resolved...

          • the test classes still need some work, both in terms of the current failure mentioned above, and to cover more permutations of options. When we're all said and done, we'll probably want at least 3 separate sets of test/configs:
            1. default, no <httpCaching> section in config at all ... should generate Last-Mod and Etag headers and do validation, stopping/starting port should make Last-Mod change but not ETag.
            2. never304="false", lastModFrom="dirLastMod" ... should generate Last-Mod and Etag headers and do validation, no headers should change if we stop/start the port.
            3. never304="true" ... no Last-Mod or ETag headers, no 304 even if we send crazy old If-Modified-Since
          • there's also probably some refactoring that can still be done in the tests (i noticed some duplicate code that can be moved up into the Base class)
          • it occurred to me while adding the etagSeed that right now the etag caching is a singleton, we'll need to make this core-specific (using a WeakHashMap i guess? i'm not fond of that approach, but these are really tiny pieces of info we are caching)
          • calcLastModified and calcEtag currently assume they can get requestDispatcher/httpCaching config options from SolrConfig ... but this needs to be reconciled with SOLR-350 where there is a plan to move all requestDispatcher configs to multicore.xml (but i've pointed out in that issue i'm not sure if that is necessary or makes sense.)

          Thomas: Can you take a look at the current test failure and help me understand why it's expecting a Cache-Control header? (if you want to take a stab at expanding the test case permutations too that would be cool)

          And of course, Thomas (and everyone else), please try out the code changes in the patch and the comments in the example solrconfig.xml and let me know if this looks good.

          Thomas Peuss added a comment - - edited

          That sounds like a plan. I love peer-reviews...

          Hoss Man added a comment -

          When you have a cluster of slaves then their last-mods would differ - but does that really hurt? I think no.

          The funny thing is: that's what i originally thought, and then you got me worried about it : )

          I think you are right: but let's at least give people who want to have Last-Mod headers which are in sync across all slaves an option for basing it on the dir.lastModified. We'll be giving them rope to hang themselves with if they change the cacheHeaderSeed because that will change the ETag without changing the Last-Modified, but it will be soft velvety rope that probably won't hurt since most caches are either going to use the ETag or the Last-Modified – not both. (besides: as long as it's documented well, they can always force a new snapshot when they change the cacheHeaderSeed (Hmmm... "etagSeed" is better now) so that both headers change consistently.)

          So to sum up...

          • three http caching related options...
            1. lastModFrom="openTime|dirLastMod" ... default is openTime
            2. etagSeed="arbitrary string" ... default is some constant (ie: "Solr")
            3. cacheControlHeader="arbitrary string" ... default is NULL
          • headers are computed as...
            • Last-Modified = $lastModFrom
            • ETag is a hashcode of the indexVersion and $etagSeed
            • Cache-Control is $cacheControlHeader if set (otherwise no Cache-Control header)
            • Expires is $now+$maxAge if $maxAge can be found in $cacheControlHeader (otherwise no Expires header)
          • resulting behavior...
            • Default behavior (lastModFrom=openTime)...
              • Slaves with identical snapshots will have identical Etags and Last-Mod headers that may not be exact but should tend to be close, so only a little extra load around the time of a new snapshot.
              • If you rollback an index to a previous version, you will get a new Last-Mod and ETag headers.
              • Changing configs and restarting core won't cause Etag to change, but Last-Mod will because of newly opened Searcher – If you've got an index that changes semi regularly then the ETag will get updated as soon as a new version gets opened, or you can add the etagSeed option to force new ETag on startup.
            • for people who really want Last-Mod to always be in sync across all slaves (lastModFrom=dirLastMod)...
              • Last-Mod will only ever change when index changes.
              • You probably won't care about ETags, but it will stay consistent until index changes.
              • If you change configs, and you do care about ETag, you could update the etagSeed – but there's not much point since you'll also need to generate a new snapshot on your master to force a new Last-Mod header to be updated.
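          The header rules summarized above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: the class CachingHeaderSketch, the enum LastModFrom and the method signature are hypothetical, the hash is a placeholder for "a hashcode of the indexVersion and $etagSeed", and the Expires header is left out.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: assembles Last-Modified, ETag and Cache-Control from the three options. */
final class CachingHeaderSketch {
    enum LastModFrom { OPEN_TIME, DIR_LAST_MOD }

    // Fixed-length HTTP date, always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    static Map<String, String> headers(LastModFrom lastModFrom, Instant openTime, Instant dirLastMod,
                                       long indexVersion, String etagSeed, String cacheControlHeader) {
        Map<String, String> h = new LinkedHashMap<>();
        // lastModFrom picks which timestamp backs Last-Modified; openTime is the default.
        Instant lastMod = (lastModFrom == LastModFrom.DIR_LAST_MOD) ? dirLastMod : openTime;
        h.put("Last-Modified", HTTP_DATE.format(lastMod));
        // Placeholder hash: any stable function of (indexVersion, etagSeed) fits the description.
        int hash = 31 * Long.hashCode(indexVersion) + etagSeed.hashCode();
        h.put("ETag", "\"" + Integer.toHexString(hash) + "\"");
        if (cacheControlHeader != null) {
            h.put("Cache-Control", cacheControlHeader); // written verbatim, omitted when unset
        }
        return h;
    }
}
```

          In this sketch, two slaves holding the same index version and etagSeed produce the same ETag regardless of lastModFrom, while Last-Modified only matches across slaves when it is taken from dirLastMod.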

          does that sound good?

          (fingers crossed i can bang this out on 2007-01-31 between 13:00-18:00 America/Los_Angeles ... unless you want to beat me to it Thomas : ) )

          Thomas Peuss added a comment -

          What about using the index opening time for last-modified and allow an arbitrary string for the cacheHeaderSeed? The opening time is guaranteed to be greater than both index-last-mod and config-last-mod. When you have a cluster of slaves then their last-mods would differ - but does that really hurt? I think no.

          Think of the following scenario:

          • Slave 1 has opentime X
          • Slave 2 has opentime X+2
          • Slave 3 has opentime X+4

          When you have round-robin load balancing, all clients will at some point hit Slave 3 and save X+4 as last-mod for the request. When they now issue a request with a conditional header (If-Modified-Since X+4), Solr on Slave 1 and 2 would send a 304 (Not Modified) as well. When the index changes you would get suboptimal behavior for some time - but the code would be much easier.
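          The scenario can be played through with a small sketch. The class ConditionalGetSketch and the method notModified are hypothetical names, and treating "X+2" as seconds is an assumption made only for the example.

```java
import java.time.Instant;

/** Illustrative only: the If-Modified-Since decision for three slaves with different open times. */
final class ConditionalGetSketch {
    /** True when a 304 is appropriate: the resource is not newer than the client's copy. */
    static boolean notModified(Instant lastModified, Instant ifModifiedSince) {
        return !lastModified.isAfter(ifModifiedSince);
    }

    public static void main(String[] args) {
        Instant x = Instant.parse("2008-01-31T12:00:00Z"); // arbitrary stand-in for X
        Instant[] openTimes = { x, x.plusSeconds(2), x.plusSeconds(4) };
        Instant clientCopy = x.plusSeconds(4); // Last-Modified learned from Slave 3
        for (int i = 0; i < openTimes.length; i++) {
            String status = notModified(openTimes[i], clientCopy) ? "304 Not Modified" : "200 OK";
            System.out.println("Slave " + (i + 1) + ": " + status);
        }
    }
}
```

          With the client holding X+4, every slave answers 304. A client that has only seen X+2 and then lands on Slave 3 gets one full response before it converges on X+4.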

          This would allow us to use an arbitrary string in cacheHeaderSeed for the ETags. To put semantics in cacheHeaderSeed is error prone. I don't like that.

          I am fine with the regex solution. It is both flexible and easy to code.

          Hoss Man added a comment -

          If we allow cacheHeaderSeed to be an arbitrary string, and only fold it into the ETag then what mechanism do we use to support the use case of lastModFrom="dirLastMod" when we need the Last-Modified header to change because the solrconfig.xml changed?

          The problems I see with cacheHeaderVersion being a timestamp is that you can really break your caching headers if you put a future time stamp in there. This is not allowed by the RFC. Of course we can check for a future time stamp and give a warning and use the current time instead.

          Right, but like you say: that's a solvable problem by maxing LastMod out with the current system time.

          When I remember right XML attributes don't need a value. So we can do the following:

          But we would still have the problem of knowing to output unquoted values for certain directives (max-age, s-maxage, etc...) and quoted values for others. If we have to hardcode all the directive names in code, they might as well be separate options. Taking in a single literal Cache-Control header string and using a regex to pull out the Expires is definitely appealing to me, but ...

          A regex solution should work as well (but should fail gracefully with a warning logged to the logfile)

          ...what kind of failure/warning are you worried about? I'm assuming that the Cache-Control string will be written verbatim, and if it matches "\bmax-age=(\d+)" we'll also output an Expires; if the regex doesn't match, we won't (no warning either way ... it seems perfectly normal for people to have a Cache-Control header without a max-age).
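          A rough sketch of that regex approach, using the pattern quoted above. The class ExpiresFromCacheControl and the method expires are hypothetical names, and the caller passes in the current time, which keeps the method easy to test.

```java
import java.time.Instant;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: derives an Expires time from a literal Cache-Control string. */
final class ExpiresFromCacheControl {
    private static final Pattern MAX_AGE = Pattern.compile("\\bmax-age=(\\d+)");

    /** Returns now + max-age seconds, or empty when the string has no max-age directive. */
    static Optional<Instant> expires(String cacheControl, Instant now) {
        if (cacheControl == null) {
            return Optional.empty();
        }
        Matcher m = MAX_AGE.matcher(cacheControl);
        if (!m.find()) {
            return Optional.empty(); // no max-age, so no Expires header and no warning
        }
        // Absurdly large values are not guarded against in this sketch.
        return Optional.of(now.plusSeconds(Long.parseLong(m.group(1))));
    }
}
```

          Since the directive is spelled s-maxage, a string containing only s-maxage=10 does not match the pattern and produces no Expires value.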

          Thomas Peuss added a comment -

          The cacheHeaderSeed is a good idea. It is like the version number on DNS zonefile entries. The downside of such a thing is that you have to change it manually (but Solr users are clever guys). I would see no special meaning in the seed - just a string that we mix with the version number of the index. The user can choose whatever he wants there as long as he changes it when the config changes substantially. Something like cacheHeaderSeed="20080126123300" should be as good as cacheHeaderSeed="version23". As we are caching the ETag now we can use an MD5 or SHA1 hash for the ETag as well. We simply throw the cacheHeaderSeed and the index version number into the hashing function and Base64-encode the result of the hash. With that we obfuscate the index version as well for the paranoid ones and always have an ETag of the same size independent of the length of the seed. Additionally the ETag changes completely if only one bit has changed. This makes the equals check for the ETag a bit faster as well.
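
          The hashing step described above could look roughly like this. It is a sketch only: the class and method names are made up for illustration, SHA-1 is picked from the two options mentioned, and the zero-byte separator between seed and version is an added assumption to keep the two inputs from running together.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper: mixes the seed and the index version into a fixed-size ETag.
class EtagSketch {
    static String etagFor(String cacheHeaderSeed, long indexVersion) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            sha1.update(cacheHeaderSeed.getBytes(StandardCharsets.UTF_8));
            sha1.update((byte) 0); // separator (assumption, not from the thread)
            sha1.update(Long.toString(indexVersion).getBytes(StandardCharsets.UTF_8));
            // 20 digest bytes always encode to 27 Base64 characters without padding.
            String encoded = Base64.getEncoder().withoutPadding().encodeToString(sha1.digest());
            return "\"" + encoded + "\""; // ETag values are quoted strings
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

          Because the result depends only on the seed and the index version, two slaves with the same snapshot and the same seed would produce the same ETag.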

          The problem I see with cacheHeaderVersion being a timestamp is that you can really break your caching headers if you put a future timestamp in there. This is not allowed by the RFC. Of course we can check for a future timestamp, give a warning, and use the current time instead.
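
          The check-and-warn fallback mentioned above is small enough to sketch. The names are hypothetical and java.util.logging merely stands in for whatever logger the real code would use.

```java
import java.time.Instant;
import java.util.logging.Logger;

// Hypothetical helper: never hand out a timestamp that lies in the future.
class LastModifiedClamp {
    private static final Logger LOG = Logger.getLogger(LastModifiedClamp.class.getName());

    static Instant clampToNow(Instant configured, Instant now) {
        if (configured.isAfter(now)) {
            LOG.warning("Configured timestamp " + configured
                    + " is in the future; using " + now + " instead");
            return now;
        }
        return configured;
    }
}
```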

          If I remember right, XML attributes don't need a value. So we can do the following:

          <cacheControl max-age="23" no-cache no-store must-revalidate private="Foo" qwert="666" />
          ...becomes...
          Cache-Control: max-age="23", no-cache, no-store, must-revalidate, private="Foo", qwert="666"
          

          But again a very good idea to be flexible here. But the named list syntax might be easier to handle in the code. A regex solution should work as well (but should fail gracefully with a warning logged to the logfile). max-age is the only value that is of interest for the code.

          Hoss Man added a comment -

          Ad 2.: Whatever we choose: Two things must be linked: changed index and/or changed config must change the Etag and the Last-Modified

          I'm not sure that this is strictly true ... if something changes the Etag, then the Last-Modified should also change, but if the Last-Modified changes the Etag doesn't necessarily have to change. consider use cases where solrconfig.xml never changes: we can use openTime for Last-Modified (in case we have to rollback to an older index), and indexVersion for the ETag - bouncing the server will change the Last-Mod because a new searcher is opened, but the Etag won't change because the index hasn't changed.

          here's what i'm thinking...

          • two new options (we can probably think of better names for these)...
            1. lastModFrom="openTime|dirLastMod" ... default is dirLastMod
            2. cacheHeaderSeed="[some date format]" ... default is epoch
          • headers are computed as...
            • Last-Modified = the max(lastModFrom, cacheHeaderSeed) ... where lastModFrom is computed using the specified value
            • ETag is a hashcode of the indexVersion and cacheHeaderSeed
          • resulting behavior...
            • Users who aren't picky get the default where slaves with identical snapshots will have identical Etags and Last-Mod headers.
            • Changing configs by default won't immediately change the Etag or Last-Mod header ... if you've got an index that changes semi regularly you can just touch the index to get new headers, or you can add the cacheHeaderSeed option with a timestamp value to force new headers on startup.
            • if you are super paranoid about making sure your headers are always a perfect reflection of reality (even if you rollback your index to an older copy) use lastModFrom="openTime" and update the cacheHeaderSeed option every time you change your config ... downside being that in multi-slave setups every machine will generate a different Last-Mod (but the ETags should be the same)
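
          The Last-Modified rule from the list above, sketched in Java. The enum, class and parameter names are placeholders chosen for illustration; in this sketch the seed is already parsed into a timestamp, and the default seed of "epoch" corresponds to Instant.EPOCH.

```java
import java.time.Instant;

// Hypothetical helper: Last-Modified = max(lastModFrom, cacheHeaderSeed).
class LastModifiedSketch {
    enum LastModFrom { OPEN_TIME, DIR_LAST_MOD }

    static Instant lastModified(LastModFrom from,
                                Instant searcherOpenTime,
                                Instant dirLastModified,
                                Instant cacheHeaderSeed) {
        // Pick the base timestamp according to the lastModFrom option.
        Instant base = (from == LastModFrom.OPEN_TIME) ? searcherOpenTime : dirLastModified;
        // A newer seed wins, which is what forces new headers after a config change.
        return base.isAfter(cacheHeaderSeed) ? base : cacheHeaderSeed;
    }
}
```

          With the default seed the result is simply the chosen base timestamp, so users who never touch the option see no difference.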

          ...thoughts?

          One comment only: change must-revalidate="" to must-revalidate="true/false" . For no-store/no-cache as well.

          yeah, that's what i was thinking originally, except i wanted to leave out any special knowledge about what the attributes were (ie: no hardcoded list of directive names) .. any XML attribute in the config would automatically become a directive in the header value, if it had a value in the config, it would have a directive value in the header..

          <cacheControl max-age="23" no-cache="" no-store="" must-revalidate="" private="Foo" asdf="" qwert="666" />
          ...becomes...
          Cache-Control: max-age="23", no-cache, no-store, must-revalidate, private="Foo", asdf, qwert="666"
          

          ...that way we don't have to worry about any HTTP extensions, people can put anything they freaking want in their Cache-Control header. What i forgot until today though is that the numeric directives in the Cache-Control header aren't supposed to be quoted (ie: max-age=23 ... not max-age="23") ... so that won't work very easily either.
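
          To make the quoting problem concrete, here is a sketch of the naive attribute-to-directive mapping, with no knowledge of directive names. The class name is made up, and a plain ordered map stands in for the parsed XML attributes.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: every attribute becomes a directive; non-empty values are echoed quoted.
class CacheControlFromAttributes {
    static String render(Map<String, String> attributes) {
        StringJoiner header = new StringJoiner(", ");
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            String value = e.getValue();
            if (value == null || value.isEmpty()) {
                header.add(e.getKey()); // bare directive, e.g. no-cache
            } else {
                // Always quoted, because nothing here knows which directives are numeric.
                header.add(e.getKey() + "=\"" + value + "\"");
            }
        }
        return header.toString();
    }
}
```

          Without a hardcoded list of numeric directives this code cannot tell that max-age should be emitted unquoted, which is exactly the limitation noted above.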

          So then i started thinking maybe we use the named list syntax, and let the data type tell us whether or not the value should be quoted (<str>) or not (<int>) ... but that seems awfully verbose for something this simple ... so now i'm wondering if maybe we should just make it be one big string and use a regex to look for max-age so we can set the Expires header as well.

          I'm liking the simple string + regex approach personally.

          Thomas Peuss added a comment -

          Updated unit test. The cache and no-cache tests have now been split into different files. A final update has to take place when the cache related code is stable.

          Thomas Peuss added a comment - - edited

          <requestDispatcher handleSelect="true" >
            ...
            <!--
              Set HTTP caching related parameters (for proxy caches and clients).
              (These are the defaults)
              <httpCaching lastModifiedFrom="openTime">
                <cacheControl max-age="30" must-revalidate="" private="Foo" />
              </httpCaching>
            -->
            <!-- to prevent Solr from doing any HTTP Cache related work uncomment this... -->
            <!--
              <httpCaching never304="true" />
            -->
            <!-- to prevent Solr work, and to be really unfriendly to caches, uncomment this... -->
            <!--
              <httpCaching never304="true">
                <cacheControl max-age="0" no-cache="" no-store="" must-revalidate="" private="Foo" />
              </httpCaching>
            -->
            ...
          </requestDispatcher>

          One comment only: change must-revalidate="" to must-revalidate="true/false" . For no-store/no-cache as well.

          Thomas Peuss added a comment -

          Getting back to the question of the ETag though, i think it would be better to use a hashCode on the config itself ... if the index hasn't changed, and the config hasn't changed restarting Solr shouldn't make the ETag change.

          It is a good idea to use a hash of the config as well. But we need to document somewhere that identical slaves need identical indexes and identical config files to produce the same ETag.

          "expect" ? ... uh, i have no expectations from you ... Solr is a volunteer project, no one is expected to do anything other than contribute when/where/however they can

          I know. "Expect" might have been the wrong word for that. I only want to make sure that we do not work on the same stuff. I love the peer review you get with OSS projects.

          First and foremost: do you think being able to customize the "cache awareness" of Solr on a per request handler basis is important enough that we shouldn't move forward until we figure out a way to make it work, or do you think it's useful to have a single SolrCore wide configuration for this sort of thing?

          A SolrCore wide config for this is enough IMHO.

          Assuming we're on the right track, my game plan moving forward is:
          1) i'm going to start playing around with the config options and the control flow logic to make sure we don't do 304 style validation work when we shouldn't
          2) i suggest we think/discuss the openTime/lastModified and config modified / ETag issues a little more before making any changes there
          3) the tests will need to be refactored so we have at least 2 variants ("doing caching right", "not doing caching because we said not to") ... if you want to take a look at doing that now, that would be great - particularly since i'm not very familiar with the framework Ryan set up for doing JUnit tests that actually spin up Jetty to do the HTTP layer.

          Ad 2.: Whatever we choose: Two things must be linked: changed index and/or changed config must change the Etag and the Last-Modified time (this must be changed on config change as well!). Last-Modified must be the maximum of config file change time and index change time...
          Ad 3.: Most of the time I spent on the unit test went into figuring out how this Jetty stuff works... I'll have a look at this.

          Hoss Man added a comment -

          checkpoint.

          made the logic changes as discussed ... etag and lastMod calculation will now only happen if needed based on config. Cache-Control header is always generated according to the solrconfig.xml. I also did some method refactoring and renaming to try and make it a little more explicit what was happening where, and fixed two small bugs i found: 1) even on a HEAD request we need to execute the request because it might fail; 2) catch and ignore IAE when parsing the date conditional headers (ie: a malformed date shouldn't cause the page to fail)
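
          The second bug fix amounts to treating an unparseable conditional date as if the header were absent. Below is a sketch of that behaviour, not the code from the patch: it uses java.time, where the failure surfaces as a DateTimeException rather than the IAE mentioned above, and the class and method names are hypothetical.

```java
import java.time.DateTimeException;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical helper: decides whether an If-Modified-Since header allows a 304.
class ConditionalGetSketch {
    static boolean notModifiedSince(String ifModifiedSince, Instant lastModified) {
        if (ifModifiedSince == null) {
            return false; // no conditional header, serve the full response
        }
        try {
            Instant since = ZonedDateTime
                    .parse(ifModifiedSince, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            // HTTP dates only have one-second resolution, so compare at that granularity.
            return !lastModified.truncatedTo(ChronoUnit.SECONDS).isAfter(since);
        } catch (DateTimeException e) {
            return false; // malformed date: ignore the header instead of failing the request
        }
    }
}
```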

          config syntax is the same as the last patch.

          Hoss Man added a comment -

          Incidentally, the config syntax i'm thinking might work best is something like...

          <requestDispatcher handleSelect="true" >
            ...
            <!--
              Set HTTP caching related parameters (for proxy caches and clients).
              (These are the defaults)
              <httpCaching lastModifiedFrom="openTime">
                <cacheControl max-age="30" must-revalidate="" private="Foo" />
              </httpCaching>
            -->
            <!-- to prevent Solr from doing any HTTP Cache related work uncomment this... -->
            <!--
              <httpCaching never304="true" />
            -->
            <!-- to prevent Solr work, and to be really unfriendly to caches, uncomment this... -->
            <!--
              <httpCaching never304="true">
                <cacheControl max-age="0" no-cache="" no-store="" must-revalidate="" private="Foo" />
              </httpCaching>
            -->
            ...
          </requestDispatcher>
          

          ...the idea being that any attribute under <cacheControl> becomes an option in the Cache-Control header .. if it has a non-empty value, then that value is echoed as well. Expires header will also be output if max-age is specified.

          Hoss Man added a comment -

          Good point. You can get around that problem by using the openTime for the ETags as well.

          yeah ... ugh ... i'm actually starting to question whether or not openTime is even the right choice for Last-Mod ... you made a really good point before about it causing Last-Mod times to differ between multiple (identical) slaves, but at least the ETags would be in sync ... if we add openTime to the ETag we lose even that.

          my initial concern about using IndexReader.lastModified for Last-Mod was the case where someone rolls back an index, but that's really the exceptional case ... most people will probably never encounter it (and if they do, they can work around it by "touching" the segments file ... or we could have another option for it ... lastModFrom="open|disk" ... what do you think?)

          Getting back to the question of the ETag though, i think it would be better to use a hashCode on the config itself ... if the index hasn't changed, and the config hasn't changed restarting Solr shouldn't make the ETag change.

          What do you expect from me now? Should I have a look at the testcase?

          "expect" ? ... uh, i have no expectations from you ... Solr is a volunteer project, no one is expected to do anything other than contribute when/where/however they can

          seriously though: you've clearly thought about this task more than anyone else at this point, i'm just throwing out ideas and concerns, if you think i'm making stupid suggestions, or over thinking something, or not thinking hard enough about something else let me know.

          First and foremost: do you think being able to customize the "cache awareness" of Solr on a per request handler basis is important enough that we shouldn't move forward until we figure out a way to make it work, or do you think it's useful to have a single SolrCore wide configuration for this sort of thing?

          Assuming we're on the right track, my game plan moving forward is:
          1) i'm going to start playing around with the config options and the control flow logic to make sure we don't do 304 style validation work when we shouldn't
          2) i suggest we think/discuss the openTime/lastModified and config modified / ETag issues a little more before making any changes there
          3) the tests will need to be refactored so we have at least 2 variants ("doing caching right", "not doing caching because we said not to") ... if you want to take a look at doing that now, that would be great – particularly since i'm not very familiar with the framework Ryan set up for doing JUnit tests that actually spin up Jetty to do the HTTP layer.

          Thomas Peuss added a comment -

          1) it occurs to me that the etag value needs to include some kind of hashCode for the solrconfig.xml - otherwise someone could bounce their server (without changing the index) and continue to get identical ETag headers, even if the new config options cause entirely different results to be generated (ie: new default handler params)
          (We probably ought to be including the getVersion() info from both Solr and the specified request handler as well - just in case they deploy new code that has new behavior without modifying the index, or their configs .. but i'm not really as worried about this ... i'm OK with a FAQ saying you have to make a small change to your solrconfig.xml to force new ETags)

          Good point. You can get around that problem by using the openTime for the ETags as well.

          2) currently, even if the configs say "don't be cache friendly" an etag is still computed, and requests are tested for validation headers (it's even possible to get a 304 if you guess the etag or pick a really old If-Modified-Since header) ... this seems like a bad idea (and i believe it violates the RFC) .. so we should make sure no special work is done relating to cache headers if the solrconfig.xml says to disable it completely.

          True. I missed that one.

          What do you expect from me now? Should I have a look at the testcase?

          Hoss Man added a comment -

          revised version of Thomas's most recent patch, that removes the backwards-incompatible changes to SolrRequestHandler by moving all configuration related to caching config options into the <requestDispatcher> block...

                 <!--
                    Set HTTP caching related parameters (for proxy caches and clients).
                    
                    To get the behaviour of Solr 1.2 (ie: no caching related headers)
                    use the noCachingHeaders="true" option
                  -->
              <!-- :TODO: it would be nice to mimic the directives of the Cache-Control header more closely -->
              <httpCaching httpCacheTTL="30"
                           httpCacheForceRevalidation="false"
                           httpCacheForcePrivate="false" />
          

          ...as noted in that TODO line, i'd like to rethink what the exact options should be, but that's a minor issue compared to the functionality itself.

          NOTE: unit tests currently fail, since caching is now either on or off for the entire server, the test will probably need to be refactored into two separate tests with different configs.

          Hoss Man added a comment -

          Thomas: first off .. thanks a lot for putting so much effort into this. Looking over your patch, and seeing the hoops you had to jump through to get per request handler configuration, i feel bad for ever even suggesting it.

          We definitely shouldn't make a backwards incompatible change like you needed with the getDefaults() to deal with the caching. I think for now, we should stick with your earlier approach of putting the configuration in the <requestDispatcher> block ... perhaps down the road we will have an easier mechanism for per-handler overrides (maybe using the new components stuff?) but even if we do, having some default configs in <requestDispatcher> will be good.

          I've got a modified version of your patch that moves back in this direction (but keeps some of the other good stuff you've added recently) that i'll attach in a moment.

          At a higher level, i have a few broader questions/concerns that we should probably think about...

          1) it occurs to me that the etag value needs to include some kind of hashCode for the solrconfig.xml – otherwise someone could bounce their server (without changing the index) and continue to get identical ETag headers, even if the new config options cause entirely different results to be generated (ie: new default handler params)
          (We probably ought to be including the getVersion() info from both Solr and the specified request handler as well – just in case they deploy new code that has new behavior without modifying the index, or their configs .. but i'm not really as worried about this ... i'm OK with a FAQ saying you have to make a small change to your solrconfig.xml to force new ETags)

          2) currently, even if the configs say "don't be cache friendly" an etag is still computed, and requests are tested for validation headers (it's even possible to get a 304 if you guess the etag or pick a really old If-Modified-Since header) ... this seems like a bad idea (and i believe it violates the RFC) .. so we should make sure no special work is done relating to cache headers if the solrconfig.xml says to disable it completely.
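
          A minimal sketch of the two points above, not the code from the patch: the ETag mixes a checksum of the config text into the index version, and no validation is attempted when caching is switched off. The class name, method names and the use of CRC32 are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper, not part of Solr.
class EtagSketch {

    /** Builds a quoted (strong) ETag from the index version plus a checksum of the config text. */
    static String etag(long indexVersion, String configText) {
        CRC32 crc = new CRC32();
        crc.update(configText.getBytes(StandardCharsets.UTF_8));
        return "\"" + Long.toHexString(indexVersion) + "-" + Long.toHexString(crc.getValue()) + "\"";
    }

    /** Only honours the validator when caching is enabled; otherwise the request is processed normally. */
    static boolean mayAnswerNotModified(boolean cachingEnabled, String ifNoneMatch, String currentEtag) {
        if (!cachingEnabled || ifNoneMatch == null) {
            return false;
        }
        return ifNoneMatch.equals(currentEtag);
    }

    public static void main(String[] args) {
        String before = etag(42L, "<config defaults='a'/>");
        String after = etag(42L, "<config defaults='b'/>");
        // Same index version, different config text: the two tags differ.
        System.out.println(before + " vs " + after);
    }
}
```

          With this shape, bouncing the server with an edited solrconfig.xml changes the ETag even though the index version is unchanged.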

          Hoss Man added a comment -

          this patch is functionally the same as the last patch from Thomas but updated to work against the HEAD (r614702) without patch warnings.

          i'm reviewing the patch in depth now.

          Thomas Peuss added a comment -
          • Performance optimization:
            • ETag is only recalculated when the index changes
            • Shorter ETag
          • Updated to trunk
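
          A rough sketch of the optimization described in this comment, assuming the ETag depends only on the index version: the tag is rebuilt only when the version changes, and base 36 is used to keep it short. The class and method names are hypothetical and the encoding is an assumption, not what the patch does.

```java
// Hypothetical helper, not part of Solr.
class CachedEtag {
    private long cachedVersion;
    private String cachedEtag;   // null until the first call
    private int computations;    // counts rebuilds, for demonstration only

    /** Returns the ETag for the given index version, rebuilding it only when the version changes. */
    synchronized String etagFor(long indexVersion) {
        if (cachedEtag == null || cachedVersion != indexVersion) {
            // Base 36 gives a shorter string than decimal for the same long value.
            cachedEtag = "\"" + Long.toString(indexVersion, 36) + "\"";
            cachedVersion = indexVersion;
            computations++;
        }
        return cachedEtag;
    }

    synchronized int computations() {
        return computations;
    }

    public static void main(String[] args) {
        CachedEtag cache = new CachedEtag();
        cache.etagFor(1199145600000L);
        cache.etagFor(1199145600000L); // served from the cached value
        cache.etagFor(1199145600001L); // index changed, tag rebuilt
        System.out.println("rebuilds: " + cache.computations());
    }
}
```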
          Thomas Peuss added a comment -

          Updated. Even more aggressive no-cache headers get emitted when httpnocache=true.

          Thomas Peuss added a comment -

          Added the request parameter "httpnocache" (can be true or false - defaults to false) to emit "no-cache" Cache-Control headers for requests you do not want to be cached by shared caches.

          Thomas Peuss added a comment -

          Updated to trunk. The test was failing because there was an error in the testcase.

          Thomas Peuss added a comment -

          I can reproduce the error. It seems to be caused by the multi-core stuff that was committed yesterday...

          Hoss Man added a comment -

          Thomas: I have not looked at any of the patch updates since my last comment, but it is my sincere plan to spend some serious time on this issue this week if possible – and if not, then starting on Jan2 when i get back from vacation. (it's on my "work" todo list not just my "spare time apache" todo list)

          Otis Gospodnetic added a comment - - edited

          I gave my vote for this one, but ant test failed for me after I applied this patch:

          junit.framework.AssertionFailedError: Unknown request response expected:<0> but was:<400>
          at org.apache.solr.servlet.CacheHeaderTest.checkResponseBody(CacheHeaderTest.java:175)
          at org.apache.solr.servlet.CacheHeaderTest.doCacheControl(CacheHeaderTest.java:328)
          at org.apache.solr.servlet.CacheHeaderTest.testCacheControl(CacheHeaderTest.java:152)

          Thomas Peuss added a comment -

          Minor performance update.

          Thomas Peuss added a comment -

          Updated to trunk. Any chance to get this into SVN soon?

          Thomas Peuss added a comment -

          Updated as Otis suggested. One thing I don't like about this patch is that it changes the contract of the interface SolrRequestHandler. Maybe a Solr guru can tell me how to avoid that.

          I need access to the request handler's settings before execution. Before execution, getParams() only delivers the parameters that are on the URI...

          Otis Gospodnetic added a comment -

          Thomas, minor comment: httpCacheLivetime --> httpCacheTTL?

          Thomas Peuss added a comment -

          Updated patch inspired by Hoss Man's comments.

          Changes:

          • Cache header settings can now be set per request handler. Omitting the parameters switches off cache header generation (fall back to old behaviour).
            • <int name="httpCacheLivetime">0</int>: Set "freshness" timespan in seconds
            • <bool name="httpCacheForceRevalidation">true</bool>: controls if we emit "must-revalidate"
            • <bool name="httpCacheForcePrivate">false</bool>: controls if we emit "private"
          • Some refactoring to make the Filter class smaller
          • Updated testcase to check that we do not emit cache headers on POST requests.
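
          As an illustration of how the three settings above could map onto a Cache-Control value, here is a minimal sketch. It is not the code from the patch: the class and method names are hypothetical, and the GET/HEAD check is an assumption about how "no cache headers on POST requests" might be decided.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr.
class CacheControlSketch {

    /** Builds a Cache-Control value from the freshness lifetime and the two boolean switches. */
    static String cacheControl(int ttlSeconds, boolean forceRevalidation, boolean forcePrivate) {
        List<String> directives = new ArrayList<>();
        directives.add("max-age=" + ttlSeconds);
        if (forceRevalidation) {
            directives.add("must-revalidate");
        }
        if (forcePrivate) {
            directives.add("private");
        }
        return String.join(", ", directives);
    }

    /** Cache headers are only emitted for GET and HEAD, never for POST or other methods. */
    static boolean emitsCacheHeaders(String method) {
        return "GET".equals(method) || "HEAD".equals(method);
    }

    public static void main(String[] args) {
        // The values listed in the comment: lifetime 0, revalidation forced, not private.
        System.out.println(cacheControl(0, true, false));
    }
}
```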
          Hoss Man added a comment -

          In no particular order...

          Ignore my question about weak etags (w/), this is what happens when I review patches tired ... i forgot getVersion() returns a long AND i misread how weak etags work.

          I wasn't saying that i think we need to do a hash to "hide" the version, just pointing out that some people might consider it divulging more info than we should. if no one else cares, i don't care (especially if it's prohibitively expensive)

          I like the idea of not emitting caching headers in response to POST requests ... the RFCs say that POSTs by default aren't cacheable, right? that also seems like a reasonable solution to the issues of typical "/update" urls all having both identical etags and urls, as well as "If-None-Match" leading to PRECON_FAIL.

          Having explicit config options for the Cache-Control header seems good .. i wonder if we should make it a requestHandler option (instead of a SolrCore option).

          In regard to this comment...
          "The default value is no-cache, no-store when the tag is not there for backward compatibility."
          ...that's not really true. Total backwards compatibility would be no new headers at all ... if someone has a surrogate proxy in front of Solr 1.2, it can use its own configs or heuristics to decide how long to cache. as soon as we include a Cache-Control header that stops working.

          I think the default behavior can be "conservative" headers (Last-Modified, ETag, and must-revalidate), that's probably the best thing for new users. But ideally there should be a way to turn it off completely (it's good to have a mechanism for people upgrading to guarantee they get the same behavior as before).

          Thomas Peuss added a comment -

          What do you expect as default behaviour of this patch?

          • Emit no cache related headers at all
          • Emit conservative cache related headers
            • for example max-age=0, must-revalidate - this should work with every not completely broken cache implementation without breaking anything (besides pushing performance because you offload the Solr-server)
          • Emit more "cachy" headers
            • for example max-age=600
          Thomas Peuss added a comment -

          We should think about what a bad guy can do with that information: nothing. It is not an id or key that elevates your rights or something like that when you know it.

          I would opt for writing out the index version directly as well: it is fast and simple...

          Yonik Seeley added a comment -

          > Index version is now an MD5 hash: I am not sure what information we really expose here. It is time consuming to create the hash.

          I guess the info reveals when the index was created (defaults to current milliseconds, and is incremented by 1 for each new committed change).

          But, doing any sort of hash on this version number alone isn't really secure since I can guess perhaps within a day of when the index was created, and there are only about 86M milliseconds in a day. Since the algorithm is known, I can try them all if I want. But really, I don't see the harm in letting someone see the index version either.

          If we want to obfuscate it for some reason, we should just use something simple and fast...

          Thomas Peuss added a comment -
          • Index version is now an MD5 hash: I am not sure what information we really expose here. It is time consuming to create the hash.
          • Cache-Control HTTP header can now be configured in solr-config.xml:
            <requestDispatcher handleSelect="true" >
              <!--Make sure your system has some authentication before enabling remote streaming! -->
              <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />

              <httpCacheControlHeader>no-cache, no-store</httpCacheControlHeader>

            </requestDispatcher>

          The default value is no-cache, no-store when the tag is not there for backward compatibility.

          Thomas Peuss added a comment -

          ad 1.)
          I have thought about that as well. We should make it configurable. But I do not know where the best place is for the configuration.

          ad 2.)
          getVersion() delivers a long - how should that ever be converted to w/? According to W3C a weak ETag looks like this: W/"xyzzy". We always generate ETags like "xyzzy". So no problem here. Even "W/xyzzy" would be a strong ETag.

          Hashing of the version is a good idea. I'll add that. But be aware that generating a hash consumes a lot of extra CPU cycles...

          ad 3.)
          As the answer to a request is always the same when the index is not changed, it is OK to have the same ETag for all requests IMHO. The ETag does not have to be unique per request URL. The ETag only allows the browser to send requests like "only execute when changed". And the ETag only changes when the index has changed.

          ad 4.)
          I have thought about that as well. The problem here is that we have POSTs that change the index and POSTs that do not change the index. The semantics are according to W3C. Here is a snippet from http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html (section 14.26):
          "Instead, if the request method was GET or HEAD, the server SHOULD respond with a 304 (Not Modified) response, including the cache-related header fields (particularly ETag) of one of the entities that matched. For all other request methods, the server MUST respond with a status of 412 (Precondition Failed)."

          The idea behind that seems to be that POSTs are for changing things. But we can ignore that of course.

          ad 5.)
          Maybe we should not emit cache related headers for POSTs at all?
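
          The rule quoted under "ad 4.)" and the weak-tag form from "ad 2.)" can be sketched as follows. This is an illustration under assumptions, not the patch: the class and method names are hypothetical, the list is assumed to hold already-parsed entity tags, and a return value of 200 simply stands for "handle the request normally".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Solr.
class ConditionalSketch {

    /** A weak entity tag is written as W/"..."; a plain quoted string is a strong tag. */
    static boolean isWeak(String entityTag) {
        return entityTag.startsWith("W/\"");
    }

    /**
     * Applies the If-None-Match rule from RFC 2616 section 14.26:
     * on a match, GET and HEAD get 304, every other method gets 412.
     */
    static int statusForIfNoneMatch(String method, List<String> ifNoneMatch, String currentEtag) {
        boolean matches = ifNoneMatch.contains("*") || ifNoneMatch.contains(currentEtag);
        if (!matches) {
            return 200; // no match: process the request as usual
        }
        boolean getOrHead = "GET".equals(method) || "HEAD".equals(method);
        return getOrHead ? 304 : 412;
    }

    public static void main(String[] args) {
        List<String> sent = Arrays.asList("\"xyzzy\"");
        System.out.println(statusForIfNoneMatch("GET", sent, "\"xyzzy\""));
        System.out.println(statusForIfNoneMatch("POST", sent, "\"xyzzy\""));
    }
}
```

          Not emitting cache headers for POSTs at all, as suggested under "ad 5.)", would mean clients rarely have a tag to send back on a POST, so the 412 branch would seldom be reached.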

          Hoss Man added a comment -

          Okay, i've been learning a little more about HTTP Caching, and i looked over the latest patch. a few comments...

          1) do we really want this in all cases?...

          + resp.setHeader("Cache-Control",
          + "max-age=0, must-revalidate");

          ...that seems a little harsh. if we're going to do that it seems like it should be optional. (NOTE: it's not backwards compatible if people already have caches in front of Solr right now)

          2) I've been reading about etags ... we need to make sure we don't inadvertently output an etag with "w/" in front (indicating it's a weak entity tag) ... we should future proof against changes to IndexReader.getVersion() by putting a prefix on the version when making an etag. also: should we obfuscate the version (ie: hash it) so as not to risk disclosing info to people who shouldn't have it?

          3) all etags are the same until the reader is reopened ... shouldn't they also hash on the URL? (is there a downside to multiple URLs having the same etag?)

          4) are these semantics right? send PRECON_FAIL when "If-None-Match" tag matches and request is not GET or HEAD? (what about a POSTed query?) ....

          + if(ifNoneMatchList.size()>0 && isMatchingEtag(ifNoneMatchList,etag)) {
          +   if(isGETRequest || isHEADRequest) {
          +     sendNotModified(resp);
          +   } else {
          +     sendPreconditionFailed(resp);
          +   }
          +   return true;

          5) using SolrIndexSearcher.openTime() as last-modified for query requests makes sense ... but what about updates? since RequestHandlers don't declare what they are, should we use "now" for POSTs and openTime for GET/HEAD ?

          Erik Hatcher added a comment -

          targeting this for the 1.3 release.

          Thomas Peuss added a comment -

          Added a unit test to check correct cache header behavior.

          Thomas Peuss added a comment -

          Added a unit test for the cache header stuff...

          Thomas Peuss added a comment -

          Some code cleanup and refactoring.

          Thomas Peuss added a comment -

          Changed the behavior to first check for ETag related headers. Clients that support ETags are more likely to get a cache hit.

          Thomas Peuss added a comment -

          1.) I now use the time the reader was opened. But we should be aware of the fact that when we have two servers (for HA reasons for example) these times will differ for sure. For clients with ETag support this is no problem because the ETag will still be the same.
          2.) You are right, it is ETag. Clients/servers handle the headers case-insensitively. This is why I have not seen that...
          3.) Some clients support only ETag, some support only Last-Modified, many support both. That's why we should support both. And you are right: ETags can do more than what we use them for.

          You spoke about taking the result of a search into account. Maybe we are talking about two different things here. This patch is about getting load off the server. When we want a 100% confident client then we need to take the server response into account. But currently I don't see a big benefit of this and it makes the code much more complex.

          Walter Underwood added a comment -

          Last-modified does require monotonic time, but ETags are version stamps without any ordering. The indexVersion should be fine for an ETag.

          Hoss Man added a comment -

          1) it's not a good idea to assume the indexVersion can be used as a timestamp ... Lucene does not guarantee that. To be safe we should record the timestamp we opened the index at. (using the lastModified on files in the Directory is a bad idea as well ... someone could swap out an index with a backup and get "older" files that represent a "newer" index from Solr's perspective)

          2) isn't the header named "ETag" (not "Etag") ?

          3) I'm not an expert on all this new fangled HTTP/1.1 stuff ... but is an ETag based on the URI and the indexVersion/timestamp really that useful? wouldn't the Last-Modified header in that case be just as useful? I thought the value add of an ETag was that even if the content has been modified, if that modification results in no real changes, old cached values can still be useful. with Solr specifically in mind, the index may have changed, but if the results of a query are identical to the results before the change, those can have the same ETag right? wouldn't a hash of the URI and the SolrQueryResponse make more sense in that regard?
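
          A small sketch of the first point, recording the time the index was opened and using it to answer If-Modified-Since. The names are hypothetical and this is not the patch itself. It assumes the header value arrives as epoch milliseconds with -1 meaning "header absent", which is how the servlet API's getDateHeader reports it. HTTP dates only carry whole seconds, so the recorded time is truncated before comparing.

```java
// Hypothetical helper, not part of Solr.
class LastModifiedSketch {

    /** HTTP dates have one-second resolution, so drop the millisecond part. */
    static long toHttpResolution(long millis) {
        return (millis / 1000L) * 1000L;
    }

    /**
     * True when the index was opened at or before the time the client last fetched the response.
     * A negative ifModifiedSinceMillis means the request carried no If-Modified-Since header.
     */
    static boolean notModifiedSince(long readerOpenedMillis, long ifModifiedSinceMillis) {
        if (ifModifiedSinceMillis < 0) {
            return false;
        }
        return toHttpResolution(readerOpenedMillis) <= ifModifiedSinceMillis;
    }

    public static void main(String[] args) {
        long opened = System.currentTimeMillis(); // recorded once, when the reader is opened
        System.out.println(notModifiedSince(opened, toHttpResolution(opened)));
    }
}
```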

          Thomas Peuss added a comment -

          Be even more standards compliant. If-Match and If-None-Match headers can appear multiple times.
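
          To illustrate what "can appear multiple times" implies, here is a hedged sketch that flattens several If-None-Match header values, each possibly a comma-separated list, into one list of entity tags. The class and method names are hypothetical. Splitting on commas is a simplification that would break on an entity tag containing a comma inside its quotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Solr.
class EtagListSketch {

    /** Collects every entity tag from all occurrences of a header such as If-None-Match. */
    static List<String> collectEtags(List<String> headerValues) {
        List<String> tags = new ArrayList<>();
        for (String value : headerValues) {
            // Simplification: assumes no entity tag contains a comma.
            for (String part : value.split(",")) {
                String tag = part.trim();
                if (!tag.isEmpty()) {
                    tags.add(tag);
                }
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Two header lines, the first carrying two tags.
        List<String> values = Arrays.asList("\"a\", \"b\"", "\"c\"");
        System.out.println(collectEtags(values));
    }
}
```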

          Thomas Peuss added a comment -

          After reading the W3C docs I saw that we can calculate the ETags in a much simpler way.

          Thomas Peuss added a comment -

          Added Etag support.

          Thomas Peuss added a comment -

          @Erik: Adding ETag support should not be that hard. I'll look into it. As the ETag value I propose a hash of the request URI and the index version.

          @Koji: I'll check whether I can update the request count for Not-Modified responses. The only question is whether we really want that. The current counter tells us how many requests really reached the search handler. In my opinion that is what I want, because Not-Modified responses put very little pressure on the server.
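
          The ETag proposal above (a hash of the request URI and the index version) might look something like the following. It is a minimal sketch under stated assumptions, not necessarily what the patch does: EtagSketch and etagFor are hypothetical names, SHA-1 is an arbitrary choice of hash, and the index version is assumed to be available as a long.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: not part of Solr or of the attached patches.
final class EtagSketch {

    // Returns a quoted entity tag derived from the request URI and the index version.
    static String etagFor(String requestUri, long indexVersion) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(requestUri.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator so URI and version cannot run together
            md.update(Long.toString(indexVersion).getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder("\"");
            for (byte b : md.digest()) {
                sb.append(String.format("%02x", b));
            }
            return sb.append('"').toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("ETag: " + etagFor("/select?q=solr", 42L));
    }
}
```

          With this scheme the tag changes whenever the index version changes, even if the results for a given query stay the same.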

          Erik Hatcher added a comment -

          What about etags? http://intertwingly.net/blog/2006/06/05/Elevator-Pitch

          Koji Sekiguchi added a comment -

          Note that this patch can do a lot to improve throughput, but unfortunately Solr doesn't take that into account in its stats (SOLR-176).

          Thomas Peuss added a comment -

          Some code cleanup and a fixed typo.

          Thomas Peuss added a comment -

          Solr now responds nicely to HEAD-requests.

          Thomas Peuss added a comment -

          I have not found out (OK, I only tried for 10 minutes) where these HEAD requests get blocked. It should be easy to fix once you find the right location...

          Thomas Peuss added a comment -

          Make Solr a bit more friendly for HTTP caches.


            People

            • Assignee:
              Hoss Man
              Reporter:
              Hoss Man
            • Votes:
              3
              Watchers:
              1


                Development