Entities gzipped by mod_deflate still carry the same ETag as the plain entity, causing inconsistency in ETag-aware proxy caches. It is very important that each unique entity carries a unique ETag, as the ETag identifies the specific entity variant of the URL. Each negotiated variant (where Accept-Encoding is just one negotiation parameter) needs to have a unique ETag. For mod_deflate it's as simple as adding the encoding to the already computed ETag.

This has implications for at least the following HTTP request headers:

If-None-Match: used in Vary negotiation from ETag-aware caches
If-Range: ranges in the gzipped entity are obviously not the same as ranges in the plain entity
If-Match: mainly conditional PUT requests

Example HTTP responses from an Apache-2.2.2 mod_deflate enabled server (irrelevant headers pruned):

Plain request:
  Server: Apache/2.2.2 (Fedora)
  ETag: "76e23-1835-4156af5e53ac0"
  Content-Length: 6197
  Vary: Accept-Encoding,User-Agent

Same request with "Accept-Encoding: gzip":
  Server: Apache/2.2.2 (Fedora)
  ETag: "76e23-1835-4156af5e53ac0"
  Vary: Accept-Encoding,User-Agent
  Content-Encoding: gzip
  Content-Length: 1829

Implications of this:
* Clients may be given the incorrect response. In effect the first cached response is given to all clients, as If-None-Match indicates the entity is OK for all clients. (The same ETag is used in both responses, so the If-None-Match request is the same, and mod_deflate cannot tell whether the If-None-Match condition refers to a compressed or a plain entity.)
* Clients doing range requests with If-Range may end up with corrupted objects containing part compressed, part plain content.

Squid-2.6 and later is ETag-aware and will make this problem quite visible. Release date for Squid-2.6 is 1/7 (i.e. in less than a month).
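To make the first implication concrete, here is a minimal toy model in Python of an ETag-aware cache revalidating against an origin. It is only a sketch under stated assumptions: the names `origin` and `Cache` are hypothetical, the cache revalidates on every request, and the `-gzip` suffix is just one possible way of making the ETags distinct. It is not how Squid or Apache are implemented.

```python
BASE_ETAG = '"76e23-1835-4156af5e53ac0"'

def origin(headers, distinct_etags):
    """Toy origin for a single URL: returns (status, etag, body)."""
    wants_gzip = "gzip" in headers.get("Accept-Encoding", "")
    if wants_gzip:
        # Either reuse the plain ETag (the reported bug) or mark the variant.
        etag = BASE_ETAG[:-1] + '-gzip"' if distinct_etags else BASE_ETAG
        body = "<gzip bytes>"
    else:
        etag, body = BASE_ETAG, "<plain bytes>"
    # Conditional check: answer 304 if the selected variant is already cached.
    if etag in headers.get("If-None-Match", "").split(", "):
        return 304, etag, None
    return 200, etag, body

class Cache:
    """Toy ETag-aware cache that always revalidates with If-None-Match."""
    def __init__(self, distinct_etags):
        self.variants = {}            # etag -> stored body
        self.distinct_etags = distinct_etags

    def fetch(self, client_headers):
        headers = dict(client_headers)
        if self.variants:
            headers["If-None-Match"] = ", ".join(self.variants)
        status, etag, body = origin(headers, self.distinct_etags)
        if status == 304:
            # "Use the variant with this ETag": serve whatever is stored under it.
            return self.variants[etag]
        self.variants[etag] = body
        return body

shared = Cache(distinct_etags=False)
shared.fetch({"Accept-Encoding": "gzip"})   # stores the gzip body
print(shared.fetch({}))                      # client without gzip support
```

With shared ETags the second client is handed the stored gzip body, because the 304 names an ETag that the cache associates with the first response. Constructing the cache with `distinct_etags=True` makes the second request return the plain body instead.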
Created attachment 18407: patch that'll cause mod_filter to unset the etag - see util_filter.h

This needs more discussion before committing this or any other patch.
Some references My ETag notes: http://devel.squid-cache.org/ Old dev discussions: http://mail-archives.apache.org/mod_mbox/httpd-dev/200311.mbox/%3C3FB2E075.5010705@modperlcookbook.org%3E http://mail-archives.apache.org/mod_mbox/httpd-dev/200311.mbox/%3C3FB2C0CB.30401@sun.com%3E http://mail-archives.apache.org/mod_mbox/httpd-dev/200206.mbox/%3C00b801c20daf$736dd190$c000000a@KOJ%3E
(In reply to comment #1)
> Created an attachment (id=18407)
> patch that'll cause mod_filter to unset the etag - see util_filter.h
>
> This needs more discussion before committing this or any other patch.

Sorry for my confusion, but this will only work if mod_deflate is used via mod_filter, right? It will not work if mod_deflate is used without mod_filter. I guess for this case it is needed to unset the ETag header inside mod_deflate. So something like the following:

Index: mod_deflate.c
===================================================================
--- mod_deflate.c (Revision 411469)
+++ mod_deflate.c (Arbeitskopie)
@@ -389,6 +389,7 @@
         apr_table_mergen(r->headers_out, "Content-Encoding", "gzip");
     }
     apr_table_unset(r->headers_out, "Content-Length");
+    apr_table_unset(r->headers_out, "ETag");
 
     /* initialize deflate output buffer */
     ctx->stream.next_out = ctx->buffer;
(In reply to comment #3)
> Sorry for my confusion, but this will only work if mod_deflate is used via
> mod_filter, right?

Yes. I mentioned that to the reporter in IRC, but not here.

> apr_table_unset(r->headers_out, "Content-Length");
> + apr_table_unset(r->headers_out, "ETag");

Ugh. That way every filter has to reinvent protocol handling. A fertile breeding ground for bugs (and we have a history to prove it). mod_filter is designed to centralise that, so we only need to get the protocol right once.
(In reply to comment #4)
> > apr_table_unset(r->headers_out, "Content-Length");
> > + apr_table_unset(r->headers_out, "ETag");
>
> Ugh. That way every filter has to reinvent protocol handling. A fertile
> breeding ground for bugs (and we have a history to prove it). mod_filter is

Yes, and I am pretty sure we have this history :-), BUT mod_filter is not mandatory to use.

> designed to centralise that, so we only need to get the protocol right once.

Agreed, but then we must either make the use of mod_filter (or at least of these parts) mandatory or incorporate them into the core filter routines.
From a protocol perspective, removing the ETag is sufficient to make you compliant. If conditionals (If-xxx) don't work right on transformed responses anyway, there is not much benefit in sending an ETag out. But if you can, it's better to send an ETag. As I said initially, you don't need to compute a new etag; just adding some extra detail to the tag is fine, e.g. "638f3e-6-1b6d6340-gzip" or similar for a gzipped entity where the base entity had the etag "638f3e-6-1b6d6340". To HTTP the etag is just a string, with the only requirement that it must be unique for each entity variant of the same URL. Actually, I think adding details to the ETag may simplify many things for you, as the core routines can then make quick assessments of conditionals if it's possible to infer how the object has been processed from looking at the entity tag.
Any progress on getting this patch (or another reasonable alternative) into the mod_deflate tree?
(In reply to comment #7) > Any progress on getting this patch (or another reasonable alternative) into the > mod_deflate tree? It needs raising on dev@ so we can reach a consensus solution. Bugzilla has only proved that we have more than one competing solution.
This needs to be fixed by mod_deflate producing a new etag. How we do that is going to take some investigation, since it doesn't do any good to produce the etag unless we can also check it on conditional requests.
My suggestion is to simply extend the existing etag with a gzip marker, for example adding ;gzip at the end or something like that. I.e. if the original reply had

  ETag: "6bf1f7-6-1b6d6340"

then make mod-gzip translate this to

  ETag: "6bf1f7-6-1b6d6340;gzip"

This should allow for easy bidirectional mapping, simplifying most conditionals as no transformation of the entity body is needed to find the etag, and the simple format makes it easier to trace should any misunderstandings occur.
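The bidirectional mapping suggested here could be sketched in Python as follows. The function names are hypothetical and this is not mod_deflate code; it only illustrates that the marker can be added and removed with plain string handling, without touching the entity body.

```python
from typing import Tuple

GZIP_MARKER = ";gzip"  # marker proposed in the comment; any fixed string would do

def add_gzip_marker(etag: str) -> str:
    """Map the identity ETag to the gzip variant's ETag (marker inside the quotes)."""
    if not (etag.endswith('"') and '"' in etag[:-1]):
        raise ValueError(f"not a quoted entity tag: {etag!r}")
    return etag[:-1] + GZIP_MARKER + '"'

def remove_gzip_marker(etag: str) -> Tuple[str, bool]:
    """Map back: return (identity ETag, whether the tag named the gzip variant)."""
    suffix = GZIP_MARKER + '"'
    if etag.endswith(suffix):
        return etag[: -len(suffix)] + '"', True
    return etag, False

gz = add_gzip_marker('"6bf1f7-6-1b6d6340"')
print(gz)                      # "6bf1f7-6-1b6d6340;gzip"
print(remove_gzip_marker(gz))  # ('"6bf1f7-6-1b6d6340"', True)
```

Because the reverse mapping also reports whether the marker was present, a server can still tell which variant a conditional request refers to.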
Pinging dev@ one more time..
Just committed a fix to make any ETag weak if we transform the entity. Hopefully this should fix protocol compliance (and our users) without being controversial.
Not sufficient. The two versions are not semantically equivalent, as one cannot be exchanged for the other without breaking the protocol. In the context of If-None-Match the weak comparator is used in HTTP, and there a strong ETag is equal to a weak ETag.
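For readers unfamiliar with the two comparison functions, here is a small Python sketch of strong and weak entity-tag comparison as described in RFC 2616 section 13.3.3. The helper names are hypothetical and the parsing is deliberately minimal.

```python
def split_etag(etag):
    """Return (is_weak, opaque_tag) for an entity tag such as W/"a" or "a"."""
    weak = etag.startswith("W/")
    return weak, (etag[2:] if weak else etag)

def strong_match(a, b):
    """Strong comparison: both tags must be strong and identical."""
    weak_a, tag_a = split_etag(a)
    weak_b, tag_b = split_etag(b)
    return not weak_a and not weak_b and tag_a == tag_b

def weak_match(a, b):
    """Weak comparison: opaque tags must be identical; weakness is ignored."""
    return split_etag(a)[1] == split_etag(b)[1]

# Making the gzip variant's tag weak does not separate it from the plain one:
print(weak_match('W/"76e23-1835-4156af5e53ac0"', '"76e23-1835-4156af5e53ac0"'))
```

Under weak comparison the weakened tag still matches the strong tag of the plain entity, which is the objection raised in this comment.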
Can you elaborate in more detail why you think that the two versions are not semantically equivalent? I read 13.3.3 in a way that they are.
Because you cannot exchange the gzipped variant with the identity encoded variant without causing breakage. The two do not mean the same thing to a recipient who does not know how to handle gzip. The two are only semantically equivalent for a recipient capable of handling gzip, but not to HTTP in general, as HTTP does not guarantee that clients can handle gzip. If they were semantically equivalent then there would be no need for conditional mod_gzip compression, or the use of Vary, at least not other than to reduce the load on the server under peak load...
What you can do is either

a) Drop the ETag completely. This is not optimal but works.
b) Or modify the ETag value in some manner, for example by adding a constant string in front of or after the original ETag.

In 'b', if the compression is not deterministic and does not always result in the same encoding, then the ETag should additionally be made weak, to make sure no one attempts merging partial responses down the line.

The main downside of 'a' is that ETag-aware caches will then cache multiple copies of the same object, one for each slight variance in the headers indicated by Vary. For Apache itself it's not such a big difference until conditional requests (i.e. If-None-Match) work properly in the presence of filters like mod_deflate.
(In reply to comment #15)
> Because you can not exchange the gzip:ed variant with the identity encoded
> variant without causing breakage. The two do not mean the same thing to a
> recipient who do not know how to handle gzip.

Bugzilla is the wrong place for this discussion. It should be on dev@httpd.

Only a recipient that can handle gzip will be served the gzipped version.

> The two is only semantically equivalent for a recipient capable of handling
> gzip, but not to HTTP in general as HTTP do not guarantee clients can handle gzip.

HTTP provides a separate mechanism for negotiating that.

> If they were semantically equivalent then there would be no need for conditional
> mod_gzip compression, or the use of Vary, at least not other than to reduce the
> load on the server under peak load...

Huh? Those exist precisely because we need to cater for different clients.
(In reply to comment #17)
> Only a recipient that can handle gzip will be served the gzipped version.

Which isn't true due to this bug. If there is an ETag-aware cache between the client and Apache, the client will be given whatever the previous client could handle.

> Huh? Those exist precisely because we need to cater for different clients.

Exactly.
(In reply to comment #18)
> (In reply to comment #17)
>
> > Only a recipient that can handle gzip will be served the gzipped version.
>
> Which isn't true due to this bug. If there is a ETag aware cache between the
> client and Apache the client will be given whatever the previous client could
> handle.

The intermediate got a weak ETag. So the intermediate has been told that the entity is equivalent but not byte-by-byte identical, and may be subject to negotiated transformation. Therefore the intermediate is responsible for dealing with content-negotiated properties.

Do you have a particular intermediate in mind, when you propose something that treats a weak ETag as strong?
2.2 r608849 http://svn.apache.org/viewvc?view=rev&revision=608849
http://svn.apache.org/viewvc?view=rev&revision=581198 http://svn.apache.org/viewvc?view=rev&revision=607219
This fix needs improvements. The ETag needs to be quoted; this fix adds -gzip outside the quotes, so I get things like ""638f3e-6-1b6d6340"-gzip which is ugly and not very RFC compliant.

Now, another problem I ran into. I have two servers with mod_deflate and mod_cache. If, behind the same Squid proxy, some servers have DeflateCompressionLevel set to 1 and others have DeflateCompressionLevel 7, with mod_cache enabled, I get validation on different contents. (Yes, I know that's a strange setup.)
Created attachment 23050: patch that fixes the etags transformed by mod_deflate to be quoted strings

Addresses the issue raised by Maxime Ritter: currently the ETags that are transformed by mod_deflate are not properly quoted and are not RFC compliant.
Created attachment 23051: fix etag checking in content handlers by stripping "-gzip" from etags in if headers

A problem with adding "-gzip" to etags is that it breaks etag checking in If-* headers for content handlers (e.g. mod_dav), which will not recognize the "-gzip" etag as a valid etag for any entity of the resource. One way to fix this is to strip the "-gzip" suffix from the etags in If-None-Match and If-Match request headers. Attaching a patch to achieve this. It implements a fixup hook in mod_deflate and fixes etags in the respective headers.

The patch has been tested with mod_dav_fs for If-Match and If-None-Match headers, with and without gzip encoding.

Note: this patch depends on the previous patch (https://issues.apache.org/bugzilla/attachment.cgi?id=23050) having already been applied.
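The idea behind this patch can be illustrated with a short Python sketch. This is not the attached C patch: `strip_gzip_suffix` is a hypothetical helper, and splitting the header on commas is a simplification that would mis-handle an opaque tag containing a comma.

```python
SUFFIX = '-gzip"'

def strip_gzip_suffix(header_value: str) -> str:
    """Rewrite an If-Match / If-None-Match value so that "x-gzip" becomes "x"."""
    rewritten = []
    for tag in header_value.split(","):   # naive split, see note above
        tag = tag.strip()
        if tag.endswith(SUFFIX):
            tag = tag[: -len(SUFFIX)] + '"'
        rewritten.append(tag)
    return ", ".join(rewritten)

print(strip_gzip_suffix('"2106e9-2c-3e9564c23b60-gzip", W/"abc"'))
```

A content handler that only knows the unmodified ETag can then evaluate the condition, at the cost of no longer knowing which variant the client actually holds.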
Committed patch to trunk fixing the creation of invalid Etag headers such as Etag: "2106e9-2c-3e9564c23b60"-gzip instead of Etag: "2106e9-2c-3e9564c23b60-gzip" mod_deflate ignores invalid Etag headers not starting with a double quote, and weak Etag headers starting with "W/". http://svn.apache.org/viewvc?view=rev&revision=740149
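A quick way to see why the first form is invalid is to check both values against the entity-tag syntax. The Python sketch below uses a simplified pattern (an optional W/ prefix followed by a double-quoted string, ignoring backslash escapes), so it is an approximation of the grammar rather than a full parser; `is_valid_entity_tag` is a hypothetical name.

```python
import re

# Simplified entity-tag syntax: [ "W/" ] followed by a quoted string.
ENTITY_TAG = re.compile(r'(W/)?"[^"]*"')

def is_valid_entity_tag(value: str) -> bool:
    """True if the whole value is a single (optionally weak) quoted tag."""
    return ENTITY_TAG.fullmatch(value) is not None

print(is_valid_entity_tag('"2106e9-2c-3e9564c23b60"-gzip'))   # False
print(is_valid_entity_tag('"2106e9-2c-3e9564c23b60-gzip"'))   # True
```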
What do you mean by ignoring weak etags? If there is a weak ETag then it needs to be transformed as well, or removed. If not, you'll still break caches out there, as object variants are identified by their ETag.
(In reply to comment #26)
> What do you mean by ignore weak etags?
>
> If there is a weak ETag then it needs to be transformed as well, or removed. If
> not you'll still crash caches out there as object variants is identified by
> their ETag.

This is debatable. A weak ETag, IMHO, doesn't mean that two entities with the same weak ETag are the same at the binary level. The weak ETag only changes when the meaning of the entity changes (13.3.3, 3rd paragraph).
(In reply to comment #26)
> What do you mean by ignore weak etags?

Well, the original code was only adding the gzip marker when the ETag did not start with "W/", and I didn't change this behavior.
A weak ETag means the two are interchangeable for the same request (semantically equivalent) but may differ significantly at the octet level. A gzip encoded and an identity encoded entity are not interchangeable without serious breakage.
Just to clarify the breakage:

  GET /some-object

  HTTP/1.1 200 OK
  Vary: Accept-Encoding
  ETag: W/"a"

  GET /some-object
  Accept-Encoding: gzip
  If-None-Match: W/"a"

  HTTP/1.1 304 Not Modified
  ETag: W/"a"

If you are unsure what this is about, see 13.6 Caching Negotiated Responses.

To explain it in other words: Two resource versions MAY share the same weak ETag (but MUST NOT when using a strong ETag), but two incompatible resource representations MUST NOT.
The HTTP syntax error has been fixed in trunk, but the problem motivating this report is a no-win situation no matter how it is "fixed". The only good answer is "don't use mod_deflate", because changing content-encoding on the fly in an inconsistent manner (neither "never" nor "always") makes it impossible for later requests regarding that content (e.g., PUT or conditional GET) to be handled correctly. This is, of course, why performing on-the-fly content-encoding is a stupid idea, and why I added Transfer-Encoding to HTTP as the proper way to do on-the-fly encoding without changing the resource.

mod_deflate is written as a content filter that can be arbitrarily added to the output chain after the request is processed, just before the body goes out on the wire. If mod_deflate modifies ETag on the way out, then its corresponding later requests must be reverse-modified (etags and request content) on the way back. The problem here is that the DEFLATE filter is usually added after the request is processed, based on the media type of the response, so there is no clear way of selectively inflating a corresponding PUT or conditional request before the request processing is begun, especially if the request has been proxied to another server. We would have to add a corresponding input filter whenever the output filter is configured and ensure that it would activate under the same conditions, based on the request header fields, as DEFLATE/INFLATE does for responses. I am still looking at this option.

Preprocessing all incoming conditional headers to remove a -gzip suffix before the request is processed won't work. In a chain of Apache servers, we won't know which server set the suffix and how many caches have stored the modified ETag versus the unmodified ETag. We can't add some random unique id to the suffix, either, since we need the tag to persist across restarts. In any case, that solution becomes so complex that we are better off deleting the module.
Finally, we can't just remove the ETag because then the unfiltered content has an ETag but the filtered content does not, which puts us back to the point of messing up a cache that is checking the 304 response for consistency. Likewise, removing etags for the entire configured scope allows clients to use the last-modified timestamp for range requests, which would be just as bad as not changing ETag. The best solution is to implement transfer-encoding as an http protocol filter module.
Deleting the ETag and Content-Location is safe with respect to caches, even if very suboptimal in terms of HTTP performance and cache validation robustness. It's highly undesirable, but still better than sending out the same ETag or Content-Location on incompatible responses.

Sending responses with an ETag and/or Content-Location which MAY be shared by an incompatible response for the same URL makes a true mess for caches. This applies to both 200 and 304 responses. The mess gets injected by the processing of 304 responses, which may make incompatible content that is identified as equal migrate between different requests.

And yes, Roy is absolutely right. HTTP is not well suited for on-the-fly content recoding. You can't both eat the cake and keep it unless you put in a lot of effort, far beyond the recoding itself. To anyone external to the server it SHOULD look like the recoding is in fact done statically, with different representations stored on the server (i.e. page.html and page.html.gz) as negotiated by mod_negotiation. That means unique ETags and unique Content-Location, plus the If-* conditions working properly for all combinations. Not impossible, but not easy either.
Hi,

I refer to https://issues.apache.org/bugzilla/show_bug.cgi?id=47253 . In this situation, IMHO, fixing it this way is a retrograde step. There is a question I would like to ask here: why don't you define a policy?

- The ETag is formed as usual.
- The configuration tree that handles a resource adds a fourth digest part to the ETag value.
- Each filter that transforms content on the fly has to add a suffix.

I imagine each handler has a unique key (say "DE" for mod_deflate, "IN" for mod_include, etc.). "Configuration tree" means a string consisting of all keys [ simply: next = etag_uint64_to_hex(next, "IN:CH:DE"); ], so a change of configuration will be unique in the ETag as well. Sensitive configuration information should be encoded twice (the first time non-reversibly, by including a static unique ID of the environment) to keep it from clients. The configuration tree is already being parsed, so there is only a small overhead in procedures like that.

https://svn.apache.org/viewcvs.cgi?view=rev&rev=761835 : it worked fine before that change. You only need to remove all added suffixes and then check conditional requests. Before the change, ETag values were composed like this: "%{resource_digest}"-suffix. Only check %{resource_digest} (modules/http/http_protocol.c; function ap_meets_conditions(); line 270).

The worst is yet to come: I see no way to account for mod_ext_filter. The handler module should remove everything from the Cache-Control response header that makes the response cacheable, and maybe statically add directives that force it to be uncacheable.

I'm sorry, English is not my native language. ;-\

With best regards from Berlin
eddi
We're encountering this problem using Apache as a front end SSL / compressing accelerator. Is there any chance of getting a patch in to permit stripping the -gzip suffix from incoming ETags based on a configuration option? In our topology we know precisely where -gzip is added, and thus how to strip it safely; we'd rather strip it in the stage matching where it was added outbound, rather than at a different step. I realise that this isn't theoretically complete, but crucially for us, it would do the job reliably.
*** Bug 49358 has been marked as a duplicate of this bug. ***
Any news about this more than five years old bug?
Since different gzip compression levels are semantically equal, you can _always_ send a weak tag when using gzip with any compression level, with a -gzip suffix like this:

  ETag: W/"76e23-1835-4156af5e53ac0-gzip"
The server response contains a proper Vary header clearly indicating that the response varies depending on whether the client is able to accept gzip content or not. In that case, the responsibility lies with the intermediate proxy to make sure all conditional header checks are met before sending a cached response for an ETag.

For the implications mentioned in the description:
* The repeat request from the same client will have the same value for the "Accept-Encoding" header as well as the User-Agent string, meaning Apache has sufficient information to decide whether to send a plain text or gzipped response. If-None-Match can have the same ETag value in both cases and the server should still have no problem deciding which response to send.
* The above logic covers range queries as well.
This has already been discussed to death and still comes back...

There is no escape from the rule that each variant of a given URL MUST have a unique ETag value, or none at all. How the ETag value is formed is entirely up to the server implementation, and it may carry any amount of unstructured and structured data as needed by the server to uniquely identify a variant.

Weak ETags have slightly different rules but are irrelevant to this discussion. They apply, e.g., when using dynamic adjustment of gzip encoding levels based on CPU load, but not to identity vs gzip encoded variants, which are semantically different.

(In reply to Anshul from comment #38)
> The server response contains proper Vary header clearly indicating the
> response varies depending whether client is able to accept gzip content or
> not.

No. The server has sent a Vary header indicating that the server's variant selection depends on the content of the Accept-Encoding header in the request, and quite often User-Agent as well.

> In that case, the responsibility lies with the intermediate proxy to make
> sure all conditional headers check are met before sending a cached response
> for an ETag.

No. It's the origin server's responsibility to perform variant selection. Caches use If-None-Match to ask the server which variant among a set of known cached variants of the requested URL should be used in response to unknown request combinations. The response to such requests ONLY says "Use the variant with ETag XXXX".

Semantically transparent proxies are not allowed to guess what variant selection preferences the server has, e.g. which browsers it has blacklisted content of type X for, or which browsers the server knows handle gzip content encoding when there is no Accept-Encoding header present. gzip compression is only a tiny little bit of server side variant selection.

The same mechanism for selecting the correct variant among a set of cached variants of a URL is used for a wide variety of response variance (selection of content-encoding, content-language, content-type, browser based, custom headers, etc.).

Note: Apache mod_negotiation does the right thing in all cases known. Issues only arise with dynamic content encoding with mod_deflate (and a number of other similar modules performing dynamic content transformation), which often forget about the meaning of ETag and its relation to If-None-Match.
Just so folks know, the authoritative text on this topic will soon be: http://tools.ietf.org/html/draft-ietf-httpbis-p4-conditional-26
I've been going through the comments and some of the links mentioned, and I'm unsure whether this issue will be resolved or whether I should implement a work-around on my side. This ticket has been open for 9 years now and it is still relevant with Apache 2.4. Is it a WONTFIX? What's the current status here for Apache users? Thanks
And again, three years have passed and DeflateAlterETag is available in trunk, but not in 2.4. What's holding this back? BrotliAlterETag made it into 2.4, so remarks about not mangling ETag are moot now. It's happening for mod_brotli, it's about time for mod_deflate too.
I have just reported the very same issue with Tomcat: Bug 63932.
I have not studied the code in detail, but a note of warning about stripping information from the etag in If clauses: it is not always straightforward. For example, if you add a nice gzip suffix to the etag when gzip encoding and then strip this in If-None-Match processing, then you risk creating the same problem all over again, as you lose the distinction between an identity encoded variant and a gzip encoded variant. What you should do is reconstruct the actual etag the server would have responded with and then compare this with the If clauses. You may take whatever shortcuts you like in this process as long as the result is consistent.
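The difference between the two approaches can be shown with a small Python sketch. All names are hypothetical, the "-gzip" suffix scheme is assumed, only a single tag per header is handled, and the Accept-Encoding check is deliberately naive (it ignores q-values).

```python
def current_etag(base_etag, accept_encoding):
    """ETag the server would send for this request under a -gzip suffix scheme."""
    if "gzip" in accept_encoding:        # naive negotiation, for illustration only
        return base_etag[:-1] + '-gzip"'
    return base_etag

def matches_by_stripping(if_none_match, base_etag):
    """Risky shortcut: drop the suffix, then compare with the base ETag."""
    return if_none_match.replace('-gzip"', '"') == base_etag

def matches_by_reconstruction(if_none_match, base_etag, accept_encoding):
    """Compare against the ETag that would actually be sent for this request."""
    return if_none_match == current_etag(base_etag, accept_encoding)

base = '"638f3e-6-1b6d6340"'
held = '"638f3e-6-1b6d6340-gzip"'   # a cache holds only the gzip variant

# Revalidation on behalf of a client that sent no Accept-Encoding:
print(matches_by_stripping(held, base))            # True: a 304 would be wrong
print(matches_by_reconstruction(held, base, ""))   # False: a full 200 is sent
```

Stripping reports a match even though the selected variant is the identity encoded one, while reconstruction keeps the two variants apart.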
This is still an issue with Apache 2.4, almost 15 years after this ticket was created. It's noteworthy that I came across this bug when looking at the Perl Plack::Middleware::ETag module, which returns malformed ETag headers that are not quoted. When using a fixed version of the module, I found that the Apache reverse proxy was modifying the ETag headers.