Bug 39727 - Incorrect ETag on gzip:ed content
Status: ASSIGNED
Product: Apache httpd-2
Classification: Unclassified
Component: mod_deflate
Version: 2.2.8
Hardware: All All
Importance: P1 normal with 17 votes
Target Milestone: ---
Assigned To: Apache HTTPD Bugs Mailing List
Keywords: RFC
Duplicates: 49358
Depends on:
Blocks: 45023 47253
Reported: 2006-06-05 23:43 UTC by Henrik Nordstrom
Modified: 2014-04-30 10:01 UTC

Attachments
patch that'll cause mod_filter to unset the etag - see util_filter.h (1.03 KB, patch)
2006-06-06 00:43 UTC, Nick Kew
patch that fixes the etags transformed by mod_deflate to be quoted strings (1.17 KB, patch)
2008-12-24 14:59 UTC, Paritosh Shah
fix etag checking in content handlers by stripping "-gzip" from etags in if headers (1.79 KB, patch)
2008-12-26 13:32 UTC, Paritosh Shah

Description Henrik Nordstrom 2006-06-05 23:43:49 UTC
Entities gzip:ed by mod_deflate still carry the same ETag as the plain entity,
causing inconsistency in ETag aware proxy caches.

It is very important each unique entity carries unique ETag:s as these identify
the specific entity variant of the URL. Each negotiated variant (where
Accept-Encoding is just one negotiation parameter) needs to have unique ETag:s.
For mod_deflate it's as simple as adding the encoding to the already computed ETag.

This has implications on at least the following HTTP request headers:

   If-None-Match  used in Vary negotiation from ETag aware caches
   If-Range       ranges in gzip:ed entity obviously not the same as ranges in
                  the plain entity
   If-Match       mainly conditional PUT requests


Example HTTP responses from an Apache-2.2.2 mod_deflate enabled server
(irrelevant headers pruned):

Plain request

   Server: Apache/2.2.2 (Fedora)
   ETag: "76e23-1835-4156af5e53ac0"
   Content-Length: 6197
   Vary: Accept-Encoding,User-Agent
   

Same request with "Accept-Encoding: gzip":

   Server: Apache/2.2.2 (Fedora)
   ETag: "76e23-1835-4156af5e53ac0"
   Vary: Accept-Encoding,User-Agent
   Content-Encoding: gzip
   Content-Length: 1829



Implications of this:

  * Clients may be given the incorrect response. In effect the first cached
response is given to all clients, as If-None-Match indicates the entity is OK
for all clients. (The same ETag is used in both responses -> same If-None-Match
request, so mod_deflate cannot tell whether the If-None-Match condition is on a
compressed or plain entity.)

  * Clients doing range requests with If-Range may end up with corrupted objects
containing part compressed part plain content.


Squid-2.6 and later is ETag aware and will make this problem quite visible.
Release date for Squid-2.6 is 1 July (i.e. in less than a month).
Comment 1 Nick Kew 2006-06-06 00:43:33 UTC
Created attachment 18407
patch that'll cause mod_filter to unset the etag - see util_filter.h

This needs more discussion before committing this or any other patch.
Comment 3 Ruediger Pluem 2006-06-06 20:49:19 UTC
(In reply to comment #1)
> Created an attachment (id=18407) [edit]
> patch that'll cause mod_filter to unset the etag - see util_filter.h
> 
> This needs more discussion before committing this or any other patch.

Sorry for my confusion, but this will only work if mod_deflate is used via
mod_filter, right?
It will not work if mod_deflate is used without mod_filter. I guess for this
case it is needed to unset the ETag header inside mod_deflate. So something like
the following:

Index: mod_deflate.c
===================================================================
--- mod_deflate.c       (revision 411469)
+++ mod_deflate.c       (working copy)
@@ -389,6 +389,7 @@
             apr_table_mergen(r->headers_out, "Content-Encoding", "gzip");
         }
         apr_table_unset(r->headers_out, "Content-Length");
+        apr_table_unset(r->headers_out, "ETag");

         /* initialize deflate output buffer */
         ctx->stream.next_out = ctx->buffer;
Comment 4 Nick Kew 2006-06-06 21:40:29 UTC
(In reply to comment #3)

> Sorry for my confusion, but this will only work if mod_deflate is used via
> mod_filter, right?

Yes.  I mentioned that to the reporter on IRC, but not here.

>          apr_table_unset(r->headers_out, "Content-Length");
> +        apr_table_unset(r->headers_out, "ETag");

Ugh.  That way every filter has to reinvent protocol handling.  A fertile 
breeding ground for bugs (and we have a history to prove it).  mod_filter is 
designed to centralise that, so we only need to get the protocol right once.
Comment 5 Ruediger Pluem 2006-06-06 22:02:11 UTC
(In reply to comment #4)
> >          apr_table_unset(r->headers_out, "Content-Length");
> > +        apr_table_unset(r->headers_out, "ETag");
> 
> Ugh.  That way every filter has to reinvent protocol handling.  A fertile 
> breeding ground for bugs (and we have a history to prove it).  mod_filter is

Yes, and I am pretty sure we have this history :-), BUT mod_filter is not
mandatory to use.
 
> designed to centralise that, so we only need to get the protocol right once.

Agreed, but then we must make the use of mod_filter (or at least the usage of
these parts) mandatory or must incorporate them into the core filter routines.
Comment 6 Henrik Nordstrom 2006-06-07 03:57:53 UTC
From a protocol perspective removing the ETag is sufficient to make you
compliant. If conditionals (If-xxx) don't work right on transformed responses
anyway, there is not much benefit in sending an ETag out.

But if you can it's better if you send an ETag. As I said initially you don't
need to compute a new etag, just adding some extra detail to the tag is fine.

I.e. "638f3e-6-1b6d6340-gzip" or similar for a gzip:ed entity where the base
entity had the etag "638f3e-6-1b6d6340".  To HTTP the etag is just a string with
the only requirement that it must be unique for each entity variant of the same
URL.

Actually I think adding details to the ETag may simplify many things for you as
the core routines then can make quick assessments of conditionals if it's
possible to infer information about how the object has been processed from
looking at the entity tag.
Comment 7 Henrik Nordstrom 2006-11-20 01:11:19 UTC
Any progress on getting this patch (or another reasonable alternative) into the
mod_deflate tree?
Comment 8 Nick Kew 2006-11-20 01:23:19 UTC
(In reply to comment #7)
> Any progress on getting this patch (or another reasonable alternative) into the
> mod_deflate tree?

It needs raising on dev@ so we can reach a consensus solution.  Bugzilla has only proved that we have 
more than one competing solution.
Comment 9 Roy T. Fielding 2006-12-06 16:30:13 UTC
This needs to be fixed by mod_deflate producing a new etag.  How we do that
is going to take some investigation, since it doesn't do any good to produce
the etag unless we can also check it on conditional requests.
Comment 10 Henrik Nordstrom 2006-12-07 14:19:16 UTC
My suggestion is to simply extend the existing etag with a gzip marker, for
example adding ;gzip at the end or something like that.

I.e. if the original reply had

ETag: "6bf1f7-6-1b6d6340"

Then make mod_deflate translate this to

ETag: "6bf1f7-6-1b6d6340;gzip"

This should allow for easy bidirectional mapping, simplifying most conditionals
as no transformation of the entity body is needed to find the etag, and the
simple format makes it easier to trace should any misunderstandings occur.
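
The bidirectional mapping suggested above can be sketched in C roughly as follows. This is a minimal illustration and not code from mod_deflate: the function names, the GZIP_MARKER macro and the caller-supplied output buffers are assumptions made for the example, and only strong, quoted ETags are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical marker, following the ";gzip" suggestion in this comment. */
#define GZIP_MARKER ";gzip"

/*
 * Write the gzip variant of a strong, quoted entity tag into out.
 * The marker is placed inside the closing quote.
 * Returns 0 on success, -1 if etag is not a quoted string or out is too small.
 */
int etag_add_gzip_marker(const char *etag, char *out, size_t outlen)
{
    size_t len, mlen = strlen(GZIP_MARKER);

    assert(etag != NULL && out != NULL);
    len = strlen(etag);
    if (len < 2 || etag[0] != '"' || etag[len - 1] != '"')
        return -1;
    if (len + mlen + 1 > outlen)
        return -1;
    memcpy(out, etag, len - 1);               /* opening quote and opaque value */
    memcpy(out + len - 1, GZIP_MARKER, mlen); /* marker before the closing quote */
    out[len - 1 + mlen] = '"';
    out[len + mlen] = '\0';
    return 0;
}

/*
 * Reverse mapping: recover the base entity tag from a marked one.
 * Returns 0 on success, -1 if etag does not carry the marker.
 */
int etag_strip_gzip_marker(const char *etag, char *out, size_t outlen)
{
    size_t len, mlen = strlen(GZIP_MARKER);

    assert(etag != NULL && out != NULL);
    len = strlen(etag);
    if (len < mlen + 2 || etag[0] != '"' || etag[len - 1] != '"')
        return -1;
    if (memcmp(etag + len - 1 - mlen, GZIP_MARKER, mlen) != 0)
        return -1;
    if (len - mlen + 1 > outlen)
        return -1;
    memcpy(out, etag, len - 1 - mlen);
    out[len - 1 - mlen] = '"';
    out[len - mlen] = '\0';
    return 0;
}
```

Because neither direction needs the entity body, a conditional request carrying the marked tag could in principle be mapped back to the base tag before the usual ETag comparison runs.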
Comment 11 Henrik Nordstrom 2007-08-27 05:25:52 UTC
Pinging dev@ one more time..
Comment 12 Nick Kew 2007-10-02 04:52:22 UTC
Just committed a fix to make any ETag weak if we transform the entity. 
Hopefully this should fix protocol compliance (and our users) without being
controversial.
Comment 13 Henrik Nordstrom 2007-10-02 11:30:50 UTC
Not sufficient. The two versions are not semantically equivalent, as one cannot be
exchanged for the other without breaking the protocol. In the context of
If-None-Match the weak comparator is used in HTTP, and there a strong ETag is
equal to a weak ETag.
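
For reference, the two comparison functions of RFC 2616 section 13.3.3 can be sketched as below. The helper names are hypothetical and the code assumes well-formed entity tags. It only illustrates why marking the ETag weak does not keep an If-None-Match condition from matching.

```c
#include <assert.h>
#include <string.h>

/* Skip the "W/" weakness prefix, if present. */
const char *etag_opaque(const char *etag)
{
    assert(etag != NULL);
    return (strncmp(etag, "W/", 2) == 0) ? etag + 2 : etag;
}

/* Weak comparison: the opaque tags must match; weakness flags are ignored. */
int etag_weak_match(const char *a, const char *b)
{
    return strcmp(etag_opaque(a), etag_opaque(b)) == 0;
}

/* Strong comparison: both tags must be strong and identical. */
int etag_strong_match(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strncmp(a, "W/", 2) != 0 && strncmp(b, "W/", 2) != 0
        && strcmp(a, b) == 0;
}
```

Under the weak comparison, W/"76e23-1835-4156af5e53ac0" and "76e23-1835-4156af5e53ac0" match, so the plain and compressed responses remain indistinguishable to a cache unless the opaque part itself differs.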
Comment 14 Ruediger Pluem 2007-10-02 11:51:52 UTC
Can you elaborate in more detail why you think that the two versions are not
semantically equivalent? I read 13.3.3 in a way that they are.
Comment 15 Henrik Nordstrom 2007-10-02 12:22:16 UTC
Because you cannot exchange the gzip:ed variant with the identity encoded
variant without causing breakage. The two do not mean the same thing to a
recipient that does not know how to handle gzip.

The two are only semantically equivalent for a recipient capable of handling
gzip, but not to HTTP in general, as HTTP does not guarantee clients can handle gzip.

If they were semantically equivalent then there would be no need for conditional
mod_gzip compression, or the use of Vary, at least not other than to reduce the
load on the server under peak load...
Comment 16 Henrik Nordstrom 2007-10-02 12:30:37 UTC
What you can do is to either

a) Drop the ETag completely. This is not optimal but works.

b) Or modify the ETag value in some manner. For example adding a constant string
in front of or after the original ETag.

In 'b', if the compression is not deterministic (not always resulting in the same
encoding) then the ETag should additionally be made weak, to make sure no one
attempts merging partial responses down the line.



The main downside of 'a' is that ETag aware caches will then cache multiple
copies of the same object, one for each slight variance of the headers indicated
by Vary. For Apache itself it's not such a big difference until conditional requests
work properly in the presence of filters like mod_deflate (i.e. If-None-Match).
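
Option 'b', including the extra weakening for non-deterministic compression, might look like the sketch below. The function name, the "-gzip" suffix and the deterministic flag are illustrative assumptions, not an existing httpd API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Derive the ETag of the compressed variant from a strong, quoted ETag.
 * If the compressor is not deterministic, the result is additionally
 * marked weak with a "W/" prefix.
 * Returns 0 on success, -1 on malformed input or insufficient space.
 */
int etag_for_gzip_variant(const char *etag, int deterministic,
                          char *out, size_t outlen)
{
    size_t len;
    int n;

    assert(etag != NULL && out != NULL);
    len = strlen(etag);
    if (len < 2 || etag[0] != '"' || etag[len - 1] != '"')
        return -1;
    /* %.*s copies the tag without its closing quote. */
    n = snprintf(out, outlen, "%s%.*s-gzip\"",
                 deterministic ? "" : "W/", (int)(len - 1), etag);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Either way the opaque part differs from that of the plain entity, which is what keeps the two variants apart. The weakness flag only signals that partial responses must not be merged.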
Comment 17 Nick Kew 2007-10-02 12:54:58 UTC
(In reply to comment #15)
> Because you cannot exchange the gzip:ed variant with the identity encoded
> variant without causing breakage. The two do not mean the same thing to a
> recipient that does not know how to handle gzip.

Bugzilla is the wrong place for this discussion.  Should be on dev@httpd.

Only a recipient that can handle gzip will be served the gzipped version.

> The two are only semantically equivalent for a recipient capable of handling
> gzip, but not to HTTP in general, as HTTP does not guarantee clients can handle gzip.

HTTP provides a separate mechanism for negotiating that.

> 
> If they were semantically equivalent then there would be no need for conditional
> mod_gzip compression, or the use of Vary, at least not other than to reduce the
> load on the server under peak load...

Huh?  Those exist precisely because we need to cater for different clients.
Comment 18 Henrik Nordstrom 2007-10-02 15:10:37 UTC
(In reply to comment #17)

> Only a recipient that can handle gzip will be served the gzipped version.

Which isn't true due to this bug. If there is an ETag aware cache between the
client and Apache the client will be given whatever the previous client could
handle.

> Huh?  Those exist precisely because we need to cater for different clients.

Exactly.
Comment 19 Nick Kew 2007-10-03 05:18:50 UTC
(In reply to comment #18)
> (In reply to comment #17)
> 
> > Only a recipient that can handle gzip will be served the gzipped version.
> 
> Which isn't true due to this bug. If there is an ETag aware cache between the
> client and Apache the client will be given whatever the previous client could
> handle.

The intermediate got a weak ETag.  So the intermediate has been told that the
entity is equivalent but not byte-by-byte identical, and may be subject to
negotiated transformation.  Therefore the intermediate is responsible for
dealing with content-negotiated properties.

Do you have a particular intermediate in mind, when you propose something that
treats a weak ETag as strong?
Comment 22 Maxime Ritter 2008-09-11 06:00:58 UTC
This fix needs improvements. The ETag needs to be quoted; this fix adds -gzip outside the quotes, so I get things like "638f3e-6-1b6d6340"-gzip, which is ugly and not very RFC compliant.

Now, another problem I have. I have 2 servers with mod_deflate and mod_cache. With mod_cache, I get validation 

-> If behind the same squid proxy I have servers with DeflateCompressionLevel set to 1, and other ones with DeflateCompressionLevel 7, and mod_cache enabled, I get validation on different contents.
(yes, I know that's a strange setup).
Comment 23 Paritosh Shah 2008-12-24 14:59:45 UTC
Created attachment 23050
patch that fixes the etags transformed by mod_deflate to be quoted strings

Addresses the issue raised by Maxime Ritter - currently the Etags that are transformed by mod_deflate are not properly quoted and are not RFC compliant.
Comment 24 Paritosh Shah 2008-12-26 13:32:50 UTC
Created attachment 23051
fix etag checking in content handlers by stripping "-gzip" from etags in if headers

A problem with adding "-gzip" to etags is that it breaks etag checking in If-* headers for content handlers ( e.g. mod_dav ) which will not recognize the "-gzip" etag as a valid etag for any entity of the resource. One way to fix this is to strip the "-gzip" suffix from the etags in If-None-Match and If-Match request headers. Attaching a patch to achieve this. It implements a fixup hook in mod_deflate and fixes etags in the respective headers. The patch has been tested with mod_dav_fs for If-Match and If-None-Match headers, with and without gzip encoding.

Note: this patch depends on the previous patch (https://issues.apache.org/bugzilla/attachment.cgi?id=23050) having already been applied.
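
The stripping step described here can be illustrated with the standalone C sketch below. It is not the attached patch: the function name and the buffer handling are assumptions, and it simplifies by ignoring backslash escapes inside quoted strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy an If-Match / If-None-Match value into out, dropping a trailing
 * -gzip inside each quoted entity tag. Text outside quotes is copied as is.
 * Returns 0 on success, -1 if out is too small.
 */
int strip_gzip_suffix(const char *value, char *out, size_t outlen)
{
    static const char suffix[] = "-gzip";
    const size_t slen = sizeof(suffix) - 1;
    size_t o = 0;        /* write position in out */
    size_t start = 0;    /* position in out just after the opening quote */
    int in_quotes = 0;

    assert(value != NULL && out != NULL);
    if (outlen == 0)
        return -1;
    for (; *value != '\0'; value++) {
        if (*value == '"') {
            if (!in_quotes) {
                start = o + 1;
            } else if (o - start >= slen
                       && memcmp(out + o - slen, suffix, slen) == 0) {
                o -= slen;   /* drop the suffix before the closing quote */
            }
            in_quotes = !in_quotes;
        }
        if (o + 1 >= outlen)
            return -1;       /* keep room for the terminating NUL */
        out[o++] = *value;
    }
    out[o] = '\0';
    return 0;
}
```

A header value listing several tags is handled in one pass, and a `*` value is copied unchanged.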
Comment 25 Lars Eilebrecht 2009-02-02 15:27:04 UTC
Committed patch to trunk fixing the creation of invalid Etag headers
such as 

  Etag: "2106e9-2c-3e9564c23b60"-gzip

instead of 

  Etag: "2106e9-2c-3e9564c23b60-gzip"

mod_deflate ignores invalid Etag headers not starting with a double quote,
and weak Etag headers starting with "W/". 

http://svn.apache.org/viewvc?view=rev&revision=740149
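
The difference between the two header values above can be checked mechanically. The following C sketch uses a hypothetical helper, not an httpd function. It accepts only values of the form [W/]"opaque" and, as a simplification, rejects any double quote inside the opaque part.

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if value has the form  [W/]"opaque"  with no stray characters
 * before or after the quoted string, else 0.
 */
int etag_is_well_formed(const char *value)
{
    size_t len;

    assert(value != NULL);
    if (strncmp(value, "W/", 2) == 0)
        value += 2;
    len = strlen(value);
    if (len < 2 || value[0] != '"' || value[len - 1] != '"')
        return 0;
    /* No additional quote may appear between the delimiters. */
    return memchr(value + 1, '"', len - 2) == NULL;
}
```

With this check, "2106e9-2c-3e9564c23b60"-gzip is rejected because the value does not end in a double quote, while "2106e9-2c-3e9564c23b60-gzip" is accepted.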
Comment 26 Henrik Nordstrom 2009-02-03 00:24:45 UTC
What do you mean by ignore weak etags?

If there is a weak ETag then it needs to be transformed as well, or removed. If not you'll still crash caches out there as object variants are identified by their ETag.
Comment 27 Ruediger Pluem 2009-02-03 03:13:08 UTC
(In reply to comment #26)
> What do you mean by ignore weak etags?
> 
> If there is a weak ETag then it needs to be transformed as well, or removed. If
> not you'll still crash caches out there as object variants are identified by
> their ETag.
> 

This is debatable. A weak ETag IMHO doesn't mean that both entities with the same weak ETag are the same on a binary level. But the weak ETag only changes when the meaning of the entity changes (13.3.3, 3rd paragraph).
Comment 28 Lars Eilebrecht 2009-02-03 04:34:33 UTC
(In reply to comment #26)
> What do you mean by ignore weak etags?

Well, the original code was only adding the gzip marker when the
Etag was not starting with "W/", and I didn't change this behavior.
Comment 29 Henrik Nordstrom 2009-02-03 06:17:14 UTC
A weak ETag means the two are interchangeable for the same request (semantically equivalent) but may differ significantly at the octet level.

A gzip encoded entity and an identity encoded entity are not interchangeable without serious breakage.
Comment 30 Henrik Nordstrom 2009-02-03 06:23:55 UTC
Just to clarify the breakage:

GET /some-object

HTTP/1.1 200 OK
Vary: Accept-Encoding
ETag: W/"a"

GET /some-object
Accept-Encoding: gzip
If-None-Match: W/"a"

HTTP/1.1 304 Not Modified
ETag: W/"a"


If you are unsure what this is about, see 13.6  Caching Negotiated Responses.


To explain it in other words: Two resource versions MAY share the same weak ETag (but MUST NOT when using a strong ETag), but two incompatible resource representations MUST NOT.
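
The cache side of this exchange can be modelled in a few lines of C. The struct and function below are a hypothetical, heavily simplified cache and not Squid or mod_cache code. The point is only that a 304 response selects a stored variant by entity tag and nothing else.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One stored response for a URL; a hypothetical, minimal cache entry. */
struct variant {
    const char *etag;
    const char *content_encoding;   /* "identity" or "gzip" */
};

/*
 * A 304 response only names an entity tag. The cache answers the client
 * with the stored variant carrying that tag, or NULL if it has none.
 */
const struct variant *select_on_304(const struct variant *stored, size_t n,
                                    const char *etag_from_304)
{
    size_t i;

    assert(stored != NULL && etag_from_304 != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(stored[i].etag, etag_from_304) == 0)
            return &stored[i];
    }
    return NULL;
}
```

If the only stored variant is gzip encoded and tagged W/"a", a 304 carrying W/"a" makes the cache hand out that gzip body whatever Accept-Encoding the new client sent. With distinct tags such as W/"a" and W/"a-gzip" (an assumed naming), the lookup returns the matching encoding.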
Comment 31 Roy T. Fielding 2009-02-12 17:08:56 UTC
The HTTP syntax error has been fixed in trunk, but the problem
motivating this report is a no-win situation no matter how it is
"fixed".  The only good answer is "don't use mod_deflate" because
changing content-encoding on the fly in an inconsistent manner
(neither "never" nor "always") makes it impossible for later
requests regarding that content (e.g., PUT or conditional GET)
to be handled correctly.  This is, of course, why performing
on-the-fly content-encoding is a stupid idea, and why I added
Transfer-Encoding to HTTP as the proper way to do on-the-fly
encoding without changing the resource.

mod_deflate is written as a content filter that can be arbitrarily
added to the output chain after the request is processed, just
before the body goes out on the wire.  If mod_deflate modifies
ETag on the way out, then its corresponding later requests must
be reverse-modified (etags and request content) on the way back.

The problem here is that the DEFLATE filter is usually
added after the request is processed, based on the media type
of the response, so there is no clear way of selectively inflating
a corresponding PUT or conditional request before the request
processing is begun, especially if the request has been
proxied to another server.  We would have to add a corresponding
input filter whenever the output filter is configured and ensure
that it would activate under the same conditions, based on the
request header fields, as DEFLATE/INFLATE does for responses.
I am still looking at this option.

Preprocessing all incoming conditional headers to remove
a -gzip suffix before the request is processed won't work.
In a chain of Apache servers, we won't know which server
set the suffix and how many caches have stored the modified
ETag versus the unmodified ETag.  We can't add some random
unique id to the suffix, either, since we need the tag to
persist across restarts.  In any case, that solution becomes
so complex that we are better off deleting the module.

Finally, we can't just remove the ETag because then the
unfiltered content has an ETag but the filtered content
does not, which puts us back to the point of messing up
a cache that is checking the 304 response for consistency.
Likewise, removing etags for the entire configured scope
allows clients to use the last-modified timestamp for range
requests, which would be just as bad as not changing ETag.

The best solution is to implement transfer-encoding as an
http protocol filter module.
Comment 32 Henrik Nordstrom 2009-02-12 18:42:13 UTC
Deleting the ETag+Content-Location is safe with respect to caches even if very suboptimal in terms of HTTP performance and cache validation robustness. It's highly undesirable, but still better than sending out the same ETag or Content-Location on incompatible responses.

Sending responses with an ETag and/or Content-Location which MAY be shared by an incompatible response for the same URL makes a true mess for caches. This applies to both 200 and 304 responses. The mess gets injected by the processing of 304 responses, which may make incompatible but identically identified content migrate between different requests.

And yes, Roy is absolutely right. HTTP is not well suited for on-the-fly content recoding. You can't have your cake and eat it too unless you put in a lot of effort, far beyond the recoding itself. To anyone external to the server it SHOULD look like the recoding is in fact done statically with different representations stored on the server (i.e. page.html and page.html.gz) as negotiated by mod_negotiation. That means unique ETags and unique Content-Location, plus the If-* conditions working properly for all combinations.

Not impossible, but not easy either.
Comment 33 Edgar Ehritt 2009-05-23 05:22:29 UTC
Hi,


I refer to https://issues.apache.org/bugzilla/show_bug.cgi?id=47253 . In
this situation, IMHO, fixing it this way is a retrograde step. There is a
question I would like to ask here:

Why don't you define a policy?

   - The ETag is formed as usual.
   - The configuration tree that handles a resource adds a fourth digest part
     to the ETag value.
   - Each filter transforming content (on the fly) has to add a suffix.

I imagine each handler has a unique key (let's say mod_deflate "DE" and
mod_include "IN" etc.). Configuration tree means a string consisting of
all keys [ simple: next = etag_uint64_to_hex(next, "IN:CH:DE"); ]. So a
change of configuration will be unique in the ETag as well. But sensitive
configuration information should be encoded twice (the first time irreversibly,
by including a static unique ID of the environment) to withhold it from
clients. The configuration tree is already being parsed, so a procedure
like that adds only a small overhead.

Before https://svn.apache.org/viewcvs.cgi?view=rev&rev=761835 it worked fine.
You only need to remove all added suffixes and then check conditional
requests. Before the change, ETag values were composed this way:
"%{resource_digest}"-suffix
Only check %{resource_digest}. (modules/http/http_protocol.c; function
ap_meets_conditions(); line 270)

The worst is yet to come:
I see no way to account for mod_ext_filter. The handler module should remove
all tokens from the Cache-Control response header that make the response
cacheable, and maybe statically add tokens to forbid caching.

I'm sorry. English is not my native language. ;-\


With best regards from Berlin
eddi
Comment 34 Robert Collins 2010-05-18 17:42:42 UTC
We're encountering this problem using Apache as a front end SSL / compressing accelerator. Is there any chance of getting a patch in to permit stripping incoming ETags of their -gzip suffix based on a configuration option? In our topology we know precisely where -gzip is added, and thus how to strip it safely; we'd rather strip it in the stage matching where it was added outbound, rather than at a different step.

I realise that this isn't theoretically complete, but crucially for us, it would do the job reliably.
Comment 35 Rainer Jung 2010-05-29 08:25:05 UTC
*** Bug 49358 has been marked as a duplicate of this bug. ***
Comment 36 Oliver Siegmar 2011-09-14 11:51:55 UTC
Any news about this bug, now more than five years old?
Comment 37 Andrey Chernov 2012-08-09 22:42:23 UTC
Since different gzip compression levels are semantically equal, you can _always_ send weak tag when using gzip with any compression level, with -gzip suffix like that:

ETag: W/"76e23-1835-4156af5e53ac0-gzip"
Comment 38 Anshul 2013-07-17 16:45:45 UTC
The server response contains a proper Vary header clearly indicating that the response varies depending on whether the client is able to accept gzip content or not.

In that case, the responsibility lies with the intermediate proxy to make sure all conditional header checks are met before sending a cached response for an ETag.

For the implications mentioned in the description:
* The repeat request from the same client will have the same value for the "Accept-Encoding" header as well as the User-Agent string, meaning Apache has sufficient information to decide whether to send a plain text or gzipped response. If-None-Match can have the same ETag value in both cases and the server should still have no problem deciding which response to send.

* The above logic covers range queries as well.
Comment 39 Henrik Nordstrom 2013-07-18 06:23:24 UTC
This has already been discussed to death and still comes back...

There is no escape from the rule that each variant of a given URL MUST have a unique ETag value, or none at all. How the ETag value is formed is entirely up to the server implementation and may carry any amount of unstructured and structured data as needed by the server to uniquely identify a variant.

Weak ETags have slightly different rules but are irrelevant to this discussion. They apply e.g. when using dynamic adjustment of gzip encoding levels based on CPU load, but not to identity vs gzip encoded variants, which are semantically different.

(In reply to Anshul from comment #38)
> The server response contains a proper Vary header clearly indicating that the
> response varies depending on whether the client is able to accept gzip content
> or not.

No. The server has sent a Vary header indicating that the server's variant selection depends on the content of the Accept-Encoding header in the request, and quite often User-Agent as well.

> In that case, the responsibility lies with the intermediate proxy to make
> sure all conditional header checks are met before sending a cached response
> for an ETag.

No. It's the origin server's responsibility to perform variant selection. Caches use If-None-Match to ask the server which variant among a set of known cached variants of the requested URL should be used in response to unknown request combinations. The response to such requests ONLY says "Use the variant with ETag XXXX".

Semantically transparent proxies are not allowed to guess what variant selection preferences the server has, e.g. which browsers it has blacklisted content of type X for, or which browsers the server knows handle gzip content encoding when there is no Accept-Encoding header present.

gzip compression is only a tiny little bit of server side variant selection. The same mechanism for selecting the correct variant among a set of cached variants of a URL is used for a wide variety of response variance (selection of content-encoding, content-language, content-type, browser based, custom headers, etc.).

Note: Apache mod_negotiation does the right thing in all cases known. Issues only arise with dynamic content encoding with mod_deflate (and a number of other similar modules performing dynamic content transformation) often forgetting about the meaning of ETag and its relation to If-None-Match.
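
The rule stated at the top of this comment (a unique ETag per variant, or none at all) can be written down as a small check. The C function below is a hypothetical test helper, not part of httpd. A NULL entry stands for a variant sent without an ETag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * etags[i] is the ETag of variant i of one URL, or NULL if that variant
 * is sent without an ETag. Returns 1 if no two variants share a tag.
 */
int variants_have_unique_etags(const char *const *etags, size_t n)
{
    size_t i, j;

    assert(etags != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (etags[i] == NULL)
            continue;
        for (j = i + 1; j < n; j++) {
            if (etags[j] != NULL && strcmp(etags[i], etags[j]) == 0)
                return 0;
        }
    }
    return 1;
}
```

The two responses shown in the description fail this check, since both carry "76e23-1835-4156af5e53ac0". Adding a suffix to one of them, or dropping its ETag, makes the check pass.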
Comment 40 Mark Nottingham 2014-04-30 10:01:26 UTC
Just so folks know, the authoritative text on this topic will soon be:
  http://tools.ietf.org/html/draft-ietf-httpbis-p4-conditional-26