Bug 38642 - mod_rewrite adds path info postfix after a substitution occured
Summary: mod_rewrite adds path info postfix after a substitution occured
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_rewrite (show other bugs)
Version: 2.5-HEAD
Hardware: Other other
: P2 normal with 16 votes (vote)
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
: 33022 40598 41423 44922 46165 (view as bug list)
Depends on:
Blocks: 58573
  Show dependency tree
 
Reported: 2006-02-14 15:17 UTC by Bob Ionescu
Modified: 2018-03-26 22:51 UTC (History)
8 users (show)



Attachments
Rewritelog (2.19 KB, text/plain)
2006-02-14 15:18 UTC, Bob Ionescu
Details
Rush patch used some time ago against 2.0.55 (2.52 KB, patch)
2006-09-25 10:52 UTC, Bob Ionescu
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Bob Ionescu 2006-02-14 15:17:04 UTC
Testcase:
-> RewriteRules in per-dir context, /.htaccess (root)
-> used RewriteRules:

RewriteEngine on
RewriteCond %{REQUEST_URI} !^/(test|robots\.txt)
RewriteRule ^(.*) test/$1 [C]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^test/(.*)$ /test/index.php?q=$1 [L,QSA]

Description: Eyery Request should go to /test/... and furthermore if the file
does not exist in /test/..., it should go to /test/index.php?q=...

-> Request: /CSS/style.css
-> expected result: /test/index.php?q=CSS/style.css (all cond. are true)
-> real result: /test/index.php?q=CSS/style.css/style.css

Problem: see attached Rewritelog

The 1st RewriteRule rewrites the request to the filepath test/CSS/style.css
while the 2nd RewriteRule matches now against test/CSS/style.css/style.css and
not against the expected filepath test/CSS/style.css, because the
"/style.css"-part is handled as path info, which is added again to the new filepath.

I would expect once a substitution occurs, path info from the original request
isn't added anymore in this round of processing.
Comment 1 Bob Ionescu 2006-02-14 15:18:11 UTC
Created attachment 17689 [details]
Rewritelog
Comment 2 Ian Brandt 2006-05-23 16:43:54 UTC
A possible second example for verification purposes:

In per-directory .htaccess:

RewriteRule ^(.*)/index\.html$ $1/

The RewriteLog I see (stripped down and paraphrased for clarity):

(3) add path info postfix: /document/root/foo ->
/document/root/foo/index.html
(3) strip per-dir prefix: /document/root/foo/index.html -> foo/index.html
(3) applying pattern '^(.*)/index\.html$' to uri 'foo/index.html'
(2) rewrite foo/index.html -> foo/
(3) add per-dir prefix: foo/ -> /document/root/foo/
(3) add path info postfix: /document/root/foo/ ->
/document/root/foo//index.html
[...]

The expected behavior is that 'foo/index.html' would be rewritten to 'foo/',
however the "add path info postfix" step just adds it right back on.
Comment 3 Bob Ionescu 2006-09-25 10:41:11 UTC
*** Bug 33022 has been marked as a duplicate of this bug. ***
Comment 4 Bob Ionescu 2006-09-25 10:52:12 UTC
Created attachment 18910 [details]
Rush patch used some time ago against 2.0.55

I didn't port it into trunk since I hope that there's another method instead of
using r->notes.
Comment 5 Bob Ionescu 2007-01-20 07:29:45 UTC
*** Bug 41423 has been marked as a duplicate of this bug. ***
Comment 6 Zach Gorman 2007-02-22 19:09:50 UTC
This looks very similar to the old bug http://archive.apache.org/gnats/7879.
Comment 7 Charles H. 2007-12-19 03:27:30 UTC
I'm still experiencing this bug in 2.2.6.
Comment 8 Bob Ionescu 2008-08-14 02:42:02 UTC
*** Bug 44922 has been marked as a duplicate of this bug. ***
Comment 9 Seth Purcell 2008-11-12 07:31:34 UTC
I just wasted a good amount of time re-discovering and working around this bug (seven years after it was initially reported?!), as apparently many others continue to do (since it keeps popping up as a new bug) and I felt I should add my voice to the call to fix it, lest it continue to be neglected.

Why hasn't the mod_rewrite team fixed such a simple bug after so long? Why hasn't anyone even taken ownership of this bug (still has status NEW after 2.5 years on bugzilla?) Is there anyone currently working on mod_rewrite, anyone at all?

I don't think the masses of Apache users such as myself should have to apply a patch that's been out for more than two years - I'm not familiar enough with the code to immediately see the non-idealities of the patch, but if the patch is "good enough", please just roll it out and worry about perfecting it later. Your users will thank you.
Comment 10 Aleksander Budzynowski 2008-11-22 18:06:05 UTC
While I agree totally with Seth in that this bug should have been fixed long ago, we cannot simply rush a patch through. This needs to be done properly.

Because if there are any problems with the patch, who knows how long *they* will take to be addressed?

For the purpose of helping anyone who might fix this bug, I will supply some of the details that I provided when I reported the bug.


The problem:

If multiple RewriteRules within a .htaccess file match, unwanted copies of PATH_INFO may accumulate at the end of the URI.

In more depth:

When you make a request for a file in a directory that doesn't exist, Apache divides the URI into the "real" part (r->filename) and the "virtual" part (r->path_info). This has happened by the time per-directory (.htaccess) RewriteRules are ready to be processed, which means mod_rewrite has to put them back together to reconstruct the original URI.

This is what happens. r->path_info is appended to ctx->uri prior to *each* RewriteRule. If a RewriteRule (including its RewriteConds) does not match, ctx->uri is discarded and nothing bad happens. If a RewriteRule does match, however, the entire substitution is incorporated into r->filename.

But r->path_info is not changed! This means that subsequent rules will get an extra copy of PATH_INFO. If more rules match then this can get worse.

Note that PATH_INFO *should* be appended before the first matching RewriteRule, because it forms part of the URI. However, afterwards it should not be appended again.


Example:

This comes from a .htaccess file placed in DocumentRoot. It is supposed to replace all underscores in a URI with hyphens.

RewriteEngine On
RewriteBase /
RewriteRule ^(.*)_(.*)$ $1-$2 [N]

Make a request for "/_f_o_o_" and it will be correctly rewritten to "/-f-o-o-". (That's because PATH_INFO is empty.)

Make a request for "/_f_o_o_/bar" and it will be rewritten to "/-f-o-o-/bar/bar/bar/bar". (That is, unless you happen to have a _f_o_o_ directory, in which case PATH_INFO will be empty and the rewriting will work as desired.)

Note that there are four underscores but only three erroneous copies of  PATH_INFO - this is because the first time the rule matches, appending PATH_INFO is correct behaviour.

Make a request for "/foo/b_ar" and an infinite loop will ensue, since every time an underscore is replaced, a new one will be appended prior to the next rule.

(See my bug report at https://issues.apache.org/bugzilla/show_bug.cgi?id=44922 for a RewriteLog.)


The current patch:

Looks pretty good to me. It basically uses a flag to indicate whether a substitution has been made yet, and clears this flag when it ought to get "reset".

I would however check perdir *before* making each calls to apr_table_xxx, as these calls are relatively slow.

The flag is stored in r->notes - I can't see where better to put it. It can't go in the ctx struct because the scope of that struct is a single rewrite list. (This is bad because multiple .htaccess files can match a single request.) Although for performance, a "cache" of the flag could be stored in the ctx struct, and saved to r->notes in between rewrite lists.
Comment 11 Eric Covener 2008-11-22 18:40:14 UTC
> The flag is stored in r->notes - I can't see where better to put it. It can't
> go in the ctx struct because the scope of that struct is a single rewrite list.
> (This is bad because multiple .htaccess files can match a single request.)
> Although for performance, a "cache" of the flag could be stored in the ctx
> struct, and saved to r->notes in between rewrite lists.

I am leaning towards whacking r->path_info as soon as we know we've effectively made it useless by replacing r->filename in per-directory context, rather then thinking harder about when we should spill r->path_info into ctx->uri by saving info away in some previous step.

@@ -3981,6 +3986,11 @@
     /* Now adjust API's knowledge about r->filename and r->args */
     r->filename = newuri;
+
+    if (ctx->perdir && r->path_info && !(p->flags & RULEFLAG_KEEPPATHINFO)) {
+        r->path_info = '\0';
+    }

But I don't pretend to understand how someone might be relying on this path_info from the original request floating around, so the meaning of the flag might have to be inverted so that the current behavior is default.

Who knows how many .htaccess files are carefully working around this behavior by  e.g. not propogating pieces of the URL that look like the existing %{PATH_INFO} over in the substitution
Comment 12 Aleksander Budzynowski 2008-11-23 14:01:51 UTC
>> I am leaning towards whacking r->path_info as soon as we know we've effectively made it useless by replacing r->filename in per-directory context, rather then thinking harder about when we should spill r->path_info into ctx->uri by saving info away in some previous step.

That's exactly what I thought at first. But then I found this: http://213.11.80.10/manual/cgi_path.html

And I quote: "Apache 1.2 and later now determine SCRIPT_NAME and PATH_INFO by looking directly at the URL"

So maybe it's not such a good idea.
Comment 13 Eric Covener 2008-11-23 14:12:55 UTC
(In reply to comment #12)
> >> I am leaning towards whacking r->path_info as soon as we know we've effectively made it useless by replacing r->filename in per-directory context, rather then thinking harder about when we should spill r->path_info into ctx->uri by saving info away in some previous step.
> 
> That's exactly what I thought at first. But then I found this:
> http://213.11.80.10/manual/cgi_path.html
> 
> And I quote: "Apache 1.2 and later now determine SCRIPT_NAME and PATH_INFO by
> looking directly at the URL"
> 
> So maybe it's not such a good idea.
> 

I thought that since this is per-directory, PT-like behavior is implied so whenever we change the filename and re-inject into apache, so PATH_INFO is very soon to be "fixed" by the core.

The only time it lasts longer is when we have another rule or two to look at before re-injecting.
Comment 14 Seth Purcell 2008-11-23 20:25:44 UTC
I'm encouraged by the recent activity on this bug, but with all the discussion about possible approaches and their potential ramifications, and the link to the reference describing version 1.2 and later behavior, I'm left wondering: isn't there a set of unit test cases for this function to unequivocally define its requirements, and indicate when an acceptable solution has been found? This seems like a project that would have a large, if not exhaustive, stable of test cases built up by now.
Comment 15 Eric Covener 2008-11-24 04:46:03 UTC
(In reply to comment #14)
> isn't there a set of unit test cases for this function to unequivocally define
> its requirements, and indicate when an acceptable solution has been found? 

Nothing that can be relied upon for this purpose. Any current functional regression tests pass today, and we wouldn't pretend to define the requirements for something that could be relied on by any number of .htaccess files for such a long span of time.  There is also the issue of interaction with what third-party modules or CGI might expect.

Comment 16 Aleksander Budzynowski 2008-11-24 15:10:45 UTC
(In reply to comment #13)
> I thought that since this is per-directory, PT-like behavior is implied so
> whenever we change the filename and re-inject into apache, so PATH_INFO is very
> soon to be "fixed" by the core.
> 
> The only time it lasts longer is when we have another rule or two to look at
> before re-injecting.
 
I could be wrong, but the way I understand it, PATH_INFO is just derived from the original URL - before any rewriting occurs. And is never "fixed".
Comment 17 Bob Ionescu 2008-11-25 12:49:56 UTC
(In reply to comment #16)
> (In reply to comment #13)
> > I thought that since this is per-directory, PT-like behavior is implied so
> > whenever we change the filename and re-inject into apache, so PATH_INFO is very
> > soon to be "fixed" by the core.
> > 
> > The only time it lasts longer is when we have another rule or two to look at
> > before re-injecting.
> 
> I could be wrong, but the way I understand it, PATH_INFO is just derived from
> the original URL - before any rewriting occurs. And is never "fixed".

Yes, derived from the physical filename after the request has been mapped to the file system. I think Eric meant that it is fixed in the redirect processing (new file system mapping, new path info for the internal redirect processing).

I think erasing path_info is not a good idea. Even if it seems to be useless (since you can check the URL-path), someone (either content handlers or e.g. RewriteConditions) may relay on that ENV. That's why I've chosen to leave it untouched but prevent further appending if a substitution occurred.

I cannot imagine a case where "not appending path_info after a subst. was applied but leaving r->path_info untouched" can break any case – either at server/rewrite rule side nor at script side. No one uses a construct of rewrite rules, which expects continuing appending of path_info – and I've seen a lot of rules in a german discussion forum about mod_rewrite. It's the opposite. There were questions "what went wrong" when it comes to the behavior described in this PR.

(In reply to comment #10)
> I would however check perdir *before* making each calls to apr_table_xxx, as
> these calls are relatively slow.

As I wrote – rush.  :-) I think most calls were made in sections which are touched by the fix-up hook (i.e. directory context) only. May be a check for path_info != NULL would be better.
Comment 18 Aleksander Budzynowski 2008-11-25 14:04:41 UTC
> for something that could be relied on by any number of .htaccess files for such
> a long span of time.

I doubt many .htaccess rely on [that is, work-around] current behaviour, because:
-The behaviour differs from httpd.conf behaviour,
-is undocumented, and
-when I posted about this to the (English) Apache users mailing list, nobody knew about the issue.

> As I wrote – rush.  :-) I think most calls were made in sections which are
> touched by the fix-up hook (i.e. directory context) only. May be a check for
> path_info != NULL would be better.

Hmm, you're right for the most part. I was thinking you should have done:

if (perdir != NULL && apr_table_get(r->notes,"substapplied") &&

instead of
if (apr_table_get(r->notes,"substapplied") && perdir != NULL &&

except the latest code is structured a bit differently (the whole block already has a perdir check) so this is irrelevant.
Comment 19 Bob Ionescu 2008-12-14 06:15:20 UTC
*** Bug 46165 has been marked as a duplicate of this bug. ***
Comment 20 Eric Covener 2008-12-19 05:08:01 UTC
I committed an explicit flag to discard PATH_INFO during per-directory processing:

http://svn.apache.org/viewvc?rev=728015&view=rev

I'm a bit too concerned about breaking convoluted rulesets by changing the default behavior. I did consider changing the default in trunk/2.3.x and having an opt _out_ flag but then the 2.2.x/2.3.x differences would be a pain.

I would appreciate review of the doc in the commit (in viewsvn or URL when it's available) because it's difficult to find the right level of detail:

http://httpd.apache.org/docs/2.3/mod/mod_rewrite.html#rewriteflags
Comment 21 Bob Ionescu 2008-12-19 13:22:40 UTC
Another flag (to fix just a bug) makes mod_rewrite yet more complex (which is IMHO a problem). It is good that there is a way to avoid the bug, but as you can see from the last duplicate of this bug, I think many people trapping into this bug are not aware about what causes the problem or what path-info is at all. Further more, many don't have access to the server config and can't get a Rewritelog.

I still consider this as a bug and not as a feature, which means that a bug shouldn't need options to keep it.

There's a small typo: "or the resulf".

If it should be kept that way, consider adding a note to the N-flag about the new flag as well, because this bug is deadly in combination with the N-flag.
Comment 22 Bob Ionescu 2008-12-19 13:23:34 UTC
*** Bug 40598 has been marked as a duplicate of this bug. ***
Comment 23 Aleksander Budzynowski 2008-12-19 17:29:18 UTC
In regards to clobbering PATH_INFO:

From my experiments, it would seem that PATH_INFO is in fact "regenerated" after a mod_rewrite ruleset has been processed. I'm not quite sure at what stage this occurs, but this would mean that, at least, CGI scripts will not be affected by the change.

The only place I can think of that will be affected is RewriteRules and RewriteConds that use %{PATH_INFO} themselves. But I would say they are relying on undocumented behaviour - in any ruleset that allows more than one substitution to occur, the URL goes through a "partial" state and what PATH_INFO should contain at that point isn't really stated anywhere.

So I say forget the flag and change default behaviour - documenting it clearly, of course.

If someone really wants to preserve PATH_INFO, they can use the env= flag to store it in another variable before making any substitutions.


In terms of the code itself:

>   r->path_info = "\0";

Apache will automatically clean up the old string when the request is done, right? (Just sanity checking here.)

Also, why not just use
    r->path_info = NULL;
? All the rest of the mod_rewrite code supports this. If you have a good reason not to use NULL, then you could just use this:
    r->path_info = "";

Comment 24 Eric Covener 2008-12-19 18:04:54 UTC
(In reply to comment #23)
> In regards to clobbering PATH_INFO:
> 
> From my experiments, it would seem that PATH_INFO is in fact "regenerated"
> after a mod_rewrite ruleset has been processed. I'm not quite sure at what
> stage this occurs, but this would mean that, at least, CGI scripts will not be
> affected by the change.

This is the core directory walk processing that happens shortly after the rewritten URL is reinserted.

> 
> So I say forget the flag and change default behaviour - documenting it clearly,
> of course.

I think that definitely doesn't fly for 2.2 because I think it's quite easy to end up with a ruleset that works under

Changing the behavior between releases has complications too because you have to then know where your htaccess is running.

Ultimately I'm just more comfortable A) not regressing anyone with a legacy ruleset lurking out in htaccess and B) making the choice explicit in the rule after consulting the doc.


> Apache will automatically clean up the old string when the request is done,
> right? (Just sanity checking here.)

yes, everything in the request_rec is allocated out of a short lived (lifetime of a request) APR pool.

> 
> Also, why not just use

thanks, corrected.
Comment 25 Aleksander Budzynowski 2008-12-19 20:46:28 UTC
> Ultimately I'm just more comfortable A) not regressing anyone with a legacy
> ruleset lurking out in htaccess and B) making the choice explicit in the rule
> after consulting the doc.

I can see where you're coming from, but in my opinion, placing extra burden on users for backwards-compatibility reasons should be avoided as much as possible.

I've had an idea: rather than using r->path_info, we add a path_info member to rewrite_ctx, and refer to that within mod_rewrite. It would be initialised to r->path_info, but could be set to NULL when a substitution is made. This way the request's PATH_INFO information is not actually modified by mod_rewrite.

Any comments?


Earlier I said
> The flag is stored in r->notes - I can't see where better to put it. It can't
> go in the ctx struct because the scope of that struct is a single rewrite list.

I no longer think this is true. r->path_info is regenerated at the end of a rewrite list (providing at least one rule matched) so we can start afresh for each rewrite list.
Comment 26 Dale 2009-09-07 02:33:37 UTC
How I can to use this path on windows version of apache 2.2?
When this bug will be fixed in binaries?
Comment 27 Dale 2009-09-07 03:20:13 UTC
Sorry. Needs use DPI flag to fix this like.
Comment 28 Eric Covener 2009-09-07 06:42:14 UTC
marking fixed, Release in 2.2.12
Comment 29 Jake 2018-03-26 22:11:11 UTC
This is not fixed, because the fix is hiding behind a `DPI` flag.  It has just wasted several hours of my life, and presumably countless others', because it is not properly fixed.

Please make the correct behaviour the default.  I cannot see why anyone would want the broken behaviour, but that would be the behaviour to have a flag to enable.
Comment 30 Jake 2018-03-26 22:26:23 UTC
To expand on this, the bug is subtle, as it only occurs when part of the path matches a file or directory, thus many users will have this latent bug if they are not using the DPI flag and have not tested that specific case.
Comment 31 Eric Covener 2018-03-26 22:51:05 UTC
Please open a new bug if you want to suggest a default change for a future release. This can't safely be changed in a stable release.