[HTTPCLIENT-1990] URIUtils.rewriteURI mangles Unicode characters - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Invalid
Affects Version/s: 4.5.8
Fix Version/s: None
Component/s: HttpCache
Labels: None

Description

The following test case illustrates a problem with URIUtils that I have encountered:

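import java.net.URI;
import org.apache.http.client.utils.URIUtils;
import org.springframework.web.util.UriComponentsBuilder;

public class Main {
  public static void main(String[] args) throws Exception {
    // Build a URI whose last path segment contains non-ASCII characters.
    URI uri = UriComponentsBuilder.fromUriString("https://host/path")
      .pathSegment("üñîçøðé")
      .build()
      .toUri();
    // Raw and decoded path before the rewrite.
    System.out.printf("rawPath = %s\n", uri.getRawPath());
    System.out.printf("path    = %s\n", uri.getPath());

    // Rewrite with the flags that also normalise the path.
    uri = URIUtils.rewriteURI(uri, null, URIUtils.DROP_FRAGMENT_AND_NORMALIZE);
    // Raw and decoded path after the rewrite, for comparison.
    System.out.printf("rawPath = %s\n", uri.getRawPath());
    System.out.printf("path    = %s\n", uri.getPath());
  }
}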

I encountered the issue after upgrading: previous versions of httpclient did not perform path normalisation (the main caller of URIUtils.rewriteURI is ProtocolExec in the HTTP client) and effectively applied only URIUtils.DROP_FRAGMENT, so users who upgrade get the new normalisation behaviour unexpectedly.

The bug exhibited by URIUtils.rewriteURI is actually caused by URLEncodedUtils.urlDecode (reached from URIBuilder's constructor, which calls URIBuilder.parsePath), which does something truly nasty. It takes a String (a logical sequence of Unicode code points), wraps it in a CharBuffer, then iterates over it, narrowing each char to a byte. Strange, but true.
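To make the effect of that char-to-byte narrowing concrete, here is a minimal standalone sketch. It is not the httpclient source: the class NarrowingDecodeSketch and the helper narrowingDecode are hypothetical names, and the sketch assumes the resulting bytes are decoded as UTF-8. It only shows that ASCII input survives such a round trip while characters above U+007F do not.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class NarrowingDecodeSketch {

    // Hypothetical helper, not the library code: copies each char into a
    // byte buffer with a narrowing cast, then decodes the bytes as UTF-8.
    static String narrowingDecode(String content) {
        ByteBuffer bytes = ByteBuffer.allocate(content.length());
        CharBuffer chars = CharBuffer.wrap(content);
        while (chars.hasRemaining()) {
            // The cast keeps only the low 8 bits of the 16-bit char.
            bytes.put((byte) chars.get());
        }
        bytes.flip();
        return StandardCharsets.UTF_8.decode(bytes).toString();
    }

    public static void main(String[] args) {
        String ascii = "path";
        String accented = "\u00fc\u00f1\u00ee"; // the first three characters of the test segment

        // ASCII chars fit in one byte each, so the round trip is lossless.
        System.out.println(narrowingDecode(ascii).equals(ascii));
        // Non-ASCII chars become bytes that are not valid UTF-8 on their own.
        System.out.println(narrowingDecode(accented).equals(accented));
    }
}
```

Under these assumptions the first comparison prints true and the second prints false, which matches the kind of corruption described above.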

Unicode characters in a java.net.URI are legal, as far as I can tell, and should simply be escaped as percent-encoded UTF-8 bytes, as in the value returned by URI.getRawPath, but not in the unescaped value returned by URI.getPath, which is what URIUtils.rewriteURI uses.
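The difference between the escaped and unescaped forms can be seen with java.net.URI alone. The following is a small sketch, with the class RawVersusDecodedPath and the helper asciiForm chosen purely for illustration, and "https://host" used as a placeholder authority as in the test case above.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class RawVersusDecodedPath {

    // Illustrative helper: builds a URI from an unescaped path and returns
    // its US-ASCII form, in which non-ASCII characters are percent-encoded
    // as UTF-8 bytes.
    static String asciiForm(String path) throws URISyntaxException {
        return new URI("https", "host", path, null).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // A URI whose path is already percent-encoded.
        URI escaped = URI.create("https://host/path/%C3%BC");

        // getRawPath keeps the percent-encoded octets as written.
        System.out.println(escaped.getRawPath());
        // getPath decodes them back to the Unicode character U+00FC.
        System.out.println(escaped.getPath().equals("/path/\u00fc"));

        // Escaped form of a URI built from an unescaped path.
        System.out.println(asciiForm("/path/\u00fc"));
    }
}
```

Code that re-parses the result of getPath therefore has to cope with non-ASCII characters, whereas the getRawPath and toASCIIString forms shown here contain only ASCII.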

People

Assignee: Unassigned

Reporter: Nicholas Wilson

Votes: 0

Watchers: 2

Dates

Created: 22/May/19 18:23

Updated: 23/May/19 10:12

Resolved: 23/May/19 10:12