I was using the Web Proxy to capture a transaction on a website that was sending a malformed Content-Type in the response. The content type was of the format: charset=x;charset=x. After extraction, storing the sample failed, showing a 503 error in the browser and an UnsupportedEncodingException in the log (the encoding passed to the String constructor was of the form "x;charset=x").

Looking into the error, I saw that the problem occurs in the Proxy.getContentEncoding method. I also noticed that there are three getContentEncoding methods at different levels (document, etc.), all of which use slightly different versions of the indexOf-followed-by-substring strategy for extraction.

I solved my problem by using the regexp below and defaulting to the platform encoding if the expression does not match:

Pattern p = Pattern.compile("charset=([\\d\\w-]+)");

This simply extracts the first charset value.

I think the Proxy needs to be more forgiving in its parsing of the Content-Type and, if it is malformed, use a sensible default as browsers do. I also think that the Content-Type extractor should check whether the encoding extracted is actually one supported by the Java platform that is running JMeter. It is easier to debug if the failure occurs closer to the source of the problem instead of on the store of the Sampler.

This problem occurred on an SVN tip build of 28th of March and is still occurring as far as I know.
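The workaround described in this report could be sketched roughly as follows. This is only an illustration, not the code from the reporter's patch or from JMeter: the class name CharsetRegexSketch and the method name encodingOf are hypothetical, and only the regular expression itself is taken from the report.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JMeter.
class CharsetRegexSketch {
    // The pattern quoted in the report: captures the first run of
    // word characters, digits and '-' that follows "charset=".
    private static final Pattern CHARSET = Pattern.compile("charset=([\\d\\w-]+)");

    // Returns the first charset value found in the header, or the
    // platform default encoding name when nothing matches.
    static String encodingOf(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                return m.group(1);
            }
        }
        return Charset.defaultCharset().name();
    }
}
```

With this sketch, a header of the malformed form charset=UTF-8;charset=UTF-8 yields UTF-8 rather than UTF-8;charset=UTF-8, and a header with no charset parameter yields the platform default.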
According to http://www.iana.org/assignments/character-sets charset names may be up to 40 characters of the US-ASCII character set, which would include ";" and space, tab etc. The document cited above lists some charset formal names which also include "." and "-". Unless you can find a formal definition of the allowable characters for a charset, I think it would be safest to assume only that ";" is not allowed.

This is what has been fixed in SVN in:
http://svn.apache.org/viewvc?rev=648909&view=rev
http://svn.apache.org/viewvc?rev=648910&view=rev
http://svn.apache.org/viewvc?rev=648916&view=rev
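For illustration, an extraction that assumes only that ";" cannot appear in a charset name might look like the sketch below. This is not the code from the SVN revisions above; the class name CharsetDelimiterSketch and the method name charsetParam are hypothetical, and the case-sensitive match on "charset=" and the null return for a missing parameter are simplifying assumptions.

```java
// Hypothetical helper, not the JMeter implementation.
class CharsetDelimiterSketch {
    // Returns the text between the first "charset=" and the next ';'
    // (or the end of the header), trimmed; null if there is none.
    // Assumption: the parameter name is matched case-sensitively.
    static String charsetParam(String contentType) {
        if (contentType == null) {
            return null;
        }
        final String key = "charset=";
        int start = contentType.indexOf(key);
        if (start < 0) {
            return null;
        }
        start += key.length();
        int end = contentType.indexOf(';', start);
        String value = (end < 0)
                ? contentType.substring(start)
                : contentType.substring(start, end);
        value = value.trim();
        return value.isEmpty() ? null : value;
    }
}
```

Unlike a character-class pattern, this accepts any character other than ";" in the name, so names containing "." or other registered characters pass through unchanged.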
(In reply to comment #1)
> According to
>
> http://www.iana.org/assignments/character-sets
> Unless you can find a formal definition of the allowable characters for a
> charset, I think it would be safest to assume only that ";" is not allowed.

I'm happy with this as a solution for the extraction. The refactor to a Helper class makes the code easier to read too.

But if the server specifies a charset that the JVM doesn't support, you will still get an invalid captured sampler. I think that the new conversion method should call Charset.isSupported on the captured charset and, if the answer is false, return Charset.defaultCharset().name(). That way getEncodingFromContentType will always return a valid encoding for the platform.
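The check proposed here could be sketched as below. It is a minimal illustration under stated assumptions, not the change that was committed: the class name SupportedCharsetSketch and the method name supportedOrDefault are hypothetical. One detail worth noting is that Charset.isSupported throws IllegalCharsetNameException for syntactically illegal names (such as "x;charset=x") instead of returning false, so the sketch treats that case as unsupported too.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not the JMeter implementation.
class SupportedCharsetSketch {
    // Returns the given name if this JVM supports it, otherwise the
    // name of the platform default charset.
    static String supportedOrDefault(String name) {
        if (name != null) {
            try {
                if (Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalCharsetNameException e) {
                // Illegal characters in the name: fall through to the default.
            }
        }
        return Charset.defaultCharset().name();
    }
}
```

Applying such a check at extraction time means a bad charset is caught where the header is parsed and not later, when the sampler is stored.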
OK, added check for unsupported Charset: http://svn.apache.org/viewvc?rev=650928&view=rev
This issue has been migrated to GitHub: https://github.com/apache/jmeter/issues/2094