With jclouds 1.9.3-SNAPSHOT and 2.0.0-SNAPSHOT (and all GA versions)...
If jclouds receives a crazily big http response (e.g. see https://github.com/jclouds/jclouds/pull/1020), then jclouds consumes a huge amount of memory while processing that response. It holds multiple copies of the response in-memory at the same time.
As reported in https://issues.apache.org/jira/browse/BROOKLYN-364, the memory usage (according to jmap -histo:live <pid>) is most char arrays (which is what backs the StringBuilder and String). When processing a 154MB response payload, the memory used by char arrays goes something like this: 66MB -> 104MB -> 323MB -> 625MB -> 939MB -> 461MB -> 633MB -> 327MB -> 21MB.
(I don't know how much to believe the :live: is that definitely the size of char arrays that could not be GC'ed?)
There are two main areas where this memory is consumed.
In the jclouds wire log, the following happens (approximately?!):
- We read the response input stream, writing it to a FileBackedOutputStream - nice!
- In Wire.wire(String, InputStream), we read the FileBackedOutputStream into a StringBuilder, and then call getWireLog().debug(buffer.toString()):
The StringBuilder holds the char; the toString() duplicates it - so two copies in memory.
Unfortunately in the Softlayer example, it's all one huge line, so we logged it all in one go.
I think (but need to dig into it more) that the logging framework (slf4j -> log4j in my case) ends up with multiple copies as well, while processing the call to log.debug(String). (Hence peaking at 939MB of char arrays in memory).
When the method returns, all these copies can be GC'ed.
- The response payload has now been switched to the FileBackedOutputStream, so that will be used subsequently.
To parse the HTTP response:
- The response is passed to org.jclouds.http.functions.ParseJson.apply(HttpResponse)
- This calls json.fromJson(Strings2.toStringAndClose(stream), type).
- The Strings2.toStringAndClose calls CharStreams.toString(new InputStreamReader(input, Charsets.UTF_8)).
This reads the stream into a StringBuilder, then calls toString - so holds two copies in memory.
This explains the second spike in memory usage (though I'm surprised it went as high as 633MB of char arrays in memory).
When the method returns, we have our one String.
Possible Improvements to Wire Log
org.jclouds.logging.internal.Wire could be configurable to only log the first X bytes of a response (so crazily long messages would be truncated in the log).
Alternatively/additionally, Wire.wire could force splitting a huge line into multiple log messages when calling getWireLog().debug(buffer.toString()). Again this could be configurable.
In production usage, I personally would always configure it to truncate: better to miss the end of the response rather than risk an OutOfMemoryError. Note this particular part isn't an issue if jclouds.wire is set to INFO or higher.
Possible Improvements to Json Parsing
I think ParseJson.apply(InputStream stream, Type type) should pass the new InputStreamReader(inputStream) to org.jclouds.json.Json.fromJson() (in a try-finally to close the stream, obviously).
This would require adding a new method to the interface org.jclouds.json.Json.
The implementation of GsonWrapper can handle that easily: it can call gson.fromJson(Reader, Type) instead of gson.fromJson(String, Type). It looks to me like that will parse the stream as it reads it, rather than having to hold the whole string in memory at once.
If we do these improvements, I don't think we'll ever hold the full 154MB char array in memory.