Fixing this completely would require a significant retrofit of the client side, as well as the ability to do paginated batch retrieval from the metastore.
A quick solution that goes a good deal of the way, however, is as follows:
a) Change some usages of List<Partition> to Iterable<Partition>, and introduce a PartitionIterable class that implements Iterable<Partition> and fetches partitions lazily, on demand. A pagination scheme in the metastore would be ideal, but a workable short-term solution is to store only the partition names rather than the entire partition objects. In the meantime, a PartitionIterable can fetch the list of partition names up front and handle the pagination itself.
This solves the OOM issues on the metastore completely, and gets rid of both the thrift copy problem and the List<Partition> deepcopy problem. It does introduce the cost of holding all the partition names in memory, but that is far cheaper than holding the full partition objects.
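To make the idea in (a) concrete, here is a minimal sketch of an Iterable that keeps only names in memory and fetches the full objects one batch at a time. It is not the actual PartitionIterable: the class name BatchedIterable, the batchSize parameter and the fetcher callback are all hypothetical, and the fetcher stands in for whatever metastore call resolves a list of partition names to partition objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Hypothetical sketch: holds only the names, and resolves them to full
// objects one batch at a time while the caller iterates.
public class BatchedIterable<T> implements Iterable<T> {
    private final List<String> names;
    private final int batchSize;
    // Assumption: stands in for a "get objects by names" metastore call.
    private final Function<List<String>, List<T>> fetcher;

    public BatchedIterable(List<String> names, int batchSize,
                           Function<List<String>, List<T>> fetcher) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.names = new ArrayList<>(names);
        this.batchSize = batchSize;
        this.fetcher = fetcher;
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int nextNameIndex = 0;
            private Iterator<T> currentBatch = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                // Fetch the next batch only once the current one is exhausted.
                while (!currentBatch.hasNext() && nextNameIndex < names.size()) {
                    int end = Math.min(nextNameIndex + batchSize, names.size());
                    currentBatch = fetcher.apply(names.subList(nextNameIndex, end)).iterator();
                    nextNameIndex = end;
                }
                return currentBatch.hasNext();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return currentBatch.next();
            }
        };
    }
}
```

With this shape, at most one batch of full objects is referenced by the iterator at any time, so callers that only loop over the result can switch from List to Iterable without other changes.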
b) Change the JSON serialization to write out each element as it is produced, rather than constructing one large JSONObject and writing that out in one go. This solves the large JSONObject problem.
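The streaming approach in (b) could look roughly like the sketch below, which writes a JSON array element by element to a Writer instead of building the whole document in memory. The class name StreamingJsonArray, the write method and the toJson callback are hypothetical, and the sketch assumes the callback returns valid JSON text for a single element.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.function.Function;

// Hypothetical sketch: emits {"<fieldName>":[e1,e2,...]} one element at a
// time, so only a single serialized element is held in memory at once.
public final class StreamingJsonArray {
    private StreamingJsonArray() {}

    // Assumptions: fieldName needs no JSON escaping, and toJson returns
    // valid JSON text for one element.
    public static <T> void write(Writer out, String fieldName,
                                 Iterable<T> elements,
                                 Function<T, String> toJson) throws IOException {
        out.write("{\"" + fieldName + "\":[");
        boolean first = true;
        for (T element : elements) {
            if (!first) {
                out.write(",");
            }
            out.write(toJson.apply(element));
            first = false;
        }
        out.write("]}");
    }
}
```

Because the method takes an Iterable, it can be fed directly from a lazily fetching iterable, so neither the partition objects nor their serialized form are ever fully materialized.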
This still does not solve the problem of having a large number of ReadEntities, but that is better tackled by something like a metadata-only export, or by changing export so that it can export one partial partition specification at a time. Both are the subjects of further JIRAs I will be filing shortly.