Dmitriy Selivanov: PySpark uses pickle and CloudPickle on the Python side and net.razorvine.pickle on the JVM side for serialization/deserialization between Python and the JVM. For R, however, there is no comparable library that can serialize to and deserialize from R's serialization format. So currently SparkR relies on readBin()/writeBin() on the R side and Java's DataInputStream/DataOutputStream on the JVM side, exploiting the fact that simple types such as integers, doubles, and byte arrays share the same binary layout on both ends.
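To illustrate why this works without a pickle-style library: DataOutputStream/DataInputStream use a fixed big-endian layout for primitives, which the R side can match with readBin()/writeBin() by passing endian = "big". A minimal round-trip sketch of that shared layout (standalone, not SparkR's actual wire protocol):

```java
import java.io.*;

public class PrimitiveRoundTrip {
    public static void main(String[] args) throws IOException {
        // JVM side: DataOutputStream writes big-endian primitives,
        // the same layout R's readBin(..., endian = "big") expects.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(42);                         // 4-byte big-endian integer
        out.writeDouble(3.14);                    // 8-byte IEEE 754 double
        byte[] payload = "hello".getBytes("UTF-8");
        out.writeInt(payload.length);             // length-prefixed byte array
        out.write(payload);
        out.flush();

        // Reverse direction: DataInputStream decodes the same layout,
        // mirroring what writeBin(..., endian = "big") produces in R.
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(in.readInt());         // 42
        System.out.println(in.readDouble());      // 3.14
        byte[] back = new byte[in.readInt()];
        in.readFully(back);
        System.out.println(new String(back, "UTF-8")); // hello
    }
}
```

This only covers primitives and byte arrays; anything structured (lists, environments, attributes) is exactly what the missing pickle-like library would have to handle.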
For collect(), serialization/deserialization happens interleaved with the socket communication. I suspect there is considerable overhead from the many small socket reads/writes this causes. Maybe we could batch it instead: serialize part of the collected result into an in-memory buffer and transfer that buffer back in one go. Would you be interested in doing a prototype to see whether there is any performance improvement?
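A rough sketch of the batching idea, with hypothetical serializeBatch/deserializeBatch helpers (these names are not from SparkR): instead of one socket write per element, a whole batch is serialized into an in-memory buffer and shipped in a single write.

```java
import java.io.*;

public class BatchedTransfer {
    // Hypothetical helper: serialize a whole batch into memory first,
    // so the socket sees one large write instead of many small ones.
    static byte[] serializeBatch(double[] values) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(values.length);      // element-count header
        for (double v : values) {
            out.writeDouble(v);           // buffered in memory, no I/O yet
        }
        out.flush();
        return buf.toByteArray();         // ship this in ONE socket write
    }

    // Hypothetical receiving side: decode the count, then the elements.
    static double[] deserializeBatch(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes));
        double[] values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = in.readDouble();
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        double[] part = {1.0, 2.5, -3.75};
        byte[] wire = serializeBatch(part);   // one buffer, one write
        double[] back = deserializeBatch(wire);
        System.out.println(back.length);      // 3
        System.out.println(back[2]);          // -3.75
    }
}
```

The prototype would mainly need to pick a batch size that caps memory use while still amortizing the per-write cost.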
Another idea would be to introduce something like net.razorvine.pickle for R's serialization format, but that sounds like a lot of effort.