Description
Since the data is serialized on the Python side, there's not much point in keeping it as byte arrays in Java, or even in skipping compression. We should make cache() in PySpark use MEMORY_ONLY_SER and turn on spark.rdd.compress for it.
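To illustrate why compression is worth turning on for data that is already serialized, here is a minimal standalone sketch. It does not use Spark. The `records` list is a hypothetical stand-in for one partition, `pickle` is assumed as a stand-in for whatever serializer the Python side uses, and `zlib` is only a stand-in for the codec Spark applies when spark.rdd.compress is enabled.

```python
import pickle
import zlib

# Hypothetical records standing in for one cached partition.
records = [{"id": i, "name": "user-%d" % (i % 10)} for i in range(1000)]

# The Python side serializes the records, so the JVM would only ever hold bytes.
serialized = pickle.dumps(records, protocol=2)

# zlib is an assumption here: a stand-in for Spark's block compression codec.
compressed = zlib.compress(serialized)

# Round trip: decompress, then deserialize back into Python objects.
restored = pickle.loads(zlib.decompress(compressed))

print("serialized bytes:", len(serialized))
print("compressed bytes:", len(compressed))
print("round trip ok:", restored == records)
```

Because the cached payload is an opaque byte string either way, compressing it costs only CPU time and does not change what the Python side reads back.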