Details
- Improvement
- Status: Closed
- Critical
- Resolution: Fixed
- 0.9.8
- None
- None
Description
MRQL data (MRData) is serialized as Writable (for Hadoop Map-Reduce), Java Serializable (for Spark), and CopyableValue (for Flink). Until now, the Spark MRQL engine used a wrapper for MRData (called MRContainer) to serialize data with the Writable methods. Some data used in Spark mode, however, was left unwrapped, so Spark fell back to the default Java serialization for it, which was inefficient. With this patch, MRData itself becomes Serializable, with custom serialization methods that are very efficient. In my performance evaluation, the Pagerank query over 10 million links, run on a cluster with 16 cores, shows a 38% improvement over the old Spark evaluation.
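The general technique can be sketched as follows. This is a minimal illustration, not the code from the patch: the class `SerializableLong` and its `roundTrip` helper are hypothetical stand-ins for an MRData value, and the sketch assumes the custom methods simply delegate to Writable-style `write`/`readFields` methods. Supplying private `writeObject`/`readObject` methods lets Java serialization emit the compact encoding instead of reflecting over the fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for an MRData value; not the actual MRQL class.
class SerializableLong implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: the field is written by the custom methods below,
    // not by the default field-by-field mechanism.
    private transient long value;

    SerializableLong(long value) { this.value = value; }

    long get() { return value; }

    // Writable-style compact encoding (same shape as Hadoop's Writable methods).
    void write(DataOutput out) throws IOException { out.writeLong(value); }

    void readFields(DataInput in) throws IOException { value = in.readLong(); }

    // Hooks picked up by Java serialization; ObjectOutputStream is a DataOutput
    // and ObjectInputStream is a DataInput, so the methods above can be reused.
    private void writeObject(ObjectOutputStream out) throws IOException {
        write(out);
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        readFields(in);
    }

    // Serializes and deserializes a value through Java serialization.
    static SerializableLong roundTrip(SerializableLong v) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(v);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SerializableLong) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this shape, a value that reaches Spark unwrapped still goes through the compact encoding, because any `ObjectOutputStream` that serializes it calls the custom `writeObject` rather than the default mechanism.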