[MRQL-98] Improve Data Serialization in Spark Evaluation - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.9.8
Fix Version/s: None
Component/s: Run-Time/Spark
Labels: None

Description

MRQL data (MRData) are serialized as Writable (for Hadoop Map-Reduce), Java Serializable (for Spark), and CopyableValue (for Flink). Until now, the Spark MRQL engine used a wrapper for MRData (called MRContainer) to serialize data with the Writable methods. Some data used in Spark mode were left unwrapped, however, so Spark fell back to the default Java serialization for them, which was inefficient. With this patch, MRData becomes Serializable with custom serialization methods that are very efficient. In my performance evaluation, the PageRank query over 10 million links, run on a cluster with 16 cores, shows a 38% improvement over the old Spark evaluation.
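
The general technique described above can be sketched as follows. This is only an illustration, not the code from the patch: `LongPair` is a hypothetical stand-in for an MRData value, and its `write`/`readFields` methods merely imitate the shape of Hadoop's Writable interface without depending on Hadoop. The class is Serializable, but its private `writeObject`/`readObject` hooks delegate to the compact encoding, so Java serialization (which Spark uses by default) writes the fields directly rather than through default field serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSketch {
    /** Hypothetical stand-in for an MRData value; not the actual MRQL class. */
    public static final class LongPair implements Serializable {
        private static final long serialVersionUID = 1L;

        // transient: excluded from Java's default field serialization
        private transient long first;
        private transient long second;

        public LongPair(long first, long second) {
            this.first = first;
            this.second = second;
        }

        public long first() { return first; }
        public long second() { return second; }

        // Writable-style compact encoding: two raw longs, no field metadata.
        public void write(DataOutput out) throws IOException {
            out.writeLong(first);
            out.writeLong(second);
        }

        public void readFields(DataInput in) throws IOException {
            first = in.readLong();
            second = in.readLong();
        }

        // Hooks called by Java serialization. ObjectOutputStream and
        // ObjectInputStream implement DataOutput and DataInput, so the
        // compact encoding can be reused as is.
        private void writeObject(ObjectOutputStream out) throws IOException {
            write(out);
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            readFields(in);
        }
    }

    /** Serializes a value with standard Java serialization. */
    public static byte[] serialize(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    /** Reads back a value written by serialize. */
    public static Object deserialize(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        LongPair copy = (LongPair) deserialize(serialize(new LongPair(3L, 42L)));
        System.out.println(copy.first() + " " + copy.second()); // prints "3 42"
    }
}
```

With this arrangement, a value that reaches Spark unwrapped would still go through the custom encoding, since the hooks live on the data class itself rather than on a separate container.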

Issue Links

links to

GitHub Pull Request #28

People

Assignee: Leonidas Fegaras

Reporter: Leonidas Fegaras

Votes: 0

Watchers: 3

Dates

Created: 19/Oct/16 23:44

Updated: 20/Oct/16 21:33

Resolved: 20/Oct/16 21:33