[SPARK-7075] Project Tungsten (Spark 1.5 Phase 1) - ASF JIRA

Details

Type: Epic
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.5.0
Component/s: Block Manager, Shuffle, Spark Core, SQL
Labels: None
Epic Name: Tungsten Phase 1
Target Version/s: 1.5.0

Description

Based on our observations, the majority of Spark workloads are bottlenecked not by I/O or network, but by CPU and memory. This project focuses on three areas that improve the memory and CPU efficiency of Spark applications, pushing performance closer to the limits of the underlying hardware.

Memory Management and Binary Processing

Avoiding non-transient Java objects by storing data in a binary format, which reduces GC overhead.
Minimizing memory usage through a denser in-memory data format, which means we spill less.
Accounting for memory precisely (actual size in bytes) rather than relying on heuristics.
Letting operators that understand data types (as in DataFrames and SQL) work directly against the binary format in memory, with no serialization or deserialization.
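
To make the points above concrete, here is a minimal Java sketch of the general idea, not Spark's actual implementation. The class `BinaryRows` and its fixed 16-byte layout (a `long` id followed by a `double` value) are hypothetical names and assumptions chosen for illustration. Rows live in one `ByteBuffer` instead of one Java object per row, the memory footprint is an exact byte count, and the aggregation reads fields straight from the binary data.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration class; not part of Spark.
final class BinaryRows {
    // Assumed fixed-width layout: 8-byte long id, then 8-byte double value.
    static final int ROW_SIZE = 16;

    private final ByteBuffer buf;
    private int numRows = 0;

    BinaryRows(int capacityRows) {
        // One backing buffer for all rows, so the GC tracks a single object.
        buf = ByteBuffer.allocate(capacityRows * ROW_SIZE);
    }

    void append(long id, double value) {
        int offset = numRows * ROW_SIZE;
        buf.putLong(offset, id);          // absolute write of the id field
        buf.putDouble(offset + 8, value); // absolute write of the value field
        numRows++;
    }

    // Exact memory accounting: bytes actually occupied by rows.
    long sizeInBytes() {
        return (long) numRows * ROW_SIZE;
    }

    // Aggregates directly over the binary format; no row objects are created.
    double sumValues() {
        double sum = 0.0;
        for (int i = 0; i < numRows; i++) {
            sum += buf.getDouble(i * ROW_SIZE + 8);
        }
        return sum;
    }
}
```

A real engine would also need variable-length fields and null tracking, which this sketch leaves out to keep the layout obvious.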

Cache-aware Computation

Faster sorting and hashing for aggregations, joins, and shuffle
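
One common way to make sorting cache-friendly is to sort a compact array that holds the sort key next to a record pointer, so comparisons scan contiguous memory instead of dereferencing each record. The Java sketch below illustrates that idea under simplifying assumptions: the class `PrefixSort` is hypothetical, the keys are plain `int` values, and an array index stands in for the record pointer. It is not taken from Spark's code.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Spark.
final class PrefixSort {
    // Returns the record indices ordered by ascending key (ties by index).
    static int[] sortedOrder(int[] keys) {
        long[] packed = new long[keys.length];
        for (int i = 0; i < keys.length; i++) {
            // High 32 bits: the key. Low 32 bits: the record index ("pointer").
            packed[i] = ((long) keys[i] << 32) | (i & 0xFFFFFFFFL);
        }
        // Sorting primitive longs compares keys without touching the records.
        Arrays.sort(packed);

        int[] order = new int[keys.length];
        for (int i = 0; i < packed.length; i++) {
            order[i] = (int) packed[i]; // recover the index from the low bits
        }
        return order;
    }
}
```

For keys longer than the packed prefix, the prefix comparison would only decide most cases, and ties would fall back to comparing the full records.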

Code Generation

Faster expression evaluation and DataFrame/SQL operators
Faster serializer

Several parts of project Tungsten leverage the DataFrame model, which gives us more semantics about the application. We will also retrofit the improvements onto Spark’s RDD API whenever possible.

This epic tracks work items for Spark 1.5. More tickets can be found in:

~~SPARK-7075~~: Tungsten-related work in Spark 1.5
~~SPARK-9697~~: Tungsten-related work in Spark 1.6

Issue Links

relates to: SPARK-8159 Improve expression function coverage (Spark 1.5) (Resolved)

People

Assignee: Reynold Xin

Reporter: Reynold Xin

Votes: 9

Watchers: 73

Dates

Created: 23/Apr/15 07:05

Updated: 12/Aug/15 17:09

Resolved: 12/Aug/15 17:09