Project Tungsten (Spark 1.5 Phase 1)

    Details

    • Epic Name:
      Tungsten Phase 1

      Description

      Based on our observations, the majority of Spark workloads are bottlenecked not by I/O or network, but by CPU and memory. This project focuses on three areas that improve the memory and CPU efficiency of Spark applications, to push performance closer to the limits of the underlying hardware.

      Memory Management and Binary Processing

      • Avoid long-lived (non-transient) Java objects by storing data in binary format, which reduces GC overhead.
      • Minimize memory usage through a denser in-memory data format, which means less spilling.
      • Track memory usage accurately (actual size in bytes) rather than relying on heuristics.
      • For operators that understand data types (as with DataFrames and SQL), work directly against the binary format in memory, with no serialization or deserialization.
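
      As a rough illustration of the binary-format idea, the sketch below packs rows back to back into one byte buffer and reads a single field straight from its byte offset. It is written in Python for brevity and is not Spark's actual row format; the three-column layout and the names `ROW` and `BinaryRowBuffer` are assumptions made up for this example.

```python
import struct

# Hypothetical fixed-width row layout: (id: int64, score: float64, age: int32).
# "<" means little-endian with no padding, so each row is exactly 20 bytes.
ROW = struct.Struct("<qdi")
SCORE_OFFSET = 8   # after the 8-byte id
AGE_OFFSET = 16    # after the 8-byte id and the 8-byte score


class BinaryRowBuffer:
    """Stores all rows in one bytearray instead of one object per row."""

    def __init__(self):
        self._buf = bytearray()

    def append(self, row_id, score, age):
        self._buf += ROW.pack(row_id, score, age)

    def __len__(self):
        return len(self._buf) // ROW.size

    def size_in_bytes(self):
        # Exact accounting of the stored payload: no per-object estimate needed.
        return len(self._buf)

    def age_at(self, i):
        # Read one field from its byte offset; the other fields are never decoded.
        return struct.unpack_from("<i", self._buf, i * ROW.size + AGE_OFFSET)[0]

    def sum_scores(self):
        # An operator that knows the schema can aggregate directly on the bytes.
        total = 0.0
        for i in range(len(self)):
            total += struct.unpack_from("<d", self._buf, i * ROW.size + SCORE_OFFSET)[0]
        return total


rows = BinaryRowBuffer()
rows.append(1, 2.5, 30)
rows.append(2, 4.0, 41)
```

      Because every row has the same width, the payload size is simply the number of rows times `ROW.size`, and the garbage collector sees one buffer rather than one object per row.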

      Cache-aware Computation

      • Faster sorting and hashing for aggregations, joins, and shuffle
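
      The description does not say how sorting is made cache-aware. One common technique, assumed here purely for illustration, is to sort a compact array of (key prefix, record index) pairs so that most comparisons touch only the prefix and the full record is read only on ties. The function names below are hypothetical, and the Python sketch shows the data layout rather than the performance benefit.

```python
import struct
from functools import cmp_to_key


def key_prefix(key: bytes) -> int:
    """First 8 bytes of the key as a big-endian unsigned integer, zero padded."""
    return struct.unpack(">Q", key[:8].ljust(8, b"\x00"))[0]


def prefix_sort(keys):
    """Return the indices of `keys` (a list of bytes) in ascending key order."""
    # Compact sort array: one (prefix, index) pair per record.
    entries = [(key_prefix(k), i) for i, k in enumerate(keys)]

    def compare(a, b):
        # Fast path: the prefixes alone decide the order.
        if a[0] != b[0]:
            return -1 if a[0] < b[0] else 1
        # Slow path: equal prefixes, so look at the full keys.
        ka, kb = keys[a[1]], keys[b[1]]
        return (ka > kb) - (ka < kb)

    entries.sort(key=cmp_to_key(compare))
    return [i for _, i in entries]


keys = [b"pear", b"apple", b"applesauce-long", b"applesauce", b"fig"]
order = prefix_sort(keys)
```

      Here `b"applesauce"` and `b"applesauce-long"` share the same 8-byte prefix, so only that pair falls back to a full-key comparison.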

      Code Generation

      • Faster expression evaluation and DataFrame/SQL operators
      • Faster serializer
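
      To make the code-generation idea concrete, the sketch below contrasts interpreting an expression tree for every row with generating source code for the expression once and compiling it into a plain function. It is a minimal Python illustration under assumed names (`interpret`, `to_source`, `compile_expr`, and a tuple-based expression encoding), not Spark's code generator.

```python
# Expression tree encoded as nested tuples (an assumption for this sketch):
#   ("col", name), ("lit", value), ("add", l, r), ("mul", l, r), ("gt", l, r)
_OPS = {"add": "+", "mul": "*", "gt": ">"}


def interpret(expr, row):
    """Evaluate by walking the tree: one dispatch per node, per row."""
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "lit":
        return expr[1]
    left = interpret(expr[1], row)
    right = interpret(expr[2], row)
    if op == "add":
        return left + right
    if op == "mul":
        return left * right
    if op == "gt":
        return left > right
    raise ValueError(f"unknown operator: {op}")


def to_source(expr):
    """Translate the tree into a Python source expression."""
    op = expr[0]
    if op == "col":
        return f"row[{expr[1]!r}]"
    if op == "lit":
        return repr(expr[1])
    return f"({to_source(expr[1])} {_OPS[op]} {to_source(expr[2])})"


def compile_expr(expr):
    """Generate a function for the whole expression and compile it once."""
    source = f"def evaluate(row):\n    return {to_source(expr)}\n"
    namespace = {}
    exec(source, namespace)  # the tree is trusted input in this sketch
    return namespace["evaluate"]


# price * qty + 5 > 100
expr = ("gt", ("add", ("mul", ("col", "price"), ("col", "qty")), ("lit", 5)), ("lit", 100))
evaluate = compile_expr(expr)
row = {"price": 20, "qty": 5}
```

      The compiled `evaluate` performs no tree traversal or operator dispatch at run time; that work is paid once, when the function is generated.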

      Several parts of Project Tungsten leverage the DataFrame model, which gives us more semantic information about the application. We will also retrofit the improvements onto Spark’s RDD API whenever possible.

      This epic tracks work items for Spark 1.5. More tickets can be found in:

      SPARK-7075: Tungsten-related work in Spark 1.5
      SPARK-9697: Tungsten-related work in Spark 1.6

            Activity

            irashid Imran Rashid added a comment -

            Everything sounds awesome, but can we see design docs & longer review timelines? There are a lot of massive changes proposed here.

            rxin Reynold Xin added a comment -

            Yup I will post more thoughts and plans in the next few days.

            ilganeli Ilya Ganelin added a comment -

            This looks like the result of a large internal Databricks effort - are there pieces of this where you could use external help or is this issue in place primarily to document migration of internal code?

            irashid Imran Rashid added a comment -

            Thanks Reynold. But to be clear, there was another component of the request – I would like to have longer public review periods for massive changes.

            rxin Reynold Xin added a comment -

            Actually please go review it. It is hidden under the flag that is not turned on, and has almost no changes to existing code. Would love more feedback.


              People

              • Assignee:
                rxin Reynold Xin
              • Reporter:
                rxin Reynold Xin
              • Votes:
                9
              • Watchers:
                79