Spark / SPARK-14679

UI DAG visualization causes OOM generating data


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.1
    • Fix Version/s: 1.6.2, 2.0.0
    • Component/s: Web UI
    • Labels: None

    Description

      The UI will hit an OutOfMemoryError when generating the DAG visualization data for large Hive table scans. The problem is that cluster data is duplicated in the output once for each RDD, as with cluster10 here:

      digraph G {
        subgraph clusterstage_1 {
          label="Stage 1";
          subgraph cluster7 {
            label="TungstenAggregate";
            9 [label="MapPartitionsRDD [9]\nrun at ThreadPoolExecutor.java:1142"];
          }
          subgraph cluster10 {
            label="HiveTableScan";
            7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
            6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
            5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
          }
          subgraph cluster10 {
            label="HiveTableScan";
            7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
            6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
            5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
          }
          subgraph cluster8 {
            label="ConvertToUnsafe";
            8 [label="MapPartitionsRDD [8]\nrun at ThreadPoolExecutor.java:1142"];
          }
          subgraph cluster10 {
            label="HiveTableScan";
            7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
            6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
            5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
          }
        }
        8->9;
        6->7;
        5->6;
        7->8;
      }
      

      Hive has a large number of RDDs because it creates an RDD for each partition in the scan returned by the metastore. Each RDD in the scan results in another copy of its entire cluster in the output. The data is built with a StringBuilder and copied into a String, so the memory required grows huge quickly.
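
      A rough back-of-the-envelope sketch of why this blows up, using hypothetical sizes (the ~80 characters per RDD label line is an assumption, not a measurement): the duplicated cluster contains one label line per RDD and is re-emitted once per RDD, so the output grows quadratically with the number of partitions.

      // Hypothetical sizing only: assumes ~80 characters per RDD label line.
      object DotSizeEstimate extends App {
        val bytesPerRddLine = 80L
        // n RDD lines per cluster copy, and one cluster copy per RDD => O(n^2).
        def estimatedDotBytes(numPartitions: Int): Long =
          numPartitions.toLong * numPartitions * bytesPerRddLine
        // ~8 KB at 10 partitions, ~80 MB at 1,000, ~8 GB at 10,000.
        for (n <- Seq(10, 1000, 10000))
          println(s"$n partitions -> ${estimatedDotBytes(n)} bytes")
      }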

      The cause is how the RDDOperationGraph gets generated. For each RDD, a nested chain of RDDOperationCluster instances is produced, and those chains are merged. But there is no implementation of equals for RDDOperationCluster, so clusters representing the same scope are always treated as distinct and accumulate rather than getting deduplicated.
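
      One way to address this is to give RDDOperationCluster structural equals/hashCode so merging can deduplicate. A minimal sketch of the idea, using a simplified stand-in class (the real org.apache.spark.ui.scope.RDDOperationCluster also carries child clusters and attached RDD nodes):

      // Simplified stand-in with no equals override, mirroring the bug.
      class ClusterByRef(val id: String, val name: String)

      class ClusterById(val id: String, val name: String) {
        // Structural equality keyed on the cluster id, so chains built
        // from the same scope compare equal and can be deduplicated.
        override def equals(other: Any): Boolean = other match {
          case that: ClusterById => id == that.id
          case _                 => false
        }
        override def hashCode(): Int = id.hashCode
      }

      object DedupeDemo extends App {
        // Without equals, distinct falls back to reference equality and
        // keeps both copies -- the duplicated cluster10 above.
        val byRef = Seq(new ClusterByRef("cluster10", "HiveTableScan"),
                        new ClusterByRef("cluster10", "HiveTableScan"))
        println(byRef.distinct.size)  // 2

        // With equals/hashCode defined, the duplicates collapse to one.
        val byId = Seq(new ClusterById("cluster10", "HiveTableScan"),
                       new ClusterById("cluster10", "HiveTableScan"))
        println(byId.distinct.size)   // 1
      }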

Attachments

Activity

People

    Assignee: Ryan Blue (rdblue)
    Reporter: Ryan Blue (rdblue)
    Votes: 0
    Watchers: 3

Dates

    Created:
    Updated:
    Resolved: