[HIVE-8851] Broadcast files for small tables via SparkContext.addFile() and SparkFiles.get() [Spark Branch] - ASF JIRA

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: spark-branch
Component/s: Spark
Labels: None

Description

Currently, files generated by SparkHashTableSinkOperator for small tables are written directly to HDFS with a high replication factor. When a map join runs, the map join operator loads these files into hash tables. Since multiple partitions can be processed on the same worker node, reading the same set of files multiple times is not ideal. An improvement would be to call SparkContext.addFile() on these files and use SparkFiles.get() to download them to the worker node just once.

Please note that SparkFiles.get() is a static method. Code invoking this method needs to be in a static method, and that calling method needs to be synchronized because it may be called from different threads.
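
A minimal sketch of the static, synchronized wrapper described above; it is not taken from the attached patches. The class name `SmallTableFiles`, the method `localPath`, and the `resolver` parameter are hypothetical. `resolver` stands in for `SparkFiles.get(String)` so that the sketch compiles without Spark on the classpath; in a real job one would presumably pass `SparkFiles::get`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: resolves each broadcast file at most once per JVM.
final class SmallTableFiles {
    // File name -> local path, filled the first time any task thread asks for it.
    private static final Map<String, String> LOCAL_PATHS = new HashMap<>();

    private SmallTableFiles() {}

    // Static and synchronized, because several task threads on the same
    // worker may ask for the same file concurrently.
    // `resolver` is an assumption standing in for SparkFiles.get(String).
    static synchronized String localPath(String fileName,
                                         Function<String, String> resolver) {
        String path = LOCAL_PATHS.get(fileName);
        if (path == null) {
            path = resolver.apply(fileName);
            LOCAL_PATHS.put(fileName, path);
        }
        return path;
    }
}
```

With this shape, every partition processed on the same worker would reuse the cached local path instead of resolving the file again.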

Attachments

HIVE-8851.1-spark.patch, 01/Dec/14 17:35, 15 kB, Jimmy Xiang
HIVE-8851.2-spark.patch, 06/Mar/15 22:19, 18 kB, Jimmy Xiang

Issue Links

depends upon: SPARK-4687 SparkContext#addFile doesn't keep file folder information (Resolved)

is related to: HIVE-10302 Load small tables (for map join) in executor memory only once [Spark Branch] (Closed)

People

Assignee: Jimmy Xiang

Reporter: Xuefu Zhang

Votes: 0

Watchers: 5

Dates

Created: 13/Nov/14 04:48

Updated: 21/Oct/22 07:31