Details
-
Sub-task
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
Description
There are a big performance difference in join between spark and mr mode.
daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float); divs = load './NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); jnd = join daily by (exchange, symbol), divs by (exchange, symbol); store jnd into './join.out';
join.sh
mode=$1
start=$(date +%s)
./pig -x $mode $PIG_HOME/bin/join.pig
end=$(date +%s)
execution_time=$(( $end - $start ))
echo "execution_time:"$excution_time
The execution time:
mr | spark | |
---|---|---|
join | 20 sec | 79 sec |
You can download the test data NYSE_daily and NYSE_dividends in https://github.com/alanfgates/programmingpig/blob/master/data/
Attachments
Attachments
Issue Links
- relates to
-
PIG-4871 Not use OperatorPlan#forceConnect in MultiQueryOptimizationSpark
- Closed
- links to