Description
Broadcast join produces incorrect columns in join result, see below for an example. The same join but without using broadcast gives the correct columns.
Running PySpark on YARN on Amazon EMR 5.0.0.
import pyspark.sql.functions as func keys = [ (54000000, 0), (54000001, 1), (54000002, 2), ] keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1) keys_df.show() # +--------+-----+ # | key_id|value| # +--------+-----+ # |54000000| 0| # |54000001| 1| # |54000002| 2| # +--------+-----+ data = [ (54000002, 1), (54000000, 2), (54000001, 3), ] data_df = spark.createDataFrame(data, ['key_id', 'foo']) data_df.show() # +--------+---+ # | key_id|foo| # +--------+---+ # |54000002| 1| # |54000000| 2| # |54000001| 3| # +--------+---+ ### INCORRECT ### data_df.join(func.broadcast(keys_df), 'key_id').show() # +--------+---+--------+ # | key_id|foo| value| # +--------+---+--------+ # |54000002| 1|54000002| # |54000000| 2|54000000| # |54000001| 3|54000001| # +--------+---+--------+ ### CORRECT ### data_df.join(keys_df, 'key_id').show() # +--------+---+-----+ # | key_id|foo|value| # +--------+---+-----+ # |54000000| 2| 0| # |54000001| 3| 1| # |54000002| 1| 2| # +--------+---+-----+