Spark / SPARK-21795

Broadcast hint ignored when dataframe is cached

Details

    • Type: Question
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Documentation, SQL
    • Labels: None

    Description

      I'm not sure whether this is a bug or by design, but if a DF is cached, the broadcast hint is ignored and Spark uses SortMergeJoin.

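      import org.apache.spark.sql.functions.broadcast

      val largeDf = ...
      val smallDf = ...
      // cache() returns the same Dataset, so no reassignment of the val is needed
      smallDf.cache()

      largeDf.join(broadcast(smallDf))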

      It makes sense that there is no need to cache a DF that is only used in a broadcast join. However, I wonder whether it is correct for Spark to ignore the broadcast hint just because the DF is cached. Consider a DF that is cached for several queries and needs to be broadcast in some of them.

      If this is the correct behavior, it is at least worth documenting that a cached DF cannot be broadcast.
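
      One way to check which join strategy is actually chosen is to inspect the physical plan. The following is a minimal sketch, not the exact job from this report: the object name BroadcastHintCheck, the helper joinOperatorIn, the column name "id", and the row counts are all made up for illustration. It sets spark.sql.autoBroadcastJoinThreshold to -1 so that size-based automatic broadcasting is off and only the hint can trigger a broadcast join. The result may differ between Spark versions, so no expected output is shown.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

// Hypothetical helper object, for illustration only.
object BroadcastHintCheck {
  // Returns the first known join operator name found in a physical plan string.
  def joinOperatorIn(planText: String): Option[String] =
    Seq("BroadcastHashJoin", "SortMergeJoin", "ShuffledHashJoin", "BroadcastNestedLoopJoin")
      .find(op => planText.contains(op))

  // Renders the executed (physical) plan of a DataFrame and looks for a join operator.
  def joinStrategy(df: DataFrame): Option[String] =
    joinOperatorIn(df.queryExecution.executedPlan.toString)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-hint-check")
      // Disable size-based automatic broadcast so that only the hint matters.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()

    // Made-up stand-ins for the large and small DataFrames in the description.
    val largeDf = spark.range(0L, 1000000L).toDF("id")
    val smallDf = spark.range(0L, 100L).toDF("id")

    // Baseline: hint on an uncached DataFrame.
    val uncachedJoin = largeDf.join(broadcast(smallDf), Seq("id"))
    println(s"uncached: ${joinStrategy(uncachedJoin)}")

    // Same join after caching and materializing the small side.
    smallDf.cache()
    smallDf.count()
    val cachedJoin = largeDf.join(broadcast(smallDf), Seq("id"))
    cachedJoin.explain()
    println(s"cached:   ${joinStrategy(cachedJoin)}")

    spark.stop()
  }
}
```

      If the second line reports SortMergeJoin while the first reports BroadcastHashJoin, the hint is being dropped for the cached DF as described above.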

            People

              Assignee: Unassigned
              Reporter: Lior Chaga (liorchaga)
              Votes: 1
              Watchers: 6
