  1. Hive
  2. HIVE-8699 Enable support for common map join [Spark Branch]
  3. HIVE-8851

Broadcast files for small tables via SparkContext.addFile() and SparkFiles.get() [Spark Branch]


Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • spark-branch
    • Spark
    • None

    Description

      Currently, the files that SparkHashTableSinkOperator generates for small tables are written directly to HDFS with a high replication factor. When a map join runs, the map join operator loads these files into hash tables. Since multiple partitions can be processed on the same worker node, reading the same set of files multiple times is not ideal. This can be improved by calling SparkContext.addFile() on these files and using SparkFiles.get() to download them to each worker node just once.

      Please note that SparkFiles.get() is a static method. Code invoking it needs to be in a static method, and that calling method needs to be synchronized because it may be called from different threads.
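
      The synchronized static accessor described above could look roughly like the sketch below. This is not the code from the attached patches. The class name SmallTableFileCache, the method localPath, and the resolver parameter are all hypothetical. The resolver stands in for the worker-side SparkFiles.get(fileName) call so that the sketch compiles without Spark on the classpath. It assumes the driver has already registered each file with SparkContext.addFile().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Hive or Spark.
final class SmallTableFileCache {
    // File name -> local path, shared by every task thread in this JVM.
    private static final Map<String, String> LOCAL_PATHS = new HashMap<>();

    private SmallTableFileCache() {}

    /**
     * Returns the local path for a broadcast file, resolving it at most
     * once per JVM. The method is static and synchronized because several
     * task threads on the same worker may ask for the same file.
     *
     * In a real Spark job the resolver would be SparkFiles::get (assumption:
     * the driver called SparkContext.addFile(hdfsPath) beforehand).
     */
    static synchronized String localPath(String fileName,
                                         Function<String, String> resolver) {
        String path = LOCAL_PATHS.get(fileName);
        if (path == null) {
            path = resolver.apply(fileName);   // e.g. SparkFiles.get(fileName)
            LOCAL_PATHS.put(fileName, path);
        }
        return path;
    }
}
```

      With this shape, a map join task would call SmallTableFileCache.localPath(name, SparkFiles::get) before loading the hash table, and repeated calls for the same file name on one worker would reuse the cached path.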

      Attachments

        1. HIVE-8851.1-spark.patch
          15 kB
          Jimmy Xiang
        2. HIVE-8851.2-spark.patch
          18 kB
          Jimmy Xiang


          People

            Assignee: Jimmy Xiang (jxiang)
            Reporter: Xuefu Zhang (xuefuz)
