  1. Pig
  2. PIG-4856 Optimization for pig on spark
  3. PIG-5024

add a physical operator to broadcast small RDDs

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: spark-branch
    • Component/s: spark
    • Labels: None

    Description

      Currently, when some kinds of JOIN are optimized, the index or sample files are saved to HDFS. Setting their replication factor to a larger number makes HDFS serve as a distributed cache.

      Spark's broadcast mechanism is suitable for this. It seems that we can add a physical operator to broadcast small RDDs.
      This will benefit the optimization of some specialized Joins, such as Skewed Join, Replicated Join and so on.
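
      The idea behind the proposal can be illustrated with a minimal, Spark-free sketch of a replicated (map-side) join. It is not the code from the attached patches. The class name BroadcastJoinSketch and the helpers buildLookup and joinPartition are hypothetical names chosen for illustration. The sketch assumes the small relation fits in memory as a hash map; in a Spark job that map would be shipped to executors once with JavaSparkContext.broadcast(...) and read through Broadcast.value(), whereas here it is simply passed as an argument.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: models a replicated join in plain Java.
public class BroadcastJoinSketch {

    // Turn the small relation (rows of [key, value]) into a lookup table.
    // This table stands in for the value that would be broadcast in Spark.
    static Map<String, List<String>> buildLookup(List<String[]> smallRelation) {
        Map<String, List<String>> lookup = new HashMap<>();
        for (String[] row : smallRelation) {
            lookup.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return lookup;
    }

    // Join one partition of the large relation against the lookup table.
    // No shuffle is needed because every partition sees the whole small relation.
    static List<String> joinPartition(List<String[]> partition,
                                      Map<String, List<String>> lookup) {
        List<String> joined = new ArrayList<>();
        for (String[] row : partition) {
            List<String> matches = lookup.getOrDefault(row[0], Collections.emptyList());
            for (String match : matches) {
                joined.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Small relation: would be collected and broadcast once.
        List<String[]> small = Arrays.asList(
                new String[] {"k1", "a"},
                new String[] {"k2", "b"});
        Map<String, List<String>> lookup = buildLookup(small);

        // Large relation, already split into two partitions.
        List<String[]> partition0 = Arrays.asList(
                new String[] {"k1", "x"},
                new String[] {"k3", "y"});
        List<String[]> partition1 = Arrays.asList(
                new String[] {"k2", "z"},
                new String[] {"k1", "w"});

        // Each partition is joined independently against the same lookup table.
        for (List<String[]> partition : Arrays.asList(partition0, partition1)) {
            for (String line : joinPartition(partition, lookup)) {
                System.out.println(line);
            }
        }
    }
}
```

      Because every partition probes the same in-memory table, the large relation never has to be repartitioned by key, which is the benefit the description expects for joins such as Replicated Join.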

      Attachments

        1. PIG-5024.patch
          43 kB
          Xianda Ke
        2. PIG-5024_6.patch
          9 kB
          Xianda Ke
        3. PIG-5024_5.patch
          9 kB
          Xianda Ke
        4. PIG-5024_4.patch
          9 kB
          Xianda Ke
        5. PIG-5024_3.patch
          9 kB
          Xianda Ke
        6. PIG-5024_2.patch
          9 kB
          Xianda Ke

            People

              Assignee: Xianda Ke (kexianda)
              Reporter: Xianda Ke (kexianda)
              Votes: 0
              Watchers: 3
