Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1211

Online aggregation and continuous query support

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: task
    • Labels:
      None

      Description

      The purpose of this post is to propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We have built a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop, and can run unmodified user-defined MapReduce programs.

      For more information on the HOP design, please see our technical report.
      http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html

      Further details are discussed in the following blog posts.
      http://databeta.wordpress.com/2009/10/18/mapreduce-online/
      http://radar.oreilly.com/2009/10/pipelining-and-real-time-analytics-with-mapreduce-online.html
      http://dbmsmusings.blogspot.com/2009/10/analysis-of-mapreduce-online-paper.html

      The HOP code has been published at the following location.
      http://code.google.com/p/hop/

        Activity

        Hide
        Neil Conway added a comment -

        BTW, there are a few different ways to slice this work:

        (1) Basic pipelining support for moving data between tasks
        (2) Pipelining data between jobs (do the HDFS write in the background)
        (3) Leveraging #1 and #2 to do "online aggregation" (approximate answers as the query runs)
        (4) Leveraging #1 and #2 to do "stream processing" (MR jobs that run continuously)

        These are in increasing order of complexity/invasiveness. If there's any interest in merging this work, we might begin by just merging #1, which is relatively low-impact. Intra-job pipelining offers two benefits: (1) potentially reduced response times (better cluster utilization), by overlapping map computation with network I/O and reducer-side merging (2) more effective straggler handling (rather than starting a speculative task from scratch, we can use pipelining to "checkpoint" the partial work done by the straggler, and not bother re-sending that portion of the map output over the network again).

        Show
        Neil Conway added a comment - BTW, there are a few different ways to slice this work: (1) Basic pipelining support for moving data between tasks (2) Pipelining data between jobs (do the HDFS write in the background) (3) Leveraging #1 and #2 to do "online aggregation" (approximate answers as the query runs) (4) Leveraging #1 and #2 to do "stream processing" (MR jobs that run continuously) These are in increasing order of complexity/invasiveness. If there's any interest in merging this work, we might begin by just merging #1, which is relatively low-impact. Intra-job pipelining offers two benefits: (1) potentially reduced response times (better cluster utilization), by overlapping map computation with network I/O and reducer-side merging (2) more effective straggler handling (rather than starting a speculative task from scratch, we can use pipelining to "checkpoint" the partial work done by the straggler, and not bother re-sending that portion of the map output over the network again).
        Hide
        Otis Gospodnetic added a comment -

        Out of curiosity, are there any plans to either:

        • get HOP or HOP-like functionality into Hadoop
        • get HOP to the latest version of Hadoop
          Thanks!
        Show
        Otis Gospodnetic added a comment - Out of curiosity, are there any plans to either: get HOP or HOP-like functionality into Hadoop get HOP to the latest version of Hadoop Thanks!
        Hide
        Edward Capriolo added a comment -

        Why not close this? Seems zombified.

        Show
        Edward Capriolo added a comment - Why not close this? Seems zombified.

          People

          • Assignee:
            Unassigned
            Reporter:
            Tyson Condie
          • Votes:
            0 Vote for this issue
            Watchers:
            26 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development