Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Having a cache on mappers and reducers could be very useful for some use cases, including but not limited to:

      1. Iterative MapReduce programs: Some machine learning algorithms (see Mahout) need access to invariant data on every MapReduce iteration until convergence. A cache on the mapper and reducer nodes could give easy access to the hot set of data without going all the way to the distributed cache. This optimization is described by Jimmy Lin et al. in the paper "Low-Latency, High-Throughput Access to Static Global Resources within the Hadoop Framework" (http://hcil2.cs.umd.edu/trs/2009-01/2009-01.pdf)

      2. Storing intermediate map outputs in memory to reduce shuffle time. This optimization is discussed at length in the HaLoop paper (http://www.ics.uci.edu/~yingyib/papers/HaLoop_camera_ready.pdf), and by Shubin Zhang in "Accelerating MapReduce with Distributed Memory Cache", presented at ICPADS 2009.

      There are some other scenarios as well where having a cache could come in handy.

      JSR 107 aims to standardize caching interfaces for Java applications, and popular caching solutions such as Ehcache and Memcached have JSR 107 wrappers. Hence, it would be nice to have some sort of pluggable support for JSR 107-compliant caches in Hadoop.
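
      As a rough illustration of what a pluggable task-side cache could look like, here is a minimal sketch. Everything in it is hypothetical: TaskCache, LruTaskCache and CachedLookup are names invented for this example. They are neither Hadoop classes nor the javax.cache API, and the in-memory LRU map merely stands in for whatever JSR 107 provider would be plugged in. The sketch is single-threaded on the lookup side, matching one task processing one record at a time.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class CacheSketch {

    /** Hypothetical minimal cache contract; a real integration would use a JSR 107 cache here. */
    interface TaskCache<K, V> {
        V get(K key);
        void put(K key, V value);
    }

    /** Stand-in provider: a bounded in-memory LRU cache built on LinkedHashMap. */
    static final class LruTaskCache<K, V> implements TaskCache<K, V> {
        private final Map<K, V> entries;

        LruTaskCache(final int maxEntries) {
            // accessOrder=true keeps the least recently used entry first.
            this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > maxEntries;
                }
            };
        }

        @Override
        public synchronized V get(K key) {
            return entries.get(key);
        }

        @Override
        public synchronized void put(K key, V value) {
            entries.put(key, value);
        }
    }

    /** What a mapper or reducer might do: consult the cache before an expensive load. */
    static final class CachedLookup<K, V> {
        private final TaskCache<K, V> cache;
        private final Function<K, V> slowLoader; // e.g. a read of invariant side data
        private int loads = 0;

        CachedLookup(TaskCache<K, V> cache, Function<K, V> slowLoader) {
            this.cache = cache;
            this.slowLoader = slowLoader;
        }

        V lookup(K key) {
            V value = cache.get(key);
            if (value == null) {
                // Cache miss: fall back to the slow path and remember the result.
                value = slowLoader.apply(key);
                loads++;
                cache.put(key, value);
            }
            return value;
        }

        int loadCount() {
            return loads;
        }
    }

    public static void main(String[] args) {
        TaskCache<String, Integer> cache = new LruTaskCache<>(2);
        CachedLookup<String, Integer> lookup = new CachedLookup<>(cache, String::length);

        lookup.lookup("invariant"); // miss, goes to the loader
        lookup.lookup("invariant"); // hit, served from the cache
        System.out.println("slow loads: " + lookup.loadCount());
    }
}
```

      Because the task code depends only on the small interface, swapping the LRU map for an Ehcache- or Memcached-backed implementation would not change the lookup logic, which is the kind of plugging point this issue asks for.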

        Activity

        Weiming Shi added a comment -

        Can we leverage the work of Guava?

        Dhruv Kumar added a comment -

        For the impatient, I have uploaded a presentation about Haloop which I gave some time back in graduate school: http://www.slideserve.com/dkumar/optimizing-iterative-mapreduce-jobs

        Dhruv Kumar added a comment -

        Ahmed, definitely. Another advantage of a larger, pluggable MapOutputBuffer is a potential reduction in speculative execution on other nodes, which should improve network performance on unbalanced clusters.

        Kapil, the HaLoop paper which I linked in this JIRA describes storing intermediate map results for consumption by reducers. You can find their Apache-licensed code on Google Code if you want to dive into the specifics.

        Here's another related use case of using Memcached (or any other caching layer) with Hadoop, although this is a slightly different "plugging" point: http://www.slideserve.com/layne/mapreduce-and-databases.

        Ahmed Radwan added a comment -

        Thanks, Dhruv. This is interesting. I think the work on a pluggable MapOutputBuffer and shuffle could greatly facilitate such an effort.

        kapil bhosale added a comment -

        How can we use a distributed cache (Memcached) to store intermediate results after the map phase, so that the reduce phase can read them from the cache?

        Greg Luck added a comment -

        I am the spec lead for JSR 107. It is at version 0.5 right now in Maven Central. By coding against it, you can allow all of the caching implementations to act as caching providers.

        Dhruv Kumar added a comment -

        From the email thread on the Hadoop User mailing list:

        -----------------------------------------

        Please open a jira, we can discuss there.

        thanks,
        Arun

        Arun C. Murthy
        Hortonworks Inc.
        http://hortonworks.com/

        On Aug 14, 2012, at 2:22 PM, Dhruv wrote:

        Have there been any attempts to integrate JSR 107 compliant caches on mappers and reducers?

        There are some use cases where this will be beneficial, but I couldn't find any suitable plugging points for a cache on mappers or reducers without modifying the framework's code itself.

        I work for Terracotta Software. We have a JSR 107 wrapper for Ehcache and were wondering whether the community would be interested in accepting a patch for such an integration.

        Thanks,

        Dhruv


          People

          • Assignee:
            Unassigned
            Reporter:
            Dhruv Kumar
          • Votes:
            7
            Watchers:
            34
