Crunch
  1. Crunch
  2. CRUNCH-133

Add Aggregator support for combineValues ops on secondary keys via maps and collections

    Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Sawzall has a neat trick where you can do aggregations on secondary keys via maps, which is useful in cases where you might want to aggregate some data at (for example) both a country and at a city level within a single MapReduce job. We had a thread on crunch-user about this pattern:

      http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201212.mbox/%3CCAH29n6O-aHXTPHCRpSuAkAGUjvDR%3D56%3D-OLq9K9mZje%2BwVB4-Q%40mail.gmail.com%3E

      The pattern ends up looking something like this:

      // Define a table that has long values at both the K and the <K, String> levels.
      PTable<K, Pair<Long, Map<String, Long>>> in = ...;

      // Define and apply an Aggregator that can handle sums at both levels within a single MR job.
      Aggregator<Pair<Long, Map<String, Long>>> a = pairAggregator(SUM_LONGS(), map(Aggregators.SUM_LONGS()));
      PTable<K, Pair<Long, Map<String, Long>>> out = in.groupByKey().combineValues(a);

      ...which would run substantially faster than executing two dependent MR jobs, one that did the city aggregation and then a second follow-up job that did the country aggregation.

        Activity

        Hide
        Gabriel Reid added a comment -

        Yep, just using AggregatorFactory for the collections and maps methods sounds good to me. And yeah, having lambdas in Java 8 will come in handy for stuff like this.

        Show
        Gabriel Reid added a comment - Yep, just using AggregatorFactory for the collections and maps methods sounds good to me. And yeah, having lambdas in Java 8 will come in handy for stuff like this.
        Hide
        Josh Wills added a comment -

        Gabriel, thanks for the review. I agree w/the issues you point out, and that the complexity this patch introduces isn't clearly worth the benefit.

        We could go back to returning AggregatorFactory instances instead of Aggregator instances from the factory methods, but again, that imposes a cognitive cost that may not be worthwhile for the one use case. it would seem simpler to me to limit the collections() and maps() aggregator methods to taking in AggregatorFactory instances and leaving everything else (i.e., the general case) alone. This is one of those times where Java's lack of first-class functions is a real pain.

        Show
        Josh Wills added a comment - Gabriel, thanks for the review. I agree w/the issues you point out, and that the complexity this patch introduces isn't clearly worth the benefit. We could go back to returning AggregatorFactory instances instead of Aggregator instances from the factory methods, but again, that imposes a cognitive cost that may not be worthwhile for the one use case. it would seem simpler to me to limit the collections() and maps() aggregator methods to taking in AggregatorFactory instances and leaving everything else (i.e., the general case) alone. This is one of those times where Java's lack of first-class functions is a real pain.
        Hide
        Gabriel Reid added a comment -

        Sorry for the super slow uptake on this Josh. I started looking at this last week, and it took a while to grok it, and by then I was out of time to really review it – and then it just didn't happen. Thanks for the reminder anyhow.

        I've added some comments (or actually repeated the same comment over and over) on RB, but I also wanted to comment on the general approach here.

        I can get that this is a use case that could be useful, but (in what seems to be a common thread for me lately) I'm a bit hesitant to add the arity and especially the copy methods to Aggregator, as these seem very specific to this use case, and it seems a shame to add two methods to what is otherwise a pretty simple and clean interface for this one use case.

        I was thinking (although I'm not totally sure) that the Aggregator interface could possibly be split out into two interfaces, something like UnaryAggregator (which doesn't include an arity method, and could probably just be called Aggregator) and an ArityNAggregator, which would have an arity method (and would need a better name than ArityNAggregator). The CollectionAggregator and MapAggregator would then only use UnaryAggregators. Anyhow, it's just an idea, and I might be missing some important details.

        I'm actually more worried about the copy method, because it feels like that can be a risky/tricky one to do properly. As I mentioned on RB, I think that a call to copy on a decorating Aggregator should probably call copy on the underlying Aggregators instead of using the same reference. Even then, I worry about issues around transient state that comes from Configurations being properly copied in implementations of copy, etc.

        As an alternative, how about just using an AggregatorFactory (or UnaryAggregatorFactory) in the MapAggregator and CollectionAggregator? Then there's no need to call copy everywhere.

        Show
        Gabriel Reid added a comment - Sorry for the super slow uptake on this Josh. I started looking at this last week, and it took a while to grok it, and by then I was out of time to really review it – and then it just didn't happen. Thanks for the reminder anyhow. I've added some comments (or actually repeated the same comment over and over) on RB, but I also wanted to comment on the general approach here. I can get that this is a use case that could be useful, but (in what seems to be a common thread for me lately) I'm a bit hesitant to add the arity and especially the copy methods to Aggregator, as these seem very specific to this use case, and it seems a shame to add two methods to what is otherwise a pretty simple and clean interface for this one use case. I was thinking (although I'm not totally sure) that the Aggregator interface could possibly be split out into two interfaces, something like UnaryAggregator (which doesn't include an arity method, and could probably just be called Aggregator) and an ArityNAggregator, which would have an arity method (and would need a better name than ArityNAggregator). The CollectionAggregator and MapAggregator would then only use UnaryAggregators. Anyhow, it's just an idea, and I might be missing some important details. I'm actually more worried about the copy method, because it feels like that can be a risky/tricky one to do properly. As I mentioned on RB, I think that a call to copy on a decorating Aggregator should probably call copy on the underlying Aggregators instead of using the same reference. Even then, I worry about issues around transient state that comes from Configurations being properly copied in implementations of copy, etc. As an alternative, how about just using an AggregatorFactory (or UnaryAggregatorFactory) in the MapAggregator and CollectionAggregator? Then there's no need to call copy everywhere.
        Hide
        Josh Wills added a comment -

        @Gabriel/Matthias-- opinions on this one? I like it because it's something that is hard to do in Pig or Hive, but very natural in Crunch.

        Show
        Josh Wills added a comment - @Gabriel/Matthias-- opinions on this one? I like it because it's something that is hard to do in Pig or Hive, but very natural in Crunch.
        Hide
        Gabriel Reid added a comment -

        Change to New Feature (from Bug) so that it comes up in the right place in the release notes later on.

        Show
        Gabriel Reid added a comment - Change to New Feature (from Bug) so that it comes up in the right place in the release notes later on.
        Hide
        Josh Wills added a comment -

        First pass at this, will send it out on ReviewBoard as well.

        Show
        Josh Wills added a comment - First pass at this, will send it out on ReviewBoard as well.

          People

          • Assignee:
            Unassigned
            Reporter:
            Josh Wills
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development