Hadoop Common / HADOOP-908

Hadoop Abacus, a package for performing simple counting/aggregation

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11.0
    • Component/s: None
    • Labels:
      None

      Description

      The Hadoop Abacus package is a specialization of the map/reduce framework
      for performing various counting and aggregation tasks.
      It offers functionality similar to Google's Sawzall.

      Generally speaking, to implement an application using the Map/Reduce model,
      the developer needs to implement Map and Reduce functions (and possibly a Combine function).
      However, for many applications related to counting and computing statistics,
      these functions have very similar characteristics.
      Abacus abstracts out the general patterns and provides a package implementing them.
      In particular, the package provides a generic mapper class, a reducer class, and a combiner class,
      along with a set of built-in value aggregators. It also provides a generic utility class, ValueAggregatorJob,
      for creating Abacus jobs.

      To create an Abacus job, the user only needs to implement one plugin class that
      specifies which aggregators to use and which values go to which aggregators.
      The mapper calls this class at runtime to generate aggregation ids and values.
      The generic combiner and reducer then aggregate the values associated with the same
      aggregation id. Thus, it is much easier to create and run an Abacus job than
      a normal map/reduce job. Since a built-in generic combiner is always used, values are pre-aggregated on the map side, which keeps execution efficient.
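
The plugin pattern described above can be illustrated with a small, Hadoop-free sketch. It is not the code from abacus.patch: the interface `AggregationDescriptor`, the class names, and the `LongSum:` id prefix are all hypothetical and exist only to show how a single user-supplied class can drive a generic map, combine, and reduce pipeline.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plugin interface (an assumption, not the Abacus API):
// for each input record, return (aggregation id, value) pairs.
interface AggregationDescriptor {
    List<Map.Entry<String, Long>> generatePairs(String record);
}

// Example plugin: one pair per word, which amounts to a word count.
class WordCountDescriptor implements AggregationDescriptor {
    @Override
    public List<Map.Entry<String, Long>> generatePairs(String record) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : record.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // "LongSum:" is an illustrative prefix naming the aggregator.
                pairs.add(new AbstractMap.SimpleEntry<>("LongSum:" + word, 1L));
            }
        }
        return pairs;
    }
}

class AbacusSketch {
    // Generic map step plus local pre-aggregation (the combiner's role).
    static Map<String, Long> mapAndCombine(AggregationDescriptor descriptor,
                                           List<String> records) {
        Map<String, Long> partial = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Long> e : descriptor.generatePairs(record)) {
                partial.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return partial;
    }

    // Generic reduce step: merge the partial sums from every input split.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((id, value) -> total.merge(id, value, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        AggregationDescriptor descriptor = new WordCountDescriptor();
        Map<String, Long> split1 = mapAndCombine(descriptor, Arrays.asList("a b a", "c"));
        Map<String, Long> split2 = mapAndCombine(descriptor, Arrays.asList("b a"));
        System.out.println(reduce(Arrays.asList(split1, split2)));
    }
}
```

Only `WordCountDescriptor` is application-specific here; the map, combine, and reduce steps stay generic, which is the division of labor the description attributes to Abacus.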

      1. abacus.patch
        68 kB
        Runping Qi

        Issue Links

          Activity

          Runping Qi created issue -
          Runping Qi made changes -
          Field Original Value New Value
          Assignee Runping Qi [ runping ]
          Runping Qi made changes -
          Attachment abacus.patch [ 12349218 ]
          Runping Qi added a comment -

          The attached patch contains the package for Hadoop Abacus

          Runping Qi made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          +1, because http://issues.apache.org/jira/secure/attachment/12349218/abacus.patch applied and successfully tested against trunk revision r497583.

          Doug Cutting added a comment -

          This looks great!

          It would be good to add a package.html in the sources, with a description of abacus. Also the top-level build.xml should be modified so that abacus's javadoc is included as a "contrib: Abacus" group.

          Doug Judd added a comment -

          One issue (or at least I assume it is an issue) that I'd like to see taken care of in this toolkit is the following. You do a big crawl of a bunch of pages and want to perform a link count computation and then do a (reverse) sort by count. The problem is that the link counts follow a Zipfian distribution where there is a long tail of links of count 1 or 2. Conceptually, you can imagine situations where you literally have 1 billion links of count 1, making it infeasible to pass them all into a single reduce call.

          To get around this situation, I've created a TaggedLongWritable class. It contains a Long and a string tag (the tag in the above case would be the link/URL). The comparison function first compares the Long and then if they match, compares the tag. This way, you get a numeric comparison, but two keys don't match if their tags are different.
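
The comparison just described can be sketched as follows. The TaggedLongWritable source is not attached to this issue, so the class below is a hypothetical stand-in named `TaggedLong`; it shows only the ordering logic and omits the Writable serialization methods a real Hadoop key would need.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for the described key: a long count plus a string tag.
final class TaggedLong implements Comparable<TaggedLong> {
    final long count;
    final String tag;

    TaggedLong(long count, String tag) {
        this.count = count;
        this.tag = Objects.requireNonNull(tag);
    }

    // Compare numerically by count first; fall back to the tag on ties,
    // so two keys are equal only when both count and tag match.
    @Override
    public int compareTo(TaggedLong other) {
        int byCount = Long.compare(count, other.count);
        return byCount != 0 ? byCount : tag.compareTo(other.tag);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TaggedLong)) {
            return false;
        }
        TaggedLong other = (TaggedLong) o;
        return count == other.count && tag.equals(other.tag);
    }

    @Override
    public int hashCode() {
        return Objects.hash(count, tag);
    }

    @Override
    public String toString() {
        return count + "\t" + tag;
    }
}

class TaggedLongDemo {
    // Reverse sort by count, as in the link-count use case.
    static List<TaggedLong> sortedDescending(List<TaggedLong> keys) {
        List<TaggedLong> copy = new ArrayList<>(keys);
        copy.sort(Collections.reverseOrder());
        return copy;
    }

    public static void main(String[] args) {
        List<TaggedLong> keys = Arrays.asList(
                new TaggedLong(1, "b.example"),
                new TaggedLong(5, "a.example"),
                new TaggedLong(1, "a.example"));
        for (TaggedLong key : sortedDescending(keys)) {
            System.out.println(key);
        }
    }
}
```

Because keys with the same count but different tags compare as unequal, each link forms its own key group instead of every count-1 link landing in one group.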

          Runping Qi added a comment -


          An updated patch is available.

          Runping Qi made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Runping Qi made changes -
          Attachment abacus.patch [ 12349218 ]
          Runping Qi made changes -
          Attachment abacus.patch [ 12349221 ]
          Runping Qi added a comment -

          A new patch with package.html for the abacus package and
          an updated build.xml that includes javadoc for abacus.

          Runping Qi made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Doug Cutting added a comment -

          I just committed this. Thanks, Runping!

          Doug Cutting made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 0.11.0 [ 12312257 ]
          Resolution Fixed [ 1 ]
          Doug Cutting made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Owen O'Malley made changes -
          Component/s contrib/streaming [ 12310972 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to HADOOP-1547 [ HADOOP-1547 ]

            People

            • Assignee: Runping Qi
            • Reporter: Runping Qi
            • Votes: 0
            • Watchers: 2
