Mahout
  1. Mahout
  2. MAHOUT-253

Proposal for high performance primitive collections.

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Not a Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Integration
    • Labels:
      None

      Description

      A proposal for template-driven collections library (lists, sets, maps, deques), with specializations for Java primitive types to save memory and increase performance. The "templates" are regular Java classes written with generics and certain "intrinsics", that is blocks replaceable by a regexp-preprocessor. This lets one write the code once, immediately test it (tests are also templates) and generate primitive versions from a single source.

      An additional interesting part is the benchmarking subsystem written on top of JUnit

      There are major differences from the Java Collections API, most notably no interfaces and interface-compatible views over sub-collections or key/value sets. These classes also expose their internal implementation (buffers, addressing, etc.) so that the code can be optimized for a particular use case.
      These motivations are further discussed here, together with an API overview.

      http://www.carrot-search.com/download/hppc/index.html

      I am curious what you think about it. If folks like it, Carrot Search will donate the code to Mahout (or Apache Commons-?) and will maintain it (because we plan to use it in our internal projects anyway).

      1. hppc-1.0-dev.zip
        280 kB
        Dawid Weiss

        Activity

        Hide
        Dawid Weiss added a comment -

        We compared HPPC against a few other collection libraries and think the experienced performance gain is due to the following factors:

        • a different hashing algorithm in hash map implementation (better key distribution); we use murmur hashing as the default hash for keys.
        • very small methods (contracts checked via assertions) inline well in loops.
        • we could rewrite red-hot performance critical code to use collection buffers directly (copying array pointers to local variables, for example).

        Some of these performance benchmarks are executed on the build server at:
        http://builds.carrot2.org/browse/HPPC-BENCHMARK-40/artifact/Benchmarks-Report

        For example this shows various iteration strategies, for instance:
        http://builds.carrot2.org/browse/HPPC-BENCHMARK-40/artifact/Benchmarks-Report/com.carrotsearch.hppc.IterationSpeedBenchmark.methods.html

        I have a presentation slide comparing performance across various libraries, but of course this may be very misleading (the results are heavily architecture and JVM dependent). You can grab the code of everything from SVN and run it on your machines (hppc-others project contains the bigram counting example).

        https://carrot2.svn.sourceforge.net/svnroot/carrot2/labs/hppc/trunk

        As for Maven, HPPC is published from our labs server:
        http://repository.carrotsearch.com/labs/releases

        All available versions are at:
        http://repository.carrotsearch.com/labs/releases/com/carrotsearch/hppc/

        Show
        Dawid Weiss added a comment - We compared HPPC against a few other collection libraries and think the experienced performance gain is due to the following factors: a different hashing algorithm in hash map implementation (better key distribution); we use murmur hashing as the default hash for keys. very small methods (contracts checked via assertions) inline well in loops. we could rewrite red-hot performance critical code to use collection buffers directly (copying array pointers to local variables, for example). Some of these performance benchmarks are executed on the build server at: http://builds.carrot2.org/browse/HPPC-BENCHMARK-40/artifact/Benchmarks-Report For example this shows various iteration strategies, for instance: http://builds.carrot2.org/browse/HPPC-BENCHMARK-40/artifact/Benchmarks-Report/com.carrotsearch.hppc.IterationSpeedBenchmark.methods.html I have a presentation slide comparing performance across various libraries, but of course this may be very misleading (the results are heavily architecture and JVM dependent). You can grab the code of everything from SVN and run it on your machines (hppc-others project contains the bigram counting example). https://carrot2.svn.sourceforge.net/svnroot/carrot2/labs/hppc/trunk As for Maven, HPPC is published from our labs server: http://repository.carrotsearch.com/labs/releases All available versions are at: http://repository.carrotsearch.com/labs/releases/com/carrotsearch/hppc/
        Hide
        Ted Dunning added a comment -

        Dawid, while you are still here, can you comment on how much performance you were able to get by using HPPC?

        Also, is HPPC available via Maven?

        Show
        Ted Dunning added a comment - Dawid, while you are still here, can you comment on how much performance you were able to get by using HPPC? Also, is HPPC available via Maven?
        Hide
        Dawid Weiss added a comment -

        I was thinking about it myself: showing that HPPC has an advantage over colt collections in the real context would prove the integration makes sense. Just throwing in (a lot) of generated code into Mahout and have Colt and HPPC collection classes duplicate each other functionality (even taking into account the difference in design and compatibility with JUC) makes little sense.

        Let's mark this issue invalid; reading the mailing list it seems the network/ Hadoop overhead is currently way larger than number crunching/ data structures. If anybody needs efficient collections, HPPC is still out there and the Apache licensed, so no major headache to import it into one's project.

        Show
        Dawid Weiss added a comment - I was thinking about it myself: showing that HPPC has an advantage over colt collections in the real context would prove the integration makes sense. Just throwing in (a lot) of generated code into Mahout and have Colt and HPPC collection classes duplicate each other functionality (even taking into account the difference in design and compatibility with JUC) makes little sense. Let's mark this issue invalid; reading the mailing list it seems the network/ Hadoop overhead is currently way larger than number crunching/ data structures. If anybody needs efficient collections, HPPC is still out there and the Apache licensed, so no major headache to import it into one's project.
        Hide
        Sean Owen added a comment -

        Colt remains partially-undigested within this codebase. My only concern is that a new mission to integrate HPPC will result in two partially-undigested libraries, both already partly duplicating some core code.

        Is this to replace Colt? if someone truly replaces all similar code and it all works, I'm completely into it.

        If the idea is just to put HPPC into Mahout to host it as an independent module, I personally am not sure that is worth the effort.

        Show
        Sean Owen added a comment - Colt remains partially-undigested within this codebase. My only concern is that a new mission to integrate HPPC will result in two partially-undigested libraries, both already partly duplicating some core code. Is this to replace Colt? if someone truly replaces all similar code and it all works, I'm completely into it. If the idea is just to put HPPC into Mahout to host it as an independent module, I personally am not sure that is worth the effort.
        Hide
        Dawid Weiss added a comment -

        @Sean: Colt collections provide an implementation of Java util collections, HPPC has a different design (method names are mostly the same though). In our benchmarks HPPC turns to be faster for large number-crunching tasks; for small collections or operation counts the difference is small.

        Show
        Dawid Weiss added a comment - @Sean: Colt collections provide an implementation of Java util collections, HPPC has a different design (method names are mostly the same though). In our benchmarks HPPC turns to be faster for large number-crunching tasks; for small collections or operation counts the difference is small.
        Hide
        Dawid Weiss added a comment -

        I'm here, I'm here. Apologies for not following the core development, various reasons behind it. I also wanted to wait for the HPPC API to stabilize a bit and to get tested in real production scenarios. I believe this part is over – we replaced PCJ with HPPC in our code and it works very well, especially the open internals give a lot of freedom in performance-critical loops. I realize open internals may become a problem should anything change, but I think weighting the options it was worth it.

        So... I'd love to contribute HPPC to Mahout and proceed with code maintenance as part of Mahout. The current version of HPPC (0.3.1) is stable. There are 4 JIRA issues in our bug tracking system, but they are features, not bugs. How do I proceed (and are folks still interested in integrating this into Mahout)?

        Show
        Dawid Weiss added a comment - I'm here, I'm here. Apologies for not following the core development, various reasons behind it. I also wanted to wait for the HPPC API to stabilize a bit and to get tested in real production scenarios. I believe this part is over – we replaced PCJ with HPPC in our code and it works very well, especially the open internals give a lot of freedom in performance-critical loops. I realize open internals may become a problem should anything change, but I think weighting the options it was worth it. So... I'd love to contribute HPPC to Mahout and proceed with code maintenance as part of Mahout. The current version of HPPC (0.3.1) is stable. There are 4 JIRA issues in our bug tracking system, but they are features, not bugs. How do I proceed (and are folks still interested in integrating this into Mahout)?
        Hide
        Benson Margulies added a comment -

        Sean,

        We had a lot of discussion of this, and, when last seen, there was a thought that the Carrot/hppc code would be better than what we got from Colt. However, we haven't heard from Dawid in quite a while. Perhaps this would get his attention.

        Show
        Benson Margulies added a comment - Sean, We had a lot of discussion of this, and, when last seen, there was a thought that the Carrot/hppc code would be better than what we got from Colt. However, we haven't heard from Dawid in quite a while. Perhaps this would get his attention.
        Hide
        Sean Owen added a comment -

        Am I right that this is a WontFix / AlreadyDidSomethingSimilar? we have all those Colt classes for a similar purpose. Reopen if I'm wrong.

        Show
        Sean Owen added a comment - Am I right that this is a WontFix / AlreadyDidSomethingSimilar? we have all those Colt classes for a similar purpose. Reopen if I'm wrong.

          People

          • Assignee:
            Dawid Weiss
            Reporter:
            Dawid Weiss
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development