Solr
  1. Solr
  2. SOLR-7787

Fork HyperLogLog and remove fastutil dependency

    Details

    • Type: Task Task
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      So fastutil is now part of Solr's distribution (because the stats component uses hyperloglog library, which in turn requires fastutil). I looked at the actual uses of fastutil and only java-hll uses it (and only a few classes).

      I've created a fork that uses HPPC instead (also randomized all tests, they pass). Since it's a relatively simple package I think it could be forked and imported into Solr's codebase entirely. I'd make a pull request but I see Hoss also created a few comments/ PRs and none of them received any attention; the project seems to be stale or dead?

      1. SOLR-7787.patch
        259 kB
        Dawid Weiss

        Issue Links

          Activity

          Hide
          Hoss Man added a comment -

          I don't have any strong opinions about this issue as it currently stands (ie: "Fork HyperLogLog and remove fastutil dependency") – as you mentioned, the java-hll project, with it's original goals of a cross langauge binary format protocol, doesn't appear to be very active anymore – but for the record can you please clarify the context/objective of this issue?

          It started out as "Promote fastutil to first-order dependency" – and your goal seemed to be to remove HPPC, now you seem to have flipped the objective so that the focus is on keeping HPPC, and eliminating fastutil.

          having a clear understanding of "the goal" for either refactoring/forking/etc would be helpful for folks to develop an informed opinion.

          Show
          Hoss Man added a comment - I don't have any strong opinions about this issue as it currently stands (ie: "Fork HyperLogLog and remove fastutil dependency") – as you mentioned, the java-hll project, with it's original goals of a cross langauge binary format protocol, doesn't appear to be very active anymore – but for the record can you please clarify the context/objective of this issue? It started out as "Promote fastutil to first-order dependency" – and your goal seemed to be to remove HPPC, now you seem to have flipped the objective so that the focus is on keeping HPPC, and eliminating fastutil. having a clear understanding of "the goal" for either refactoring/forking/etc would be helpful for folks to develop an informed opinion.
          Hide
          Dawid Weiss added a comment -

          Well, I thought lack of initial feedback was lazy consensus

          And seriously – it can go either way. My original intention was to update HPPC which is duplicated in Solr and the clustering contrib. These have to be consistent. From there I observed that:

          1) Solr uses HPPC in a small number of classes,
          2) fastutil is present in solr's lib, but it's not used in any classes, it is just a transitive dependency from hll,
          3) fastutil is much larger than HPPC (roughly 15x).

          I am really ok with any option. I decided to cut fastutil out because it's just a much larger library (of which almost nothing is practically used in Solr). Also, hll's implementation uses fastutil iterators very inefficiently (causing intermediate autoboxing on every value) so I thought I'd take a stab at improving that while converting to HPPC.

          I can also leave everything as-is, really, but the patch I have locally seems like a nice improvement on its own (and could be perhaps pruned further to get rid of unnecessary stuff like serialization, etc.).

          Show
          Dawid Weiss added a comment - Well, I thought lack of initial feedback was lazy consensus And seriously – it can go either way. My original intention was to update HPPC which is duplicated in Solr and the clustering contrib. These have to be consistent. From there I observed that: 1) Solr uses HPPC in a small number of classes, 2) fastutil is present in solr's lib, but it's not used in any classes, it is just a transitive dependency from hll, 3) fastutil is much larger than HPPC (roughly 15x). I am really ok with any option. I decided to cut fastutil out because it's just a much larger library (of which almost nothing is practically used in Solr). Also, hll's implementation uses fastutil iterators very inefficiently (causing intermediate autoboxing on every value) so I thought I'd take a stab at improving that while converting to HPPC. I can also leave everything as-is, really, but the patch I have locally seems like a nice improvement on its own (and could be perhaps pruned further to get rid of unnecessary stuff like serialization, etc.).
          Hide
          David Smiley added a comment -

          I think we should avoid very large dependencies that we so barely use. +1 to Fork HLL so that we needn't use FastUtil.

          Perhaps it would be useful to fork HLL publicly (e.g. on GitHub & publish to maven) to be its own 3rd party dependency – I imagine others aren't too keen on using a 17MB library that they aren't already using. Ideally this would be condoned with the original library owner to use their maven groupId but differentiated in the artifactId, but it isn't necessary of course.

          Show
          David Smiley added a comment - I think we should avoid very large dependencies that we so barely use. +1 to Fork HLL so that we needn't use FastUtil. Perhaps it would be useful to fork HLL publicly (e.g. on GitHub & publish to maven) to be its own 3rd party dependency – I imagine others aren't too keen on using a 17MB library that they aren't already using. Ideally this would be condoned with the original library owner to use their maven groupId but differentiated in the artifactId, but it isn't necessary of course.
          Hide
          Upayavira added a comment -

          I can see the rationale of forking. I wonder though would hll accept a patch to remove the dependency (or make it optional)? Thus removing the need for us to maintain a fork?

          Show
          Upayavira added a comment - I can see the rationale of forking. I wonder though would hll accept a patch to remove the dependency (or make it optional)? Thus removing the need for us to maintain a fork?
          Hide
          Hoss Man added a comment -

          My original intention was to update HPPC which is duplicated in Solr and the clustering contrib. These have to be consistent. From there I observed that:

          ...

          I am really ok with any option. I decided to cut fastutil out because it's just a much larger library (of which almost nothing is practically used in Solr). Also, hll's implementation uses fastutil iterators very inefficiently (causing intermediate autoboxing on every value) so I thought I'd take a stab at improving that while converting to HPPC.

          Got it – so the primary concern is/was dependency version consistency, and in the process of going down that rabbit hole, you realized forking HLL to eliminate the dep on fastutil seemed like the best win all around.

          Sounds like a good plan of attack to me.

          Show
          Hoss Man added a comment - My original intention was to update HPPC which is duplicated in Solr and the clustering contrib. These have to be consistent. From there I observed that: ... I am really ok with any option. I decided to cut fastutil out because it's just a much larger library (of which almost nothing is practically used in Solr). Also, hll's implementation uses fastutil iterators very inefficiently (causing intermediate autoboxing on every value) so I thought I'd take a stab at improving that while converting to HPPC. Got it – so the primary concern is/was dependency version consistency, and in the process of going down that rabbit hole, you realized forking HLL to eliminate the dep on fastutil seemed like the best win all around. Sounds like a good plan of attack to me.
          Hide
          Dawid Weiss added a comment -

          Got it – so the primary concern is/was dependency version consistency, and in the process of going down that rabbit hole, you realized forking HLL to eliminate the dep on fastutil seemed like the best win all around.

          Yes, that's exactly what happened.

          I wonder though would hll accept a patch to remove the dependency (or make it optional)?

          It's not a complete removal – I simply replaced fastutil with HPPC (since Solr already uses it in a couple of places anyway). I also did a few other changes the author may not be too happy with (replaced testng with junit, replaced hardcoded randomization seeds with randomizedtesting, etc.) in preparation of importing the code into Solr's codebase. I will submit a PR too, but there seems to be a general lack of interest in this code from the original author, see Hoss's unaddressed question here:

          https://github.com/aggregateknowledge/java-hll/issues/15

          and Timon Karnezos doesn't seem to be too active recently: https://github.com/timonk?tab=contributions&period=monthly

          Show
          Dawid Weiss added a comment - Got it – so the primary concern is/was dependency version consistency, and in the process of going down that rabbit hole, you realized forking HLL to eliminate the dep on fastutil seemed like the best win all around. Yes, that's exactly what happened. I wonder though would hll accept a patch to remove the dependency (or make it optional)? It's not a complete removal – I simply replaced fastutil with HPPC (since Solr already uses it in a couple of places anyway). I also did a few other changes the author may not be too happy with (replaced testng with junit, replaced hardcoded randomization seeds with randomizedtesting, etc.) in preparation of importing the code into Solr's codebase. I will submit a PR too, but there seems to be a general lack of interest in this code from the original author, see Hoss's unaddressed question here: https://github.com/aggregateknowledge/java-hll/issues/15 and Timon Karnezos doesn't seem to be too active recently: https://github.com/timonk?tab=contributions&period=monthly
          Hide
          Dawid Weiss added a comment -

          This is the PR, let's see if it receives any attention. Otherwise I'll try to clean it up from unnecessary stuff and import it directly into Solr's codebase.

          https://github.com/aggregateknowledge/java-hll/pull/16

          Show
          Dawid Weiss added a comment - This is the PR, let's see if it receives any attention. Otherwise I'll try to clean it up from unnecessary stuff and import it directly into Solr's codebase. https://github.com/aggregateknowledge/java-hll/pull/16
          Hide
          ASF subversion and git services added a comment -

          Commit 1691350 from Dawid Weiss in branch 'dev/branches/solr7787'
          [ https://svn.apache.org/r1691350 ]

          SOLR-7787 (jhll integration).

          Show
          ASF subversion and git services added a comment - Commit 1691350 from Dawid Weiss in branch 'dev/branches/solr7787' [ https://svn.apache.org/r1691350 ] SOLR-7787 (jhll integration).
          Hide
          Dawid Weiss added a comment -

          Attaching a patch that imports java-hll code into Solr, removes the dependency on fastutil and jhll, cleans up the code a bit in a few places where we restrict API use.

          Show
          Dawid Weiss added a comment - Attaching a patch that imports java-hll code into Solr, removes the dependency on fastutil and jhll, cleans up the code a bit in a few places where we restrict API use.
          Hide
          Yonik Seeley added a comment -

          When HLL was first added, I remember scanning the code quickly and noting that the explicit storage rehashed the given value (which is wasted effort given that the provided values already need to be good hashes to start with). Is there a way to avoid that? Prob not a big deal though.

          Show
          Yonik Seeley added a comment - When HLL was first added, I remember scanning the code quickly and noting that the explicit storage rehashed the given value (which is wasted effort given that the provided values already need to be good hashes to start with). Is there a way to avoid that? Prob not a big deal though.
          Hide
          Dawid Weiss added a comment -

          Yeah, I honestly don't think those storage optimizations are worth the effort either, but I think the code can be tweaked after it's integrated – I wouldn't want to tinker with it in this patch (can you file another issue)?

          Show
          Dawid Weiss added a comment - Yeah, I honestly don't think those storage optimizations are worth the effort either, but I think the code can be tweaked after it's integrated – I wouldn't want to tinker with it in this patch (can you file another issue)?
          Hide
          ASF subversion and git services added a comment -

          Commit 1691518 from Dawid Weiss in branch 'dev/trunk'
          [ https://svn.apache.org/r1691518 ]

          SOLR-7787: Removed fastutil and java-hll dependency, integrated HyperLogLog from java-hll into Solr core.

          Show
          ASF subversion and git services added a comment - Commit 1691518 from Dawid Weiss in branch 'dev/trunk' [ https://svn.apache.org/r1691518 ] SOLR-7787 : Removed fastutil and java-hll dependency, integrated HyperLogLog from java-hll into Solr core.
          Hide
          ASF subversion and git services added a comment -

          Commit 1691531 from Dawid Weiss in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1691531 ]

          SOLR-7787: Removed fastutil and java-hll dependency, integrated HyperLogLog from java-hll into Solr core.

          Show
          ASF subversion and git services added a comment - Commit 1691531 from Dawid Weiss in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1691531 ] SOLR-7787 : Removed fastutil and java-hll dependency, integrated HyperLogLog from java-hll into Solr core.
          Hide
          ASF subversion and git services added a comment -

          Commit 1691535 from Dawid Weiss in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1691535 ]

          SOLR-7787: Removed fastutil and java-hll dependency, integrated HyperLogLog from java-hll into Solr core. (committed from solr/ folder; note to Uwe: this is why I hate SVN...).

          Show
          ASF subversion and git services added a comment - Commit 1691535 from Dawid Weiss in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1691535 ] SOLR-7787 : Removed fastutil and java-hll dependency, integrated HyperLogLog from java-hll into Solr core. (committed from solr/ folder; note to Uwe: this is why I hate SVN...).
          Hide
          ASF subversion and git services added a comment -

          Commit 1691541 from Dawid Weiss in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1691541 ]

          SOLR-7787: Removed fastutil and java-hll dependency, integrated HyperLogLog from java-hll into Solr core. (committed from solr/ folder; note to Uwe: this is why I hate SVN...).

          Show
          ASF subversion and git services added a comment - Commit 1691541 from Dawid Weiss in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1691541 ] SOLR-7787 : Removed fastutil and java-hll dependency, integrated HyperLogLog from java-hll into Solr core. (committed from solr/ folder; note to Uwe: this is why I hate SVN...).
          Hide
          ASF subversion and git services added a comment -

          Commit 1691609 from Steve Rowe in branch 'dev/trunk'
          [ https://svn.apache.org/r1691609 ]

          SOLR-7787: Remove obsolute fastutil and hll files in solr/licenses/

          Show
          ASF subversion and git services added a comment - Commit 1691609 from Steve Rowe in branch 'dev/trunk' [ https://svn.apache.org/r1691609 ] SOLR-7787 : Remove obsolute fastutil and hll files in solr/licenses/
          Hide
          ASF subversion and git services added a comment -

          Commit 1691610 from Steve Rowe in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1691610 ]

          SOLR-7787: Remove obsolute fastutil and hll files in solr/licenses/ (merged trunk r1691609)

          Show
          ASF subversion and git services added a comment - Commit 1691610 from Steve Rowe in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1691610 ] SOLR-7787 : Remove obsolute fastutil and hll files in solr/licenses/ (merged trunk r1691609)
          Hide
          Dawid Weiss added a comment -

          Darn, my bad – thanks Steve Rowe!

          Show
          Dawid Weiss added a comment - Darn, my bad – thanks Steve Rowe !
          Hide
          Shalin Shekhar Mangar added a comment -

          Bulk close for 5.3.0 release

          Show
          Shalin Shekhar Mangar added a comment - Bulk close for 5.3.0 release

            People

            • Assignee:
              Dawid Weiss
              Reporter:
              Dawid Weiss
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development