Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7572

Cache the hashcode of the doc values terms queries

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.4
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      DocValuesNumbersQuery and DocValuesTermsQuery can potentially wrap a large number of terms so it would help if we cached their hashcode.

      1. LUCENE-7572.patch
        15 kB
        Adrien Grand

        Activity

        Hide
        jpountz Adrien Grand added a comment -

        Here is a patch. DocValuesTermsQuery and DocValuesNumbersQuery now compute a hashCode in constant time. DocValuesTermsQuery was also switched to PrefixCodedTerms to be more consistent with TermsQuery and DocValuesNumbersQuery switched to a hash set of primitive longs to avoid having to box all terms.

        Show
        jpountz Adrien Grand added a comment - Here is a patch. DocValuesTermsQuery and DocValuesNumbersQuery now compute a hashCode in constant time. DocValuesTermsQuery was also switched to PrefixCodedTerms to be more consistent with TermsQuery and DocValuesNumbersQuery switched to a hash set of primitive longs to avoid having to box all terms.
        Hide
        dsmiley David Smiley added a comment -

        Looks nice Adrien.

        I like the addition of LongHashSet into Lucene. Maybe it should be public and go into the util package? Did you consider simply ripping off the one from HPPC?

        Can you please remind me on why PrefixCodedTerms exists instead of an FST? Granted it's not quite the same.

        Show
        dsmiley David Smiley added a comment - Looks nice Adrien. I like the addition of LongHashSet into Lucene. Maybe it should be public and go into the util package? Did you consider simply ripping off the one from HPPC? Can you please remind me on why PrefixCodedTerms exists instead of an FST? Granted it's not quite the same.
        Hide
        jpountz Adrien Grand added a comment -

        Maybe it should be public and go into the util package?

        This set implementation is quite limited, for instance it does not support modifications after creation. I am not against making it reusable but I'd like to wait for other use-cases for this class first, maybe there won't be any.

        Did you consider simply ripping off the one from HPPC?

        That is another idea indeed. The thing that made me decide against it is that it would pull close to 1k lines of code as well as several classes from hppc like AbstractLongCollection, LongSet, etc. I think that would be a good idea if we needed a more complete set implementation on top of longs, which doesn't seem to be the case today.

        Can you please remind me on why PrefixCodedTerms exists instead of an FST?

        PrefixCodedTerms should be more compact (this might depend on the efficiency of suffix compression by the FST for the given data) and cheaper to build. On the other hand, the FST has more features, like the ability to do lookups.

        Show
        jpountz Adrien Grand added a comment - Maybe it should be public and go into the util package? This set implementation is quite limited, for instance it does not support modifications after creation. I am not against making it reusable but I'd like to wait for other use-cases for this class first, maybe there won't be any. Did you consider simply ripping off the one from HPPC? That is another idea indeed. The thing that made me decide against it is that it would pull close to 1k lines of code as well as several classes from hppc like AbstractLongCollection, LongSet, etc. I think that would be a good idea if we needed a more complete set implementation on top of longs, which doesn't seem to be the case today. Can you please remind me on why PrefixCodedTerms exists instead of an FST? PrefixCodedTerms should be more compact (this might depend on the efficiency of suffix compression by the FST for the given data) and cheaper to build. On the other hand, the FST has more features, like the ability to do lookups.
        Hide
        dsmiley David Smiley added a comment -

        RE LongHashSet: makes sense.

        RE PrefixCodedTerms: wow I didn't know it could be more compact than an FST! I thought the FST reigned supreme here. Would you mind adding some JavaDocs to PrefixCodedTerms to say this? I do see it definitely has a subset of FST features.

        I was looking at PrefixCodedTerms a bit and I see some minor things that could be improved. The presence of "delGen" seems out of place having nothing directly to do with this utility class's job; probably bolted on for BufferedUpdatesStream's convenience. And RAMFile shouldn't depend on RAMDirectory for the sole reason of buffer size increase notification; instead it could take a callback lambda. Of course these are both separate issues from this one.

        Show
        dsmiley David Smiley added a comment - RE LongHashSet: makes sense. RE PrefixCodedTerms: wow I didn't know it could be more compact than an FST! I thought the FST reigned supreme here. Would you mind adding some JavaDocs to PrefixCodedTerms to say this? I do see it definitely has a subset of FST features. I was looking at PrefixCodedTerms a bit and I see some minor things that could be improved. The presence of "delGen" seems out of place having nothing directly to do with this utility class's job; probably bolted on for BufferedUpdatesStream's convenience. And RAMFile shouldn't depend on RAMDirectory for the sole reason of buffer size increase notification; instead it could take a callback lambda. Of course these are both separate issues from this one.
        Hide
        jpountz Adrien Grand added a comment -

        Would you mind adding some JavaDocs to PrefixCodedTerms to say this?

        I will do that.

        The presence of "delGen" seems out of place having nothing directly to do with this utility class's job

        Agreed. If something needs a delGen, it should create another object that wraps a PrefixCodedTerms instance and the delGen.

        And RAMFile shouldn't depend on RAMDirectory

        Agreed too, even though that one is a bit less annoying to me since the RAMDir and the constructor that sets it are both pkg-private.

        Show
        jpountz Adrien Grand added a comment - Would you mind adding some JavaDocs to PrefixCodedTerms to say this? I will do that. The presence of "delGen" seems out of place having nothing directly to do with this utility class's job Agreed. If something needs a delGen, it should create another object that wraps a PrefixCodedTerms instance and the delGen. And RAMFile shouldn't depend on RAMDirectory Agreed too, even though that one is a bit less annoying to me since the RAMDir and the constructor that sets it are both pkg-private.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit ea1569e2914f9ba914b582a0801d6cb83a29529b in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ea1569e ]

        LUCENE-7572: Cache the hash code of doc values queries.

        Show
        jira-bot ASF subversion and git services added a comment - Commit ea1569e2914f9ba914b582a0801d6cb83a29529b in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ea1569e ] LUCENE-7572 : Cache the hash code of doc values queries.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 9a72bd871ec684f186c7818ff1582fc1d1fe5f3f in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9a72bd8 ]

        LUCENE-7572: Cache the hash code of doc values queries.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 9a72bd871ec684f186c7818ff1582fc1d1fe5f3f in lucene-solr's branch refs/heads/branch_6x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9a72bd8 ] LUCENE-7572 : Cache the hash code of doc values queries.

          People

          • Assignee:
            Unassigned
            Reporter:
            jpountz Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development