Solr / SOLR-14365

CollapsingQParser - Avoid always allocating int[] and float[] arrays sized to the number of unique values

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 8.4.1
    • Fix Version/s: 8.6, main (9.0)
    • Component/s: None
    • Labels:
      None

      Description

      Since Collapsing is a PostFilter, the documents that reach it must already match all filters and queries, so the number of documents for which Collapsing needs to collect and compute scores is a small fraction of the total number of documents in the index. So why do we always need to consume memory (for the int[] and float[] arrays) for all unique values of the collapsed field? If only 300 unique values of the collapsed field occur in the documents that match the queries and filters, then we only need int[] and float[] arrays of size 300, not 1.2 million. However, we don't know in advance which values of the collapsed field will show up in the results, so we cannot simply use a smaller array.

      The easy fix for this problem is to use only as much memory as we need, via IntIntMap and IntFloatMap, which hold primitives and are much more space-efficient than the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and float[] arrays if the number of matched documents is large (almost all documents match the queries and other filters). But our belief is that this does not happen very frequently (how often do we run collapsing on the entire index?).
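To make the trade-off concrete, here is a minimal sketch of the hash idea in plain Java. It is not Solr's implementation: the class `HashGroupHeads` and its methods are hypothetical, and a small hand-written open-addressing table stands in for the primitive maps mentioned above. It assumes group ordinals are non-negative and that the highest score wins, with the first document kept on ties.

```java
import java.util.Arrays;

/** Hypothetical sketch: best-scoring doc per group ordinal, stored by hashing. */
final class HashGroupHeads {
    private static final int EMPTY = -1; // assumption: ordinals are never negative
    private int[] keys;     // group ordinals
    private int[] docs;     // best doc id seen for each ordinal
    private float[] scores; // score of that best doc
    private int size;

    HashGroupHeads(int expectedGroups) {
        int cap = 16;
        while (cap < expectedGroups * 2) cap <<= 1; // power of two, load <= 0.5
        allocate(cap);
    }

    private void allocate(int cap) {
        keys = new int[cap];
        Arrays.fill(keys, EMPTY);
        docs = new int[cap];
        scores = new float[cap];
        size = 0;
    }

    /** Linear probing: returns the slot holding ord, or the empty slot for it. */
    private int slot(int ord) {
        int mask = keys.length - 1;
        int h = ord * 0x9E3779B9;
        int i = (h ^ (h >>> 16)) & mask;
        while (keys[i] != EMPTY && keys[i] != ord) i = (i + 1) & mask;
        return i;
    }

    /** Records doc as the group head if it is the first or the best so far. */
    void collect(int ord, int doc, float score) {
        int i = slot(ord);
        if (keys[i] == EMPTY) {
            keys[i] = ord;
            docs[i] = doc;
            scores[i] = score;
            if (++size * 2 > keys.length) grow();
        } else if (score > scores[i]) {
            docs[i] = doc;
            scores[i] = score;
        }
    }

    /** Doubles the table and re-inserts every occupied slot. */
    private void grow() {
        int[] oldKeys = keys;
        int[] oldDocs = docs;
        float[] oldScores = scores;
        allocate(oldKeys.length * 2);
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] == EMPTY) continue;
            int i = slot(oldKeys[j]);
            keys[i] = oldKeys[j];
            docs[i] = oldDocs[j];
            scores[i] = oldScores[j];
            size++;
        }
    }

    /** Returns the head doc for ord, or -1 if the group was never collected. */
    int headDoc(int ord) {
        int i = slot(ord);
        return keys[i] == EMPTY ? -1 : docs[i];
    }

    int size() { return size; }

    int capacity() { return keys.length; }

    public static void main(String[] args) {
        HashGroupHeads heads = new HashGroupHeads(16);
        // 300 distinct ordinals spread over a 1.2 million value space
        for (int g = 0; g < 300; g++) heads.collect(g * 4000, g, 1.0f);
        System.out.println(heads.size() + " groups in " + heads.capacity() + " slots");
    }
}
```

With the array approach, 1.2 million unique values cost about 9.6 MB per request (1.2 million entries of 4 bytes each in both the int[] and the float[]), whatever the number of matches. The table above grows only with the number of groups actually collected, and pays for that with hashing and probing on every collected document.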

      For this issue I propose adding 2 methods for collapsing:

      • array: the current implementation
      • hash: the new approach, which will be the default method

      Later we can add another method, smart, which automatically picks a method based on a comparison between the number of docs that match the queries and filters and the number of unique values of the field.
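One way the proposed smart selection could look is sketched below. Everything here is an assumption for illustration: the class `CollapseMethodChooser`, the `Method` enum, and in particular the 0.25 cutoff are hypothetical and not taken from Solr or from this issue. The sketch also assumes the number of matched documents is known, or can be estimated, before collection starts.

```java
/** Hypothetical sketch of a "smart" choice between the array and hash methods. */
final class CollapseMethodChooser {
    enum Method { ARRAY, HASH }

    /** Assumed cutoff: below this fraction of touched values, prefer hashing. */
    static final double DENSITY_CUTOFF = 0.25;

    static Method choose(long matchedDocs, long uniqueValues) {
        if (uniqueValues <= 0) return Method.HASH;
        // Matched docs are an upper bound on the number of distinct groups seen.
        long maxGroups = Math.min(matchedDocs, uniqueValues);
        double density = (double) maxGroups / uniqueValues;
        return density < DENSITY_CUTOFF ? Method.HASH : Method.ARRAY;
    }

    public static void main(String[] args) {
        System.out.println(choose(300, 1_200_000));
        System.out.println(choose(1_000_000, 1_200_000));
    }
}
```

A real cutoff would need benchmarking, since the maps are reported to be 10x to 20x slower than plain arrays when nearly all documents match.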

            People

            • Assignee: Cao Manh Dat (caomanhdat)
            • Reporter: Cao Manh Dat (caomanhdat)

                Time Tracking

                • Original Estimate: Not Specified
                • Remaining Estimate: 0h
                • Time Spent: 8h 10m
