Lucene - Core
LUCENE-3443

Port 3.x FieldCache.getDocsWithField() to trunk

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: core/search
    • Labels: None
    • Lucene Fields: New

      Description

      [Spinoff from LUCENE-3390]

      I think the approach in 3.x for handling un-valued docs, and making it
      possible to specify how such docs are sorted, is better than the
      solution we have in trunk.

      I like that FC has a dedicated method to get the Bits for docs with field
      – easy for apps to directly use. And I like that the
      bits have their own entry in the FC.

      One downside is that it's 2 passes to get values and valid bits, but
      I think we can fix this by passing an optional bool to FC.getXXX methods
      indicating you want the bits, and then populate the FC entry for the
      missing bits as well. (We can do that for 3.x and trunk). Then it's a
      single pass.
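
      A rough sketch of how that could look from the app side (the trailing
      boolean on getLongs and the getDocsWithField method follow the proposal
      above; treat the exact signatures as illustrative, not final):

          import java.io.IOException;

          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.search.FieldCache;
          import org.apache.lucene.util.Bits;

          public class DocsWithFieldExample {
            static void printValues(IndexReader reader) throws IOException {
              // ask the cache to also record which docs have a value while it loads the array
              long[] values = FieldCache.DEFAULT.getLongs(reader, "price", null, true);

              // the bits get their own cache entry; here they were filled in the same pass
              Bits docsWithField = FieldCache.DEFAULT.getDocsWithField(reader, "price");

              for (int doc = 0; doc < reader.maxDoc(); doc++) {
                if (docsWithField.get(doc)) {
                  System.out.println(doc + " -> " + values[doc]);
                } else {
                  System.out.println(doc + " has no value for this field");
                }
              }
            }
          }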

      Attachments

      1. LUCENE-3443.patch (240 kB) - Michael McCandless
      2. LUCENE-3443.patch (195 kB) - Robert Muir
      3. LUCENE-3443.patch (194 kB) - Michael McCandless

          Activity

          Ryan McKinley added a comment -

          +1

          A separate cache entry for Bits is a good thing assuming it can be calculated in a single pass.

          Michael McCandless added a comment -

          Initial patch, cutting over to 3.x's approach.

          I still need to check the patch a bit more closely.... but tests do pass.

          I think we should commit this, first, then separately optimize the valid bits to single pass....

          Yonik Seeley added a comment -

          I think we should commit this, first, then separately optimize the valid bits to single pass....

          That represents a pretty significant performance regression though, and it seems we should try to avoid stuff like that on trunk (better in a branch if one is going to remove functionality with the idea of adding it back later). Or we could generate the bits by default - the extra cache entry if not needed is less serious than doubling the generation time.

          Another small piece we are losing from trunk is numUniqueTerms.
          What about returning an object, so instead of getLongs() returning a long[], it would return a LongValues that had
          a long[] as well as numUniqueTerms, and docsWithValues (optionally set). This also halves the number of cache lookups needed when using sortMissingLast and supplies a place to put more info in the future (stuff like numUniqueTerms).
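
          For illustration, the kind of carrier object being suggested might look
          roughly like this ("LongValues" is hypothetical here, not an existing class):

              import org.apache.lucene.util.Bits;

              public class LongValues {
                public final long[] values;         // per-docID values, as getLongs() returns today
                public final int numUniqueTerms;    // stats that currently have nowhere to live
                public final Bits docsWithValues;   // optionally set; null if the caller didn't ask

                public LongValues(long[] values, int numUniqueTerms, Bits docsWithValues) {
                  this.values = values;
                  this.numUniqueTerms = numUniqueTerms;
                  this.docsWithValues = docsWithValues;
                }
              }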

          Michael McCandless added a comment -

          That represents a pretty significant performance regression though, and it seems we should try to avoid stuff like that on trunk (better in a branch if one is going to remove functionality with the idea of adding it back later).

          Well we haven't released trunk yet, and this is not a perf regression
          vs 3.4.0?

          Ie, there's no guarantee w/in trunk that performance won't "change",
          just like there's no guarantee APIs and index format won't "change".

          Or we could generate the bits by default - the extra cache entry if not needed is less serious than doubling the generation time.

          I'd rather not do that; apps that don't use sort missing first/last
          shouldn't be forced to spend the RAM (even if it's only a bit per
          doc).

          I think we can find a solution that makes it optional; I just don't
          think we have to do it here, now.

          This issue can focus on getting to 3.x's FC impl.

          Another small piece we are losing from trunk is numUniqueTerms.

          Well, we track this in FC (only in trunk) today, but it's not used
          anywhere, I think? Also, this is redundant with
          Terms.getUniqueTermCount()?

          What about returning an object, so instead of getLongs() returning a long[], it would return a LongValues that had
          a long[] as well as numUniqueTerms, and docsWithValues (optionally set). This also halves the number of cache lookups needed when using sortMissinglast and supplies a place to put more info in the future (stuff like numUniqueTerms).

          I think for now we should stick w/ simple arrays for this issue; we
          can explore returning an object under a new issue. The extra cache
          lookup is really a negligible cost.

          Yonik Seeley added a comment -

          Ie, there's no guarantee w/in trunk that performance won't "change",

          Of course not. I'm arguing a specific point though: that we shouldn't double the time it takes solr users to instantiate field caches (or other lucene users that regularly use sortMissingLast). "we can do it later" is not a good argument to remove existing functionality.

          > Or we could generate the bits by default - the extra cache entry if not needed is less serious than doubling the generation time.

          I mean, generate the bits by default until the two-pass problem is fixed (not forever).

          I'd rather not do that; apps that don't use sort missing first/last shouldn't be forced to spend the RAM (even if it's only a bit per doc).

          But it's trunk, right? There's no guarantee w/in trunk that performance won't "change"
          Using 1/32 more memory (3%) is way less evil than doubling field cache entry times.

          Also, this (uniqueTermCount) is redundant with Terms.getUniqueTermCount()?

          Not for numeric fields, where a precisionStep causes nTerms > nUniqueValues

          Michael McCandless added a comment -

          I'm arguing a specific point though: that we shouldn't double the time it takes solr users to instantiate field caches (or other lucene users that regularly use sortMissingLast). "we can do it later" is not a good argument to remove existing functionality.

          We're not doubling the time vs 3.x.

          I'd rather not do that; apps that don't use sort missing first/last shouldn't be forced to spend the RAM (even if it's only a bit per doc).

          But it's trunk, right? There's no guarantee w/in trunk that performance won't "change"

          Using 1/32 more memory (3%) is way less evil than doubling field cache entry times.

          Again, compared to 3.x, that'd be worse. Ie, w/ the patch as it now
          stands we could ship 4.0 and we'd match 3.x.

          For a byte[] FC entry that's 12.5% extra RAM.

          I think one-time FC init cost is the lesser evil vs "permanently"
          tying up unused RAM.

          Also, this (uniqueTermCount) is redundant with Terms.getUniqueTermCount()?

          Not for numeric fields, where a precisionStep causes nTerms > nUniqueValues

          Ahh, true.

          Yonik Seeley added a comment -

          Just because the functionality isn't in 3.x doesn't mean we should feel free to rip it out of trunk and replace it with something that has less functionality (well, we should be able to if we think it's for the best, but my point is... it's not.) IMO, this patch should be further developed as a patch until the 2-pass problem has been solved as well. Unless there's a good reason, we shouldn't be moving backwards.

          Robert Muir added a comment -

          I like this patch (it makes the fieldcache less confusing to me, currently I only understand the 3.x one).

          I updated the patch with a tiny optimization, maybe it can help resolve some concerns. I see that solr defaults its "string" field etc to sortMissingLast.

          I think it's very common that many fields are populated for every document (of course this option is totally stupid in this case...), but regardless of stupid configurations we shouldn't do any additional pass there:

                  // if the field is fully populated, don't bother enumerating
                  if (terms.getDocCount() == maxDoc) {
                    return new Bits.MatchAllBits(maxDoc);
                  }
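
          (Terms.getDocCount() here is the number of documents with at least one
          term for the field - when that statistic is available - so equality with
          maxDoc means every document has a value, and Bits.MatchAllBits stands in
          for a real bit set: no enumeration pass and essentially no extra RAM.)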
          
          Michael McCandless added a comment -

          New patch.

          I managed to fold in setting the docsWithField bits with a single pass. You pass an extra boolean to each getXXX.

          I think it's ready!
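
          To give an idea of how the resulting bits feed sort-missing-first/last,
          a minimal sketch (helper name and shape are illustrative, not the actual
          comparator code):

              import org.apache.lucene.util.Bits;

              public class MissingValueHelper {
                // substitute a caller-chosen value for docs without a value, e.g.
                // Long.MAX_VALUE to push missing docs last in an ascending sort
                static long valueFor(int doc, long[] values, Bits docsWithField, long missingValue) {
                  return docsWithField.get(doc) ? values[doc] : missingValue;
                }
              }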

          Yonik Seeley added a comment -

          Looks good! +1 to commit.

          Something we should consider for the future is expanding the functionality of the Parser class.
          If the parser could handle the creation of the whole entry, we could get rid of hacks like throwing+catching an exception for the NumericField parsers and enable doing other things like populating a FieldCache entry from a CSF.
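
          Roughly the kind of shape that could take (the interface name and method
          here are made up for illustration, not an existing API):

              import java.io.IOException;

              import org.apache.lucene.index.IndexReader;

              public interface CacheEntryBuilder<T> {
                // build the complete cache value for one reader/field - values array,
                // docsWithField bits, stats - rather than only parsing single terms,
                // so an implementation could also be backed by CSF instead of the terms index
                T create(IndexReader reader, String field) throws IOException;
              }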

          Uwe Schindler added a comment -

          Bulk close after release of 3.5


            People

            • Assignee: Michael McCandless
            • Reporter: Michael McCandless
            • Votes: 0
            • Watchers: 1
