Lucene - Core / LUCENE-505

MultiReader.norm() takes up too much memory: norms byte[] should be made into an Object

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.9
    • Component/s: core/index
    • Labels:
      None
    • Environment:

      Patch is against Lucene 1.9 trunk (as of Mar 1 06)

      Description

      MultiReader.norms() is very inefficient: it has to construct a byte array that's as long as all the documents in every segment. This doubles the memory requirement for scoring MultiReaders vs. Segment Readers. Although this is cached, it's still a baseline of memory that is unnecessary.

      The problem is that the normalization factors are passed around as a byte[]. If this were instead replaced with an object, you could perform a whole host of optimizations:
      a. When reading, you wouldn't have to construct a "fakeNorms" array of all 1.0fs. You could instead return a singleton object that would just return 1.0f.
      b. MultiReader could use an object that could delegate to NormFactors of the subreaders
      c. You could write an implementation that could use mmap to access the norm factors. Or if the index isn't long lived, you could use an implementation that reads directly from the disk.

      The patch provided here replaces the use of byte[] with a new abstract class called NormFactors.
      NormFactors has two methods:
      public abstract byte getByte(int doc) throws IOException; // Returns the byte[doc]
      public float getFactor(int doc) throws IOException; // Calls Similarity.decodeNorm(getByte(doc))

      There are four implementations of this abstract class:
      1. NormFactors.EmptyNormFactors - This replaces the fakeNorms with a singleton that only returns 1.0
      2. NormFactors.ByteNormFactors - Converts a byte[] to a NormFactors for backwards compatibility in constructors.
      3. MultiNormFactors - Multiplexes the NormFactors in MultiReader to prevent the need to construct the gigantic norms array.
      4. SegmentReader.Norm - Same class, but now extends NormFactors to provide the same access.

      In addition, many of the Query and Scorer classes were changed to pass around NormFactors instead of byte[], and to call getFactor() instead of indexing into the byte[]. I have kept IndexReader.norms(String) around for backwards compatibility, but marked it as deprecated. I believe that the use of ByteNormFactors in IndexReader.getNormFactors() will keep backward compatibility with other IndexReader implementations, but I don't know how to test that.
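
      As a rough sketch of the abstraction (not the patch itself; Similarity.decodeNorm/encodeNorm are the existing static helpers, everything else is illustrative):

        import java.io.IOException;
        import org.apache.lucene.search.Similarity;

        public abstract class NormFactors {

          // Returns the raw norm byte for a document (the old norms[doc]).
          public abstract byte getByte(int doc) throws IOException;

          // Decodes the norm byte, replacing Similarity.decodeNorm(norms[doc]).
          public float getFactor(int doc) throws IOException {
            return Similarity.decodeNorm(getByte(doc));
          }

          // Singleton replacement for fakeNorms: every document gets a factor of 1.0f.
          public static final NormFactors EMPTY = new NormFactors() {
            public byte getByte(int doc) { return Similarity.encodeNorm(1.0f); }
            public float getFactor(int doc) { return 1.0f; }
          };

          // Wraps an existing byte[] for backwards compatibility (ByteNormFactors).
          public static NormFactors wrap(final byte[] norms) {
            return new NormFactors() {
              public byte getByte(int doc) { return norms[doc]; }
            };
          }
        }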

      1. LazyNorms.patch
        1 kB
        Steven Tamm
      2. NormFactors.patch
        26 kB
        Steven Tamm
      3. NormFactors.patch
        48 kB
        Steven Tamm
      4. NormFactors20.patch
        21 kB
        Steven Tamm

        Activity

        tamm Steven Tamm added a comment -

        This patch doesn't include my previous change to TermScorer. It passes all of the lucene unit tests in addition to our set of tests.

        tamm Steven Tamm added a comment -

        Sorry, I didn't remove whitespace in the previous patch. This one's easier to read.

        svn diff --diff-cmd diff -x "-b -u" works better than svn diff --diff-cmd diff -x -b -x -u

        yseeley@gmail.com Yonik Seeley added a comment -

        > MultiReader.norms() is very inefficient: it has to construct a byte array that's as long as all the documents in every
        > segment. This doubles the memory requirement for scoring MultiReaders vs. Segment Readers.

        Are you positive? It shouldn't. MultiReader.norms(field) does not call subReader.norms(field), it calls
        norms(String field, byte[] result, int offset) that puts the results directly in the norm array without causing it to be cached in the subReader.
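
        Roughly, the existing code path looks like this (a sketch from memory, not verbatim source): one array is allocated for the whole MultiReader and each sub-reader fills its slice in place.

          public synchronized byte[] norms(String field) throws IOException {
            byte[] bytes = (byte[]) normsCache.get(field);
            if (bytes == null) {
              bytes = new byte[maxDoc()];                      // one array for the whole MultiReader
              for (int i = 0; i < subReaders.length; i++)
                subReaders[i].norms(field, bytes, starts[i]);  // fill slice; no sub-reader caching
              normsCache.put(field, bytes);
            }
            return bytes;
          }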

        Of course if you call norms() on both the MultiReader and subReaders yourself, then things will be doubly cached.

        What was the performance impact of your patches?

        tamm Steven Tamm added a comment -

        I made the change less for MultiReader, but to prevent the instantiation of the fakeNorms array (which is an extra 1MB of useless memory for us). In addition, we don't have long lived indexes, so keeping the index loading memory consumption down is critical. And being able to avoid all byte[] in the future is a necessity for us.

        You are correct that it won't help MultiReader.norms() that much, unless you are also calling doSetNorm (whereupon you get the double instantiation, since the subreader will cache its norms as well).

        cutting Doug Cutting added a comment -

        I don't see how the memory requirements of MultiReader are twice that of SegmentReader. MultiReader does not call norms(String) on each sub-reader, but rather norms(String, byte[], int), storing them in a previously allocated array, so the sub-reader normally never constructs an array for its norms.

        I also worry about performance with this change. Have you benchmarked this while searching large indexes? For example, in TermScorer.score(HitCollector, int), Lucene's innermost loop, you change two array accesses into a call to an interface. That could make a substantial difference. Small changes to that method can cause significant performance changes.
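
        For illustration, the kind of inner-loop change at stake looks roughly like this (normDecoder, norms and normFactors are illustrative names, not the actual TermScorer source):

          // Before: per collected doc, two array lookups (the static decode table and norms[]).
          score *= normDecoder[norms[doc] & 0xFF];
          // After the patch: per collected doc, a virtual call on the NormFactors abstraction.
          score *= normFactors.getFactor(doc);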

        The biggest advantage of this to my eye is the removal of fakeNorms, but I think those are only rarely used, and even those uses can be eliminated. One can now omit norms when indexing, and, if such a field is searched with a normal query, fakeNorms will be used. But a ConstantScoringQuery on the field should return the same results, and faster too! So the bug to fix is that, when a query is run against a field with omitted norms, it should automatically be rewritten as a ConstantScoringQuery, both for speed and to avoid allocating fakeNorms.

        Finally, a note for other committers: we should try not to deprecate anything in Lucene until we finish removing all of the methods that were deprecated in 1.9, to minimize confusion. Ideally we can avoid having anything deprecated until after 2.0 is out the door.

        tamm Steven Tamm added a comment -

        > I also worry about performance with this change. Have you benchmarked this while searching large indexes?
        Yes, see below.

        > For example, in TermScorer.score(HitCollector, int), Lucene's innermost loop, you change two array accesses into a call to an interface. That could make a substantial difference. Small changes to that method can cause significant performance changes.

        Regarding "you change two array accesses into a call to an interface": I have changed two byte array references (one of which is static) to a method call on an abstract class. I'm using JDK 1.5.0_06. HotSpot inlines both calls and performance was about the same with a 1M doc index (we have a low term/doc ratio, so we have about 8.5M terms). HPROF doesn't even see the call to Similarity.decodeNorm. If I were using JDK 1.3, I'd probably agree with you, but HotSpot is very good at figuring this stuff out and auto-inlining the calls.

        As for the numbers: an average request returning 5000 hits from our 0.5G index was at ~485ms average on my box before. It's now at ~480ms. (50 runs each). Most of that is overhead, granted.

        The increase in performance may be obscured by my other change in TermScorer (LUCENE-502). I'm not sure of the history of TermScorer, but it seems heavily optimized for a large # terms/document. We have a low # terms/document, so performance suffers greatly. Performance was dramatically improved by not unnecessarily caching things. TermScorer seems to be heavily optimized for a non-modern VM (e.g. inlining next() into score(), caching the result of Math.sqrt for each term being queried, having a doc/freq cache that provides no benefit unless iterating backwards, etc.). The total of the TermScorer changes brought the average down from ~580ms.

        Since we use a lot of large indexes and don't keep them in memory all that often, our performance increases dramatically due to the reduction in GC overhead. As we move to not actually storing the Norms array in memory but instead using the disk, this change will have an even higher benefit. I'm in the process of preparing a set of patches that will help people that don't have long-lived indexes, and this is just one part.

        tamm Steven Tamm added a comment -

        Here's a patch where, if you set LOAD_NORMS_INTO_MEM to false, it will instead load the norms from the disk every time. When combined with LUCENE-508 (the prefetching patch), you can dramatically reduce the amount of memory allocated when you have one query per index.
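
        As a sketch of the idea (DiskNormFactors and the exact seek layout are illustrative, not the patch itself), a NormFactors implementation that reads each norm byte straight from the segment's norms stream could look like:

          class DiskNormFactors extends NormFactors {
            private final IndexInput in;   // clone of the norms stream for this field
            private final long normSeek;   // file offset of the first norm byte

            DiskNormFactors(IndexInput in, long normSeek) {
              this.in = in;
              this.normSeek = normSeek;
            }

            public synchronized byte getByte(int doc) throws IOException {
              in.seek(normSeek + doc);     // norms are one byte per document
              return in.readByte();
            }
          }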

        yseeley@gmail.com Yonik Seeley added a comment -

        > I made the change less for MultiReader, but to prevent the instantiation of the fakeNorms array (which is an extra 1MB of useless memory for us).
        Are you using the omitNorms feature? If not, what is causing the fakeNorms to be allocated?

        yseeley@gmail.com Yonik Seeley added a comment -

        > One can now omit norms when indexing, and, if such a field is searched with a normal query then fakeNorms will be used.
        > But a ConstantScoringQuery of the field should return the same results, and faster too!

        If you mean ConstantScoreQuery, that doesn't currently include tf or idf, so the scores won't match.

        There is a memory-related optimization that can be made in SegmentReader.norms(field, array, offset), though:
        System.arraycopy(fakeNorms(), 0, bytes, offset, maxDoc());
        could be replaced with Arrays.fill
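
        That is, something like the following when the field has no real norms (DEFAULT_NORM stands in for Similarity.encodeNorm(1.0f); this is a sketch, not the actual fix):

          // Before: materializes a maxDoc()-sized fakeNorms() array just to copy it.
          System.arraycopy(fakeNorms(), 0, bytes, offset, maxDoc());
          // After: fills the caller's slice directly, with no temporary array.
          Arrays.fill(bytes, offset, offset + maxDoc(), DEFAULT_NORM);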

        tamm Steven Tamm added a comment -

        We're still using TermScorer, which generates the fakeNorms() regardless of whether omitNorms is on or off. ConstantTermScorer is a step in the right direction, but you already said what I was going to say about it.

        Specifically, we have one field where we want norms, and one field where we don't. As you can see from the "LazyNorms" patch, the big bang for the buck for us (besides optimizing TermScorer.score()) was to lazily load the norms from the disk, since we don't keep the indexes in memory. If you are only ever scoring 3 docs, why should it load a 1MB array (or, in the case of TermScorer, a 1MB array from disk and then an empty 1MB array)? A better ConstantScoreQuery would also work, but only if the caller knew which fields had omitNorms on or off.

        yseeley@gmail.com Yonik Seeley added a comment -

        >We're still using TermScorer, which generates the fakeNorms() regardless of omitNorms on or off.

        Let me focus on that point for the moment so we can see if there is a bug in fakeNorms or not.
        If omitNorms is off (the normal behavior of all indexed fields having norms), then fakeNorms won't ever be allocated,
        except in the case of a MultiReader calling norms() on a subreader that doesn't know about that field (because no docs had that field indexed). That allocation of fakeNorms() would also be eliminated by the Arrays.fill() fix I mentioned above.
        I'll look into that after Lucene 2.0 comes out.

        cutting Doug Cutting added a comment -

        It is not clear to me that your uses are typical uses. These optimizations were added because they made big improvements. They were not premature. In some cases JVMs may have evolved so that some of them are no longer required. But some of them may still make significant improvements for lots of users.

        I'd like to see some benchmarks from other applications before we commit big changes to such inner loops.

        tamm Steven Tamm added a comment -

        There was a bug in MultiReader.java where I wasn't handling the caches correctly, specifically in getNormFactors and doSetNorm.

        This is also smaller and updated for 2.0

        markrmiller@gmail.com Mark Miller added a comment -

        From my experience with LUCENE-831, changing array access to method invocation does come with a good-sized penalty (5-15% was what I generally saw, I believe). These are not identical things, but I think they are close enough to take away that there would be a measurable performance hit from such a change. On the other hand, you could do cool norms reopen stuff with the method calls (e.g. have each IndexReader maintain its own norms and have MultiReaders delegate norms calls to their sub-readers)...

        svella Shon Vella added a comment -

        Without this sort of change, searching a large index (think 100 million or more) uses an inordinate amount of heap.

        thetaphi Uwe Schindler added a comment -

        In my opinion the problem with large indexes is more that each SegmentReader has a cache of the last used norms. If you have many fields with norms enabled, the cache grows and is never freed. In my opinion, the cache should be an LRU cache or a WeakHashMap or something like that.
        You can see this problem if you create an index with many fields with norms (I tested with about 4,000 fields) and many documents (half a million). If you then call CheckIndex, which calls norms() for each field in the segment, each of these calls creates a new cache entry and you get OutOfMemoryExceptions after a short time (I tested with the above index: I was not able to do a CheckIndex even with "-Xmx 16GB" on 64-bit Java).

        mikemccand Michael McCandless added a comment -

        > In my opinion the problem with large indexes is more that each SegmentReader has a cache of the last used norms.

        I believe when MultiReader.norms is called (as Doug & Yonik said above), the underlying SegmentReaders do not in fact cache the norms (this is not readily obvious until you scrutinize the code). Ie, it's only MultiReader that caches the full array.

        But I agree there would be good benefits (not creating fakeNorms) to moving away from byte[] for norms. I think an iterator only API might be fine (giving us more freedom on the impl.), though I would worry about performance impact.

        Or we could make a new method to replace norms() that returns null when the field has no norms, and then Scorers that use this API would handle the null correctly. We could fix all core/contribs to use the new API...
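
        For example (illustrative only, not an existing API), a scorer could check once and skip norm decoding entirely:

          byte[] norms = reader.norms(field);  // hypothetical contract: null when the field has no norms
          float raw = getSimilarity().tf(freq) * weightValue;
          return norms == null ? raw : raw * Similarity.decodeNorm(norms[doc]);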

        Also note that with LUCENE-1483, we are moving to searching each segment at a time, so MultiReader.norms should not normally be called, unless it doesn't expose its underlying readers.

        thetaphi Uwe Schindler added a comment - - edited

        > In my opinion the problem with large indexes is more that each SegmentReader has a cache of the last used norms.

        > I believe when MultiReader.norms is called (as Doug & Yonik said above), the underlying SegmentReaders do not in fact cache the norms (this is not readily obvious until you scrutinize the code). Ie, it's only MultiReader that caches the full array.

        In my opinion, this is not correct. I did not use a MultiReader. CheckIndex opens and then tests each segment with a separate SegmentReader. The big index with the OutOfMemory problem was optimized, so it consists of one segment with about half a million docs and about 4,000 fields. Each byte[] array takes about half a MiB for this index. The CheckIndex function created the norms for 4,000 fields and the SegmentReader cached them, which is about 2 GiB of RAM. So OOMs are not unusual.

        The code taken from SegmentReader is here:

          protected synchronized byte[] getNorms(String field) throws IOException {
            Norm norm = (Norm) norms.get(field);
            if (norm == null) return null;  // not indexed, or norms not stored
            synchronized(norm) {
              if (norm.bytes == null) {                     // value not yet read
                byte[] bytes = new byte[maxDoc()];
                norms(field, bytes, 0);
                norm.bytes = bytes;                         // cache it
                // it's OK to close the underlying IndexInput as we have cached the
                // norms and will never read them again.
                norm.close();
              }
              return norm.bytes;
            }
          }
        

        Each reader contains a Map of Norm entries, one per field. The first time the norms for a specific field are read, norm.bytes == null, so the array is loaded and then cached inside this Norm object. It is never freed.

        In my opinion, the best would be to use a Weak- or, better, a SoftReference, so that norm.bytes becomes a java.lang.ref.SoftReference<byte[]> used for caching.
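
        A minimal sketch of that idea, reusing the getNorms() shape quoted above and assuming Norm.bytes is changed from byte[] to SoftReference<byte[]> (not a patch; the real caching code has more state):

          protected synchronized byte[] getNorms(String field) throws IOException {
            Norm norm = (Norm) norms.get(field);
            if (norm == null) return null;                   // not indexed, or norms not stored
            synchronized (norm) {
              byte[] cached = (norm.bytes == null) ? null : norm.bytes.get();
              if (cached == null) {                          // not read yet, or reclaimed by the GC
                cached = new byte[maxDoc()];
                norms(field, cached, 0);
                norm.bytes = new SoftReference<byte[]>(cached);  // cache softly instead of hard
              }
              return cached;
            }
          }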

        I will prepare a patch, should I open a new issue for that? I found this problem yesterday when testing with very large indexes (you may have noticed my mail about removing norms from Trie fields).

        Uwe

        mikemccand Michael McCandless added a comment -

        > CheckIndex opens and then tests each segment with a separate SegmentReader.

        You're right: CheckIndex loads the norms of every field (even those
        w/o norms), and then that memory is not released until that reader is
        closed & unreferenced.

        But that's a different issue than this one (this one is about MultiReader).

        Can you open a new issue? I don't think Soft/WeakReference is the right
        solution (they give us little control on when the cache is evicted); we
        could do something first specifically for CheckIndex (eg it could
        simply use the 3-arg non-caching bytes method instead) to prevent OOM
        errors when using it.

        thetaphi Uwe Schindler added a comment -

        Mike: I created new issue LUCENE-1520 and added some remarks to your last message, too.

        thetaphi Uwe Schindler added a comment -

        Since Lucene 2.9 we search each segment separately, so MultiReader's norms cache would never be used, except in custom code that calls norms() on the MultiReader/DirectoryReader. Since Lucene 4.0 this is also no longer allowed; non-atomic readers don't support norms. If you still need to get global norms, you can use MultiNorms, but that is discouraged.

        See also: LUCENE-2771


          People

          • Assignee: Unassigned
          • Reporter: tamm Steven Tamm
          • Votes: 3
          • Watchers: 4
