LUCENENET-488 (Lucene.Net)

Can't open IndexReader, get OutOfMemory exception

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: Lucene.Net 2.9.4g
    • Fix Version/s: None
    • Component/s: Lucene.Net Core
    • Labels:
      None
    • Environment:

      Windows Server 2008 R2

      Description

      Have built a large index with ~1bn records (2 items per document); it is 200 GB on disk. I managed to write the index by chunking into 100,000-record blocks, as I ended up with some threading issues (another bug submission). Anyway, the index is built, but I can't open it: I get an out-of-memory exception (Process Explorer shows about 1.5 GB allocated before the process dies; not sure how reliable that is, but I do know there is plenty more RAM left on the box).
      Stack trace below:

      System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
         at Lucene.Net.Index.TermInfosReader..ctor(Directory dir, String seg, FieldInfos fis, Int32 readBufferSize, Int32 indexDivisor)
         at Lucene.Net.Index.SegmentReader.CoreReaders..ctor(SegmentReader origInstance, Directory dir, SegmentInfo si, Int32 readBufferSize, Int32 termsIndexDivisor)
         at Lucene.Net.Index.SegmentReader.Get(Boolean readOnly, Directory dir, SegmentInfo si, Int32 readBufferSize, Boolean doOpenStores, Int32 termInfosIndexDivisor)
         at Lucene.Net.Index.SegmentReader.Get(Boolean readOnly, SegmentInfo si, Int32 termInfosIndexDivisor)
         at Lucene.Net.Index.DirectoryReader..ctor(Directory directory, SegmentInfos sis, IndexDeletionPolicy deletionPolicy, Boolean readOnly, Int32 termInfosIndexDivisor)
         at Lucene.Net.Index.DirectoryReader.<>c__DisplayClass1.<Open>b__0(String segmentFileName)
         at Lucene.Net.Index.SegmentInfos.FindSegmentsFile.Run(IndexCommit commit)
         at Lucene.Net.Index.DirectoryReader.Open(Directory directory, IndexDeletionPolicy deletionPolicy, IndexCommit commit, Boolean readOnly, Int32 termInfosIndexDivisor)
         at Lucene.Net.Index.IndexReader.Open(String path, Boolean readOnly)
         at Lucene.Net.Demo.SearchFiles.Main(String[] args)

        Activity

        Prescott Nasser added a comment -

        Intended - see comments by Sven for more details

        Steven added a comment -

        I do agree, and the info above is very helpful. I guess now at least others will find the issue and its resolution should they be faced with the same issue so thanks for your help.

        Simon Svensson added a comment -

        The following may be off since I don't know the inner technical workings of Lucene.Net.

        All terms in your index are read into an in-memory term index when opening an IndexReader. The termInfosIndexDivisor tells the IndexReader instance to read only every n-th term into this index. The default value, 1, causes every term to be loaded into memory. Using termInfosIndexDivisor=2 means you'll read every second term into memory, theoretically halving the required memory. Your value, 10, would consume only a tenth of the memory compared to termInfosIndexDivisor=1.

        This comes at a price: as 9 out of 10 terms are no longer cached in memory, they take longer to retrieve. This happens in many cases, such as a new TermQuery("f", "test"). Lucene needs to seek to the nearest indexed term, then iterate forward until it matches the correct term. If "teargas" were the indexed term, the scan could be: teargas > technicians > tegument > teleconference > temporal > tenotomy > teocalli > terbium > test. Instead of being able to seek directly to the term, we now seek to a term before it and iterate the list for another 8 terms. (It would still go faster than the time it took me to find odd example words...)

        I've never measured this, but I doubt that low divisor values will cause much trouble. Any term except "teargas" would need to read its term information from disk, and that disk read will [probably] end up in the file system cache. I can see a problem if the divisor is high enough to cause a second disk read, but at what value of termInfosIndexDivisor this happens is system-dependent. The size of the disk reads, the amount of data per term, etc., would all affect it. I guess you could use a low-level monitoring tool (Process Monitor?) to watch every read if you really want to find the "perfect" number.

        I believe this bug report can be closed as invalid; it was a case of default values that did not work out for 200 GiB indexes. Do you agree on this, Steven?
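        The mechanism Simon describes can be sketched outside Lucene. Below is a small, hypothetical Python model (not Lucene.Net code) of an index divisor: keep every n-th sorted term in memory, then answer a lookup by binary-searching the in-memory index and scanning forward through the on-disk list, as in the "teargas" example above.

```python
import bisect

# All terms as stored on disk, in sorted order (Lucene's term dictionary is sorted).
disk_terms = ["teargas", "technicians", "tegument", "teleconference",
              "temporal", "tenotomy", "teocalli", "terbium", "test"]

def build_memory_index(divisor):
    """Keep every `divisor`-th term (and its disk position) in memory."""
    return [(term, pos) for pos, term in enumerate(disk_terms) if pos % divisor == 0]

def lookup(target, divisor):
    """Binary-search the in-memory index, then scan forward on 'disk'.

    Returns (disk_position, extra_terms_scanned) or None if absent.
    """
    mem = build_memory_index(divisor)
    keys = [t for t, _ in mem]
    i = bisect.bisect_right(keys, target) - 1  # greatest indexed term <= target
    if i < 0:
        return None
    pos = mem[i][1]
    scans = 0
    while pos < len(disk_terms) and disk_terms[pos] < target:
        pos += 1
        scans += 1
    if pos < len(disk_terms) and disk_terms[pos] == target:
        return pos, scans
    return None

print(lookup("test", 1))  # (8, 0): direct hit, every term is in memory
print(lookup("test", 9))  # (8, 8): seek to "teargas", scan 8 more terms
```

        With divisor 1 the lookup lands directly on "test"; with divisor 9 it seeks to "teargas" and scans the remaining 8 terms, trading memory for extra reads, which is exactly the price described above.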

        Steven added a comment -

        Hi Simon, thanks very much. I set the option to 10 (I have no idea what that means, but it works): the reader opens in about 4 seconds, and the search is still hugely impressive (300 ms to search through 1bn records and return the first 10).
        I will try to build a native 64-bit version on the server itself (my development box is only 32-bit, which might be the problem) and let you know how I get on.
        Thanks again. I can't believe you guys do this for free; I pay millions for products that aren't anywhere near as good!

        Simon Svensson added a comment -

        The 1.5 GiB limit sounds like you're running a 32-bit application. Is that correct?

        Does it work if you call the overload of IndexReader.Open which accepts a termInfosIndexDivisor directly? (You can pass null for the deletion policy to use the default one.) The default termInfosIndexDivisor is one; increasing it decreases the amount of memory required. This will slow down some term-related operations against the index, but that sounds better than not being able to open it at all.

        There is some information about what data is loaded into memory at http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html
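        For reference, the call Simon suggests might look like the following C# sketch. It assumes the Lucene.Net 2.9.4-era overload IndexReader.Open(Directory, IndexDeletionPolicy, bool, int) - verify the signature against your build - and the index path is a placeholder.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Store;

// Hypothetical sketch: open a reader with termInfosIndexDivisor = 10 so only
// every 10th term is loaded into the in-memory term index (roughly 1/10 the RAM).
var dir = FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\path\to\index"));
var reader = IndexReader.Open(
    dir,
    null,   // IndexDeletionPolicy: null selects the default deletion policy
    true,   // readOnly
    10);    // termInfosIndexDivisor: keep every 10th term in memory
```

        The trade-off is the one discussed above: opening the reader needs far less memory, while some term lookups pay a short extra scan on disk.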


          People

          • Assignee: Unassigned
          • Reporter: Steven
          • Votes: 0
          • Watchers: 2
